[jira] [Commented] (TIKA-4194) tika fails to detect certain pkcs12 keystores types p12 pfx
[ https://issues.apache.org/jira/browse/TIKA-4194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816199#comment-17816199 ]

Tim Allison commented on TIKA-4194:
---

Right, sorry, misunderstood, here's the magic: https://github.com/apache/tika/blob/7d48d00ac1febfb1ac70e4887268b28fb4951b78/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L5236

> tika fails to detect certain pkcs12 keystores types p12 pfx
> ---
>
> Key: TIKA-4194
> URL: https://issues.apache.org/jira/browse/TIKA-4194
> Project: Tika
> Issue Type: Bug
> Components: detector
> Affects Versions: 2.9.1
> Reporter: Lonzak
> Priority: Major
>
> We use tika to detect the type of a file which is uploaded. In most cases
> this works quite well. However recently some files were rejected because tika
> reports an invalid file type. We'll get
> {code:java}
> APPLICATION/OCTET-STREAM{code}
> instead of
> {code:java}
> APPLICATION/X-X509-KEY{code}
> I did an analysis and found that tika doesn't recognize certain types of
> pkcs12 keystores. The test keystores can be found
> [here|https://github.com/redhat-qe-security/keyfile-corpus/tree/master].
> I created a list to show which ones are affected. Out of 157 keystores 132
> are correctly detected and 25 are not.
> > ||#||correct?||type||filename|| > |1|OK|APPLICATION/X-X509-KEY; > FORMAT=DER|dsa(1024,sha1),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12| > |2|OK|APPLICATION/X-X509-KEY; > FORMAT=DER|dsa(1024,sha1),cert(pbeWithSHAAnd40BitRC2-CBC,salt(8),iter(2048)),key(pbeWithSHAAnd3-KeyTripleDES-CBC,salt(8),iter(2048)),mac(sha1,salt(8),iter(2048)),pass(ascii).p12| > |3|OK|APPLICATION/X-X509-KEY; > FORMAT=DER|ecdsa(P-256,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12| > |4|OK|APPLICATION/X-X509-KEY; > FORMAT=DER|ecdsa(P-256,sha256),cert(none),key(none),mac(sha1,salt(8),iter(2048)),pass(ascii).p12| > |5|OK|APPLICATION/X-X509-KEY; > FORMAT=DER|ecdsa(P-256,sha256),cert(none),key(none).p12| > |6|OK|APPLICATION/X-X509-KEY; > FORMAT=DER|ecdsa(P-256,sha256),cert(pbeWithSHAAnd40BitRC2-CBC,salt(8),iter(2048)),key(pbeWithSHAAnd3-KeyTripleDES-CBC,salt(8),iter(2048)),mac(sha1,salt(8),iter(2048)),pass(ascii).p12| > |7|OK|APPLICATION/X-X509-KEY; > FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(0),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12| > |8|OK|APPLICATION/X-X509-KEY; > FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(0),iter(2048),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12| > |9|OK|APPLICATION/X-X509-KEY; > FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(16),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12| > |10|OK|APPLICATION/X-X509-KEY; > FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(16),iter(2048),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12| > |11|OK|APPLICATION/X-X509-KEY; > 
FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(64),iter(100),keyLen(default),prf(hmacWithSHA512)),aes-256-cbc(IV(16,mac(sha512,salt(64),iter(100)),pass(ascii).p12| > |12|OK|APPLICATION/X-X509-KEY; > FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(1),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12| > |13|OK|APPLICATION/X-X509-KEY; > FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(1),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12| > |14|OK|APPLICATION/X-X509-KEY; > FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(100),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12| > |15|OK|APPLICATION/X-X509-KEY; > FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(100),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12| > |16|OK|APPLICATION/X-X509-KEY; > FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(16),prf(default)),rc2-cbc(keyBits(56=128bit),IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12| > |17|OK|APPLICATION/X-X509-KEY; > FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(16),prf(hmacWithSHA256)),rc2-cbc(keyBits(56=128bit),IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12| > |18|OK|APPLICATION/X-X509-KEY; > FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(5),prf(default)),rc2-cbc(keyBits(160=40bit),IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
[jira] [Commented] (TIKA-3784) Detector returns "application/x-x509-key" when scanning a .p12 file
[ https://issues.apache.org/jira/browse/TIKA-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816198#comment-17816198 ] Tim Allison commented on TIKA-3784: --- I've just attached a dump of running that on all the *.p12 files in the repo mentioned on TIKA-4194 > Detector returns "application/x-x509-key" when scanning a .p12 file > --- > > Key: TIKA-3784 > URL: https://issues.apache.org/jira/browse/TIKA-3784 > Project: Tika > Issue Type: Bug > Components: detector >Affects Versions: 1.26 >Reporter: Matthias Hofbauer >Priority: Critical > Attachments: dump_p12s.txt > > > We are using tika to check if the MIME type of the file extensions matches > with the MIME type of the file content. > After our upgrade from tika-core 1.22 to 1.26 our logic does not work anymore > for certificates of type .p12, .pfx, .cer, .der. > For the .p12 and .pfx extension the MIME type is "application/x-pkcs12" but > the tika detector returns "application/x-x509-key" instead. > After checking the tika-mimetype.xml and comparing it to my .p12 file I found > the following MIME magic which explains why I got these types back. > {code:xml} > > > > > > > mask="0x00FC" offset="0"/> > mask="0xFC" offset="0"/> > > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-3784) Detector returns "application/x-x509-key" when scanning a .p12 file
[ https://issues.apache.org/jira/browse/TIKA-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-3784: -- Attachment: dump_p12s.txt > Detector returns "application/x-x509-key" when scanning a .p12 file > --- > > Key: TIKA-3784 > URL: https://issues.apache.org/jira/browse/TIKA-3784 > Project: Tika > Issue Type: Bug > Components: detector >Affects Versions: 1.26 >Reporter: Matthias Hofbauer >Priority: Critical > Attachments: dump_p12s.txt > > > We are using tika to check if the MIME type of the file extensions matches > with the MIME type of the file content. > After our upgrade from tika-core 1.22 to 1.26 our logic does not work anymore > for certificates of type .p12, .pfx, .cer, .der. > For the .p12 and .pfx extension the MIME type is "application/x-pkcs12" but > the tika detector returns "application/x-x509-key" instead. > After checking the tika-mimetype.xml and comparing it to my .p12 file I found > the following MIME magic which explains why I got these types back. > {code:xml} > > > > > > > mask="0x00FC" offset="0"/> > mask="0xFC" offset="0"/> > > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (TIKA-3784) Detector returns "application/x-x509-key" when scanning a .p12 file
[ https://issues.apache.org/jira/browse/TIKA-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816193#comment-17816193 ] Tim Allison edited comment on TIKA-3784 at 2/9/24 7:27 PM: --- ObjectIdentifiers? 1.2.840.113549.1.7.1 (https://oidref.com/1.2.840.113549.1.7.1), but that's just pkcs7 data type, nothing specific to pkcs12? Possibly useful? https://learn.microsoft.com/en-us/windows/win32/api/wincrypt/ns-wincrypt-crypt_algorithm_identifier was (Author: talli...@mitre.org): ObjectIdentifiers? 1.2.840.113549.1.7.1 (https://oidref.com/1.2.840.113549.1.7.1), but that's just pkcs7 data type, nothing specific to pkcs12? > Detector returns "application/x-x509-key" when scanning a .p12 file > --- > > Key: TIKA-3784 > URL: https://issues.apache.org/jira/browse/TIKA-3784 > Project: Tika > Issue Type: Bug > Components: detector >Affects Versions: 1.26 >Reporter: Matthias Hofbauer >Priority: Critical > > We are using tika to check if the MIME type of the file extensions matches > with the MIME type of the file content. > After our upgrade from tika-core 1.22 to 1.26 our logic does not work anymore > for certificates of type .p12, .pfx, .cer, .der. > For the .p12 and .pfx extension the MIME type is "application/x-pkcs12" but > the tika detector returns "application/x-x509-key" instead. > After checking the tika-mimetype.xml and comparing it to my .p12 file I found > the following MIME magic which explains why I got these types back. > {code:xml} > > > > > > > mask="0x00FC" offset="0"/> > mask="0xFC" offset="0"/> > > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-3784) Detector returns "application/x-x509-key" when scanning a .p12 file
[ https://issues.apache.org/jira/browse/TIKA-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816193#comment-17816193 ] Tim Allison commented on TIKA-3784: --- ObjectIdentifiers? 1.2.840.113549.1.7.1 (https://oidref.com/1.2.840.113549.1.7.1), but that's just pkcs7 data type, nothing specific to pkcs12? > Detector returns "application/x-x509-key" when scanning a .p12 file > --- > > Key: TIKA-3784 > URL: https://issues.apache.org/jira/browse/TIKA-3784 > Project: Tika > Issue Type: Bug > Components: detector >Affects Versions: 1.26 >Reporter: Matthias Hofbauer >Priority: Critical > > We are using tika to check if the MIME type of the file extensions matches > with the MIME type of the file content. > After our upgrade from tika-core 1.22 to 1.26 our logic does not work anymore > for certificates of type .p12, .pfx, .cer, .der. > For the .p12 and .pfx extension the MIME type is "application/x-pkcs12" but > the tika detector returns "application/x-x509-key" instead. > After checking the tika-mimetype.xml and comparing it to my .p12 file I found > the following MIME magic which explains why I got these types back. > {code:xml} > > > > > > > mask="0x00FC" offset="0"/> > mask="0xFC" offset="0"/> > > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-3784) Detector returns "application/x-x509-key" when scanning a .p12 file
[ https://issues.apache.org/jira/browse/TIKA-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816192#comment-17816192 ] Tim Allison commented on TIKA-3784: --- Or verbose mode: {noformat} Sequence Integer(3) Sequence ObjectIdentifier(1.2.840.113549.1.7.1) Tagged [CONTEXT 0] DER Octet String[2603] 30820a273082043d06092a864886f70d010706a082042e3082042a02010030820'0=*H.0*0 042306092a864886f70d01070130819106092a864886f70d01050d3081833062#*H0*H00b 06092a864886f70d01050c305504409230fd49789d0b306756c36540fc618f09 *H0U@0Ix0gVe@a 05fc94d32f934c12312e92b80a737058f3919dc41b024b14c9a3eb45922628b5/L1.spXKE&( 14f013842a44af174b26b7f30dd88402030f4240300c06082a864886f70d020b*DK@0*H 0500301d060960864801650304012a0410121c481cab9a38375a29a8eb4d47030`He*H87Z)MG f380820380b4abd22fed0e82635ae69b52d94ea261ce899cb7ceaa833706d6e6/cZRNa7 6c8a20fccd910e314306f737506a41a83b87db4ada1aec0bf274eccf7743e4b3l 1C7PjA;JtwC f4e939f34d5b61a87521e5ac5fd7a558791d0de123097ccd020cf33e82767403 9M[au!_Xy#|>vt 7058be6836a97220209c99c18997dd4aaaf6376e8daea39e92baacec67b89a14pXh6r J7ng 2b4ac3693ec47b51d8cc61b991da01f35628fdae4d1009dd3c8dbc01c062e706+Ji>{QaV(M/ 1c6aa5834cf8dc0879f2cb1168be6449b53fbd6dc9e62ca8256e6da3a2d19906jLyhdI?m,%nm 5cdbdeaec0999743c9020f154155b717b6b1678bf987794e60608fefd0571ae1\CAUgyN``W 6f8123f30c45d02d19b2d9e854343ab37cd5798d94f63898601a37fd019c9bc8o#E-T4:|y8`7 881f6c6efebf91df9666a5e1b177e3deb8eb1ce4ed42b6a73ec771540d9dd419lnfwB>qT f46d420c1ef3128a550a5cc5f36ef402f72710e8ac6a7661fb7d3387a5f84de5mBU\n'jva}3M 3c1424db71d44382a19b7a70f94e06aff9d2c88b213daaf437f6d524d22524c5 <$qCzpN!=7$%$ cb7d9d6ba2cea9ebcb99dffeb8fb0d9138075e0ef8f12035dabfe289c2d73448}k8^ 54H e815bfb34eecd8d7d77a6beda1b9db71ff083e92afa92e32937c54c492c8847eNzkq>.2|T~ b1244ee5ee0ccbb2fbe0f7bf277cd3ea6c53123e96feea6866d6d8d9af9c4b94$N'|lS>hfK bc07ee4a285459b6a221bf04f7bef730efb108db1a157e9a4d622122f440bf9fJ(TY!0~Mb!"@ 248109387f9362ed59ed44686828f3ff060888bd34a76350aa680ac8b1b7e0c1$8bYDhh(4cPh 
d0c05d2a3ee91a346413511b48362e58dffbeae55789ae179307032e9fabc033 ]*>4dQH6.XW.3 8e2d1a16af308205e206092a864886f70d010701a08205d3048205cf308205cb-0*H0 308205c7060b2a864886f70d010c0a0102a082056c3082056830819106092a860*Hl0h0* 4886f70d01050d308183306206092a864886f70d01050c305504407e483ed9a8H00b*H0U@~H> 2f508286cc2b702fce42d3f49e1212899c2306da11428d0a2d0c5753842afea9/P+p/B#B-WS* 318e126f5a9210a35eef41201ade1621342275bbdc1e1203c5c3ac02030f42401oZ^A !4"uB@ 300c06082a864886f70d020b0500301d060960864801650304012a04105e90cb0*H0`He*^ 77a055e0be074a4e8f80f84aee048204d00e38277e485269d64f955a49ed5247 wUJNJ8'~HRiOZIRG e5a24f6717b07a91d15574d46bd2b442d0a13ab34fec55be179bfdf6eae15eeeOgzUtkB:OU^ 01aa261854275d2d93be49898b759880b8f47249b04d735803b6f16102da8909 ']-IurIMsXa cf74c73172a798a209943e65efec491a3bf09ccf022acdfb1e0bb6fd9a50a1d9t1r>eI;*P bc01c848dbbb7c8e66fa0349ef29445ada763b62427b00fa87bc2846d1b9f690 H|fI)DZv;bB{(F 0066d4a6d2ead07f7b1b83f8a8f7ca69200a45a5f39ecb0476f9cf09d57b2e63f{i Ev{.c da8a0faf0faf1fb0e924716695d9f207f8fc977ba7793e1aa65efe7652fdf83d$qf{y>^vR= 750738ce443365ae96b94f275ce0e13af5f2cb400722053a9f618d56f108314b u8D3eO'\:@":aV1K b7183342c26f9fde3a075d92597c509fd570bd9917d773921445e03eaf00fe24 3Bo:]Y|PpsE>$ 48227e6d4492dc51e975dfde2fac15e8c9b1ee812958bd00200177228c39a0c1H"~mDQu/)X w"9 b8d1b92545153cfd3064dc8d464cfada545d7ed8b96b8cbeb58d4ecb19572633 %E<0dFLT]~kNW&3 82d86f727c4a5cbebf127c9de46e6566e61ef779922903c3df79d0c84743cf25 or|J\|nefy)yGC% e9287c48c8535bf71cf6b0104160a762f403ce7da791e3c38781bcffc1537f16(|HS[A`b}S bd6d9d9a55ac3475d0a82e8aa9d98e10525fc2795ebeef45c9bf50924059b01d mU4u.R_y^EP@Y f715839f37e44969f3ec248f19c030cdd0a693d5749f7686b704d27f1a52ca007Ii$0tvR 98d6e856d4efece298383aecfa82aa19bc98eef9e9f333a656479ffa0dd85c6eV8:3VG\n df0fa6cd82d1c5115a507b9abebfb716d03729cc196d41382890482e0259b9f5ZP{7)mA8(H.Y 4016a1b718ac074ba88aa9a06d2dd6134578caeb7513b64a5e93c4970b8f9460@Km-ExuJ^`
[jira] [Commented] (TIKA-3784) Detector returns "application/x-x509-key" when scanning a .p12 file
[ https://issues.apache.org/jira/browse/TIKA-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816191#comment-17816191 ]

Tim Allison commented on TIKA-3784:
---

[~nick] (and cc [~tom_1st] from TIKA-4194), I agree that parsing these things would probably be best as a container detector.

When I run ASN1Dump on one of the p12 files, I get this:
{noformat}
Sequence
    Integer(3)
    Sequence
        ObjectIdentifier(1.2.840.113549.1.7.1)
        Tagged [CONTEXT 0]
            DER Octet String[2603]
    Sequence
        Sequence
            Sequence
                ObjectIdentifier(2.16.840.1.101.3.4.2.3)
                NULL
            DER Octet String[64]
        DER Octet String[64]
        Integer(100)
{noformat}

Is there anything in there I can use to detect p12?

> Detector returns "application/x-x509-key" when scanning a .p12 file
> ---
>
> Key: TIKA-3784
> URL: https://issues.apache.org/jira/browse/TIKA-3784
> Project: Tika
> Issue Type: Bug
> Components: detector
> Affects Versions: 1.26
> Reporter: Matthias Hofbauer
> Priority: Critical
>
> We are using tika to check if the MIME type of the file extensions matches
> with the MIME type of the file content.
> After our upgrade from tika-core 1.22 to 1.26 our logic does not work anymore
> for certificates of type .p12, .pfx, .cer, .der.
> For the .p12 and .pfx extension the MIME type is "application/x-pkcs12" but
> the tika detector returns "application/x-x509-key" instead.
> After checking the tika-mimetype.xml and comparing it to my .p12 file I found
> the following MIME magic which explains why I got these types back.
> {code:xml}
>
>
>
>
>
>
> mask="0x00FC" offset="0"/>
> mask="0xFC" offset="0"/>
>
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
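One possible reading of the dump in the comment above, offered as a sketch rather than as how Tika does or will detect these files: in the PKCS #12 structure (RFC 7292) a PFX is a SEQUENCE whose first element is the version INTEGER 3, followed by a ContentInfo SEQUENCE that starts with a content-type OID. The dump shows exactly that prefix, with OID 1.2.840.113549.1.7.1 (pkcs7-data). The class and method names below (`Pkcs12Sniffer`, `looksLikePkcs12`, `skipSequenceHeader`) are hypothetical and not part of Tika. The check only looks at the first few bytes, accepts both definite and indefinite SEQUENCE lengths, and would miss a PFX whose ContentInfo is signedData instead of data.

```java
import java.util.Arrays;

final class Pkcs12Sniffer {

    // Encoding of OID 1.2.840.113549.1.7.1 (pkcs7-data): tag 06, length 09, then the value.
    private static final byte[] OID_PKCS7_DATA = {
        0x06, 0x09, 0x2A, (byte) 0x86, 0x48, (byte) 0x86, (byte) 0xF7, 0x0D, 0x01, 0x07, 0x01
    };

    /** Heuristic: SEQUENCE, INTEGER 3, SEQUENCE, OID pkcs7-data at the start of the buffer. */
    static boolean looksLikePkcs12(byte[] head) {
        int pos = skipSequenceHeader(head, 0);              // outer PFX SEQUENCE
        if (pos < 0 || pos + 3 > head.length) {
            return false;
        }
        // version INTEGER: tag 02, length 01, value 03
        if (head[pos] != 0x02 || head[pos + 1] != 0x01 || head[pos + 2] != 0x03) {
            return false;
        }
        pos = skipSequenceHeader(head, pos + 3);            // authSafe ContentInfo SEQUENCE
        if (pos < 0 || pos + OID_PKCS7_DATA.length > head.length) {
            return false;
        }
        byte[] candidate = Arrays.copyOfRange(head, pos, pos + OID_PKCS7_DATA.length);
        return Arrays.equals(candidate, OID_PKCS7_DATA);
    }

    /** Returns the offset just past a SEQUENCE tag and its length octets, or -1 if not a SEQUENCE. */
    private static int skipSequenceHeader(byte[] d, int off) {
        if (off + 2 > d.length || d[off] != 0x30) {
            return -1;
        }
        int len = d[off + 1] & 0xFF;
        if (len <= 0x80) {
            return off + 2;                                 // short form, or 0x80 = indefinite length
        }
        int lengthBytes = len & 0x7F;                       // long form: count of following length bytes
        return lengthBytes > 4 ? -1 : off + 2 + lengthBytes; // assumption: reject implausible headers
    }

    public static void main(String[] args) {
        byte[] indefinite = {
            0x30, (byte) 0x80, 0x02, 0x01, 0x03, 0x30, (byte) 0x80,
            0x06, 0x09, 0x2A, (byte) 0x86, 0x48, (byte) 0x86, (byte) 0xF7, 0x0D, 0x01, 0x07, 0x01
        };
        System.out.println(looksLikePkcs12(indefinite));
    }
}
```

A check like this is stricter than a two-byte magic because an X.509 certificate also begins with `30 82` but continues with another SEQUENCE rather than `02 01 03`. It remains a heuristic; a container detector that actually parses the structure, as suggested in the comment, would be more reliable.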
[jira] [Comment Edited] (TIKA-4194) tika fails to detect certain pkcs12 keystores types p12 pfx
[ https://issues.apache.org/jira/browse/TIKA-4194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816132#comment-17816132 ]

Lonzak edited comment on TIKA-4194 at 2/9/24 5:52 PM:
--

I read a bit [more|https://stackoverflow.com/a/31451808/2311528]. The whole context is ASN.1 DER encoding. So it is not magic bytes but ASN.1 encoding...

"30 82" is followed by two further bytes that give the length of the SEQUENCE explicitly. This allows objects of up to 65535 (0xFFFF) bytes to be encoded.

"30 80", on the other hand, signals the start of a SEQUENCE with an indefinite length. The final length of the SEQUENCE is not specified in advance. Instead, the end of the SEQUENCE is marked by a special end-of-contents (EOC) marker pair "00 00". This encoding method is typically used when the total length of the SEQUENCE is not known at the time of encoding or when it is practical to treat the data as a stream.

To cover both cases, one could define an additional rule or adjust the existing rule to be more flexible. Directly adapting the current rule to include {{0x3080}} could be challenging because the structure and logic behind the length indication and subsequent content are different. Instead, we might need to add a new rule specifically targeting keystores with {{0x3080}}. Note, however, that detecting content with indefinite length is more challenging, as one may not be able to straightforwardly check for a specific byte sequence after {{0x3080}}.
{code:java}
[40/application/x-x509-key; format=der string 0 0x3080??]{code}
In this hypothetical rule, {{??}} stands for a placeholder, as the specific handling for content with indefinite length needs to be adjusted, possibly by implementing logic that recognizes the end of the stream instead of relying on fixed byte patterns.

was (Author: tom_1st):
I read a bit more. The whole context is ASN.1 DER encoding.

"30 82" is followed by two further bytes that give the length of the SEQUENCE explicitly. This allows objects of up to 65535 (0xFFFF) bytes to be encoded.

"30 80", on the other hand, signals the start of a SEQUENCE with an indefinite length. The final length of the SEQUENCE is not specified in advance. Instead, the end of the SEQUENCE is marked by a special end-of-contents (EOC) marker pair "00 00". This encoding method is typically used when the total length of the SEQUENCE is not known at the time of encoding or when it is practical to treat the data as a stream.

To cover both cases, one could define an additional rule or adjust the existing rule to be more flexible. Directly adapting the current rule to include {{0x3080}} could be challenging because the structure and logic behind the length indication and subsequent content are different. Instead, we might need to add a new rule specifically targeting keystores with {{0x3080}}. Note, however, that detecting content with indefinite length is more challenging, as one may not be able to straightforwardly check for a specific byte sequence after {{0x3080}}.
{code:java}
[40/application/x-x509-key; format=der string 0 0x3080??]{code}
In this hypothetical rule, {{??}} stands for a placeholder, as the specific handling for content with indefinite length needs to be adjusted, possibly by implementing logic that recognizes the end of the stream instead of relying on fixed byte patterns.

> tika fails to detect certain pkcs12 keystores types p12 pfx
> ---
>
> Key: TIKA-4194
> URL: https://issues.apache.org/jira/browse/TIKA-4194
> Project: Tika
> Issue Type: Bug
> Components: detector
> Affects Versions: 2.9.1
> Reporter: Lonzak
> Priority: Major
>
> We use tika to detect the type of a file which is uploaded. In most cases
> this works quite well. However recently some files were rejected because tika
> reports an invalid file type. We'll get
> {code:java}
> APPLICATION/OCTET-STREAM{code}
> instead of
> {code:java}
> APPLICATION/X-X509-KEY{code}
> I did an analysis and found that tika doesn't recognize certain types of
> pkcs12 keystores. The test keystores can be found
> [here|https://github.com/redhat-qe-security/keyfile-corpus/tree/master].
> I created a list to show which ones are affected. Out of 157 keystores 132
> are correctly detected and 25 are not.
>
> ||#||correct?||type||filename||
> |1|OK|APPLICATION/X-X509-KEY;
> FORMAT=DER|dsa(1024,sha1),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |2|OK|APPLICATION/X-X509-KEY;
[jira] [Commented] (TIKA-4188) Add support for ARC files
[ https://issues.apache.org/jira/browse/TIKA-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816165#comment-17816165 ] Hudson commented on TIKA-4188: -- SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1502 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1502/]) TIKA-4188 (#1587) (github: [https://github.com/apache/tika/commit/7d48d00ac1febfb1ac70e4887268b28fb4951b78]) * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pkg-module/src/main/java/org/apache/tika/detect/gzip/GZipSpecializationDetector.java * (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-webarchive-module/src/test/resources/test-documents/testARC.arc * (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-webarchive-module/src/test/resources/test-documents/example.arc.gz * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-webarchive-module/src/main/java/org/apache/tika/parser/warc/WARCParser.java * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-webarchive-module/src/test/java/org/apache/tika/parser/warc/WARCParserTest.java > Add support for ARC files > - > > Key: TIKA-4188 > URL: https://issues.apache.org/jira/browse/TIKA-4188 > Project: Tika > Issue Type: Improvement >Reporter: Gregory Lepore >Priority: Minor > Fix For: 3.0.0 > > > The original version of the Internet Archive's storage format is the ARC > format (later superseded by WARC and WACZ). > The ARC (Archive) format is a file format used for storing web archives. It > was developed by the Internet Archive to facilitate the mass storage of web > pages, capturing the content as it appeared on the Internet at specific > points in time. An ARC file is a single, large file that contains a sequence > of archived web resources. 
Each entry in an ARC file includes the URL of the > resource, the date it was captured, the HTTP response headers, and the > content of the resource itself (such as HTML pages, images, and other media > types). > The structure of an ARC file generally consists of a file header followed by > a series of records, each representing an individual web resource. The ARC > file can be gzipped using a two step process where each record in the ARC > file is gzipped, and then the entire file is gzipped. > The original ARC format specification is here: > [https://archive.org/web/researcher/ArcFileFormat.php] > The WARC format is currently supported via jwarc, which also appears to have > support for the ARC format (https://github.com/iipc/jwarc) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4194) tika fails to detect certain pkcs12 keystores types p12 pfx
[ https://issues.apache.org/jira/browse/TIKA-4194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816132#comment-17816132 ]

Lonzak commented on TIKA-4194:
--

I read a bit more. The whole context is ASN.1 DER encoding.

"30 82" is followed by two further bytes that give the length of the SEQUENCE explicitly. This allows objects of up to 65535 (0xFFFF) bytes to be encoded.

"30 80", on the other hand, signals the start of a SEQUENCE with an indefinite length. The final length of the SEQUENCE is not specified in advance. Instead, the end of the SEQUENCE is marked by a special end-of-contents (EOC) marker pair "00 00". This encoding method is typically used when the total length of the SEQUENCE is not known at the time of encoding or when it is practical to treat the data as a stream.

To cover both cases, one could define an additional rule or adjust the existing rule to be more flexible. Directly adapting the current rule to include {{0x3080}} could be challenging because the structure and logic behind the length indication and subsequent content are different. Instead, we might need to add a new rule specifically targeting keystores with {{0x3080}}. Note, however, that detecting content with indefinite length is more challenging, as one may not be able to straightforwardly check for a specific byte sequence after {{0x3080}}.
{code:java}
[40/application/x-x509-key; format=der string 0 0x3080??]{code}
In this hypothetical rule, {{??}} stands for a placeholder, as the specific handling for content with indefinite length needs to be adjusted, possibly by implementing logic that recognizes the end of the stream instead of relying on fixed byte patterns.
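The two length forms described in this comment can be sketched in a few lines of Java. This is only an illustration of the ASN.1 length-octet rules (X.690), not Tika code; the names `Asn1LengthForm`, `classify`, and `headerLength` are made up for the example. One detail worth adding: the indefinite form (`30 80`) is permitted by BER but not by strict DER, which always uses definite lengths.

```java
final class Asn1LengthForm {

    enum Form { SHORT, LONG_DEFINITE, INDEFINITE, INVALID }

    /** Classifies the first length octet that follows an ASN.1 tag byte. */
    static Form classify(int lengthOctet) {
        int b = lengthOctet & 0xFF;
        if (b < 0x80) {
            return Form.SHORT;            // the octet itself is the length (0..127)
        }
        if (b == 0x80) {
            return Form.INDEFINITE;       // content runs until the end-of-contents octets 00 00
        }
        if (b == 0xFF) {
            return Form.INVALID;          // reserved value, must not be used
        }
        return Form.LONG_DEFINITE;        // low 7 bits = number of length bytes that follow
    }

    /** Size in bytes of a SEQUENCE header (tag plus length octets) at offset, or -1 if none. */
    static int headerLength(byte[] data, int offset) {
        if (offset + 1 >= data.length || (data[offset] & 0xFF) != 0x30) {
            return -1;
        }
        int b = data[offset + 1] & 0xFF;
        switch (classify(b)) {
            case SHORT:
            case INDEFINITE:
                return 2;
            case LONG_DEFINITE:
                return 2 + (b & 0x7F);
            default:
                return -1;
        }
    }

    public static void main(String[] args) {
        byte[] working = {0x30, (byte) 0x82, 0x10, 0x29, 0x02, 0x01, 0x03};
        byte[] nonWorking = {0x30, (byte) 0x80, 0x02, 0x01, 0x03};
        System.out.println(headerLength(working, 0));     // 4
        System.out.println(headerLength(nonWorking, 0));  // 2
    }
}
```

The header sizes differ by two bytes, so the `02 01 03` that follows the outer SEQUENCE sits at offset 4 in the definite-length case and at offset 2 in the indefinite-length case. A magic rule with a fixed offset can therefore only cover one of the two forms.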
> tika fails to detect certain pkcs12 keystores types p12 pfx > --- > > Key: TIKA-4194 > URL: https://issues.apache.org/jira/browse/TIKA-4194 > Project: Tika > Issue Type: Bug > Components: detector >Affects Versions: 2.9.1 >Reporter: Lonzak >Priority: Major > > We use tika to detect the type of a file which is uploaded. In most cases > this works quite well. However recently some files were rejected because tika > reports an invalid file type. We'll get > {code:java} > APPLICATION/OCTET-STREAM{code} > instead of > {code:java} > APPLICATION/X-X509-KEY{code} > I did an analysis and found that tika doesn't recognize certain types of > pkcs12 keystores. The test keystores can be found > [here|https://github.com/redhat-qe-security/keyfile-corpus/tree/master]. > I created a list to show which ones are effected. Out of 157 keystores 132 > are correctly detected and 25 are not. > > ||#||correct?||type||filename|| > |1|OK|APPLICATION/X-X509-KEY; > FORMAT=DER|dsa(1024,sha1),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12| > |2|OK|APPLICATION/X-X509-KEY; > FORMAT=DER|dsa(1024,sha1),cert(pbeWithSHAAnd40BitRC2-CBC,salt(8),iter(2048)),key(pbeWithSHAAnd3-KeyTripleDES-CBC,salt(8),iter(2048)),mac(sha1,salt(8),iter(2048)),pass(ascii).p12| > |3|OK|APPLICATION/X-X509-KEY; > FORMAT=DER|ecdsa(P-256,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12| > |4|OK|APPLICATION/X-X509-KEY; > FORMAT=DER|ecdsa(P-256,sha256),cert(none),key(none),mac(sha1,salt(8),iter(2048)),pass(ascii).p12| > |5|OK|APPLICATION/X-X509-KEY; > FORMAT=DER|ecdsa(P-256,sha256),cert(none),key(none).p12| > |6|OK|APPLICATION/X-X509-KEY; > FORMAT=DER|ecdsa(P-256,sha256),cert(pbeWithSHAAnd40BitRC2-CBC,salt(8),iter(2048)),key(pbeWithSHAAnd3-KeyTripleDES-CBC,salt(8),iter(2048)),mac(sha1,salt(8),iter(2048)),pass(ascii).p12| > |7|OK|APPLICATION/X-X509-KEY; > 
FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(0),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12| > |8|OK|APPLICATION/X-X509-KEY; > FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(0),iter(2048),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12| > |9|OK|APPLICATION/X-X509-KEY; > FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(16),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12| > |10|OK|APPLICATION/X-X509-KEY; > FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(16),iter(2048),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12| > |11|OK|APPLICATION/X-X509-KEY; > FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(64),iter(100),keyLen(default),prf(hmacWithSHA512)),aes-256-cbc(IV(16,mac(sha512,salt(64),iter(100)),pass(ascii).p12| > |12|OK|APPLICATION/X-X509-KEY; >
[jira] [Comment Edited] (TIKA-4194) tika fails to detect certain pkcs12 keystores types p12 pfx
[ https://issues.apache.org/jira/browse/TIKA-4194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816122#comment-17816122 ]

Lonzak edited comment on TIKA-4194 at 2/9/24 4:15 PM:
--

I did investigate a bit further (however, my knowledge in this area is quite limited):

Tika is indeed looking at the bytes. A working keystore has the following "Magic" matcher:

[40/application/x-x509-key; format=der string 0 0x3082020100 0xFC]

If I open that file in a hex editor I can see:
{code:java}
0x 30 82 10 29 02 01 03 30 82 (bytes from a working keystore)
0x 30 82 FF FF 02 01 00 (magic bytes from the Magic class)
{code}
This seems to match except for the FF and last 00 values. (Maybe these bytes are ignored?)

If I open a non-working one I get:
{code:java}
0x 30 80 02 01 03 30 80 (bytes from a non-working keystore)
0x 30 82 FF FF 02 01 00 (magic bytes from the Magic class){code}
So the second byte is different, which I would guess is why it does not match. But the bytes also seem to be shifted?
{code:java}
0x 30 80 02 01 03 30 80 (bytes from a non-working keystore)
0x 30 82 10 29 02 01 03 30 82 (bytes from a working keystore)
0x 30 82 FF FF 02 01 00 (magic bytes from the Magic class){code}
So an approach could be to add the missing magic bytes to an existing/new Magic class? So maybe a matcher {{magic=0x3080FF3080}} would work?

was (Author: tom_1st):
I did investigate a bit further (however, my knowledge in this area is quite limited):

Tika is indeed looking at the bytes. A working keystore has the following "Magic" matcher:

[40/application/x-x509-key; format=der string 0 0x3082020100 0xFC]

If I open that file in a hex editor I can see:
{code:java}
0x 30 82 10 29 02 01 03 30 82 (bytes from a working keystore)
0x 30 82 FF FF 02 01 00 (magic bytes from the Magic class)
{code}
This seems to match except for the FF and last 00 values. (Maybe these bytes are ignored?)

If I open a non-working one I get:
{code:java}
0x 30 80 02 01 03 30 80 (bytes from a non-working keystore)
0x 30 82 FF FF 02 01 00 (magic bytes from the Magic class){code}
So the second byte is different, which I would guess is why it does not match. But the bytes also seem to be shifted?
{code:java}
0x 30 80 02 01 03 30 80 (bytes from a non-working keystore)
0x 30 82 10 29 02 01 03 30 82 (bytes from a working keystore)
0x 30 82 FF FF 02 01 00 (magic bytes from the Magic class){code}
So an approach could be to add the missing magic bytes to an existing/new Magic class?

> tika fails to detect certain pkcs12 keystores types p12 pfx
> ---
>
> Key: TIKA-4194
> URL: https://issues.apache.org/jira/browse/TIKA-4194
> Project: Tika
> Issue Type: Bug
> Components: detector
> Affects Versions: 2.9.1
> Reporter: Lonzak
> Priority: Major
>
> We use tika to detect the type of a file which is uploaded. In most cases
> this works quite well. However recently some files were rejected because tika
> reports an invalid file type. We'll get
> {code:java}
> APPLICATION/OCTET-STREAM{code}
> instead of
> {code:java}
> APPLICATION/X-X509-KEY{code}
> I did an analysis and found that tika doesn't recognize certain types of
> pkcs12 keystores. The test keystores can be found
> [here|https://github.com/redhat-qe-security/keyfile-corpus/tree/master].
> I created a list to show which ones are affected. Out of 157 keystores 132
> are correctly detected and 25 are not.
> > ||#||correct?||type||filename|| > |1|OK|APPLICATION/X-X509-KEY; > FORMAT=DER|dsa(1024,sha1),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12| > |2|OK|APPLICATION/X-X509-KEY; > FORMAT=DER|dsa(1024,sha1),cert(pbeWithSHAAnd40BitRC2-CBC,salt(8),iter(2048)),key(pbeWithSHAAnd3-KeyTripleDES-CBC,salt(8),iter(2048)),mac(sha1,salt(8),iter(2048)),pass(ascii).p12| > |3|OK|APPLICATION/X-X509-KEY; > FORMAT=DER|ecdsa(P-256,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12| > |4|OK|APPLICATION/X-X509-KEY; > FORMAT=DER|ecdsa(P-256,sha256),cert(none),key(none),mac(sha1,salt(8),iter(2048)),pass(ascii).p12| > |5|OK|APPLICATION/X-X509-KEY; > FORMAT=DER|ecdsa(P-256,sha256),cert(none),key(none).p12| > |6|OK|APPLICATION/X-X509-KEY; > FORMAT=DER|ecdsa(P-256,sha256),cert(pbeWithSHAAnd40BitRC2-CBC,salt(8),iter(2048)),key(pbeWithSHAAnd3-KeyTripleDES-CBC,salt(8),iter(2048)),mac(sha1,salt(8),iter(2048)),pass(ascii).p12| > |7|OK|APPLICATION/X-X509-KEY; > FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(0),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12| >
[jira] [Commented] (TIKA-4194) tika fails to detect certain pkcs12 keystores types p12 pfx
[ https://issues.apache.org/jira/browse/TIKA-4194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816122#comment-17816122 ] Lonzak commented on TIKA-4194: --
I investigated a bit further (though my knowledge in this area is quite limited). Tika is indeed looking at the bytes. A working keystore is matched by the following "Magic" matcher: [40/application/x-x509-key; format=der string 0 0x3082020100 0xFC]
If I open that file in a hex editor I can see:
{code:java}
0x 30 82 10 29 02 01 03 30 82 (bytes from a working keystore)
0x 30 82 FF FF 02 01 00 (magic bytes from the Magic class)
{code}
This seems to match except for the FF values and the last 00 value. (Maybe these bytes are ignored?) If I open a non-working one I get:
{code:java}
0x 30 80 02 01 03 30 80 (bytes from a non-working keystore)
0x 30 82 FF FF 02 01 00 (magic bytes from the Magic class)
{code}
So the second byte is different, and I would guess that is why it does not match. But the bytes also seem to be shifted:
{code:java}
0x 30 80 02 01 03 30 80 (bytes from a non-working keystore)
0x 30 82 10 29 02 01 03 30 82 (bytes from a working keystore)
0x 30 82 FF FF 02 01 00 (magic bytes from the Magic class)
{code}
So one approach could be to add the missing magic bytes to an existing or new Magic class.
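On the question of whether some bytes are ignored: a masked comparison would explain it. The following is only a minimal sketch of value/mask matching in plain Java, not Tika's actual Magic implementation. The class name MaskedMagicSketch and the mask FF FF 00 00 FF FF FC are assumptions made for illustration (the 0xFC is guessed from the "0xFC" in the matcher string quoted in the comment); the pattern and keystore bytes are the ones from the hex dumps.

```java
public class MaskedMagicSketch {

    /** Parses space-separated hex octets such as "30 82 FF FF" into a byte array. */
    public static byte[] hex(String s) {
        String[] parts = s.trim().split("\\s+");
        byte[] out = new byte[parts.length];
        for (int i = 0; i < parts.length; i++) {
            out[i] = (byte) Integer.parseInt(parts[i], 16);
        }
        return out;
    }

    /**
     * Returns true if data starts with pattern, comparing only the bits
     * that are set in mask. A mask octet of 00 ignores that position.
     */
    public static boolean matches(byte[] data, byte[] pattern, byte[] mask) {
        if (data.length < pattern.length || pattern.length != mask.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if ((data[i] & mask[i]) != (pattern[i] & mask[i])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pattern = hex("30 82 FF FF 02 01 00");
        // Hypothetical mask: ignore octets 2 and 3, ignore the low two bits of the last octet.
        byte[] mask = hex("FF FF 00 00 FF FF FC");
        byte[] working = hex("30 82 10 29 02 01 03 30 82");
        byte[] nonWorking = hex("30 80 02 01 03 30 80");

        System.out.println("working keystore matches:     " + matches(working, pattern, mask));
        System.out.println("non-working keystore matches: " + matches(nonWorking, pattern, mask));
    }
}
```

Under that assumed mask the working keystore matches, because 10 29 is ignored and 03 differs from 00 only in the two masked-out bits, while the non-working one already fails at the second octet (80 instead of 82).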
[jira] [Commented] (TIKA-4188) Add support for ARC files
[ https://issues.apache.org/jira/browse/TIKA-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816121#comment-17816121 ] ASF GitHub Bot commented on TIKA-4188: -- tballison merged PR #1587: URL: https://github.com/apache/tika/pull/1587 > Add support for ARC files > - > > Key: TIKA-4188 > URL: https://issues.apache.org/jira/browse/TIKA-4188 > Project: Tika > Issue Type: Improvement >Reporter: Gregory Lepore >Priority: Minor > > The original version of the Internet Archive's storage format is the ARC > format (later superseded by WARC and WACZ). > The ARC (Archive) format is a file format used for storing web archives. It > was developed by the Internet Archive to facilitate the mass storage of web > pages, capturing the content as it appeared on the Internet at specific > points in time. An ARC file is a single, large file that contains a sequence > of archived web resources. Each entry in an ARC file includes the URL of the > resource, the date it was captured, the HTTP response headers, and the > content of the resource itself (such as HTML pages, images, and other media > types). > The structure of an ARC file generally consists of a file header followed by > a series of records, each representing an individual web resource. The ARC > file can be gzipped using a two step process where each record in the ARC > file is gzipped, and then the entire file is gzipped. > The original ARC format specification is here: > [https://archive.org/web/researcher/ArcFileFormat.php] > The WARC format is currently supported via jwarc, which also appears to have > support for the ARC format (https://github.com/iipc/jwarc) -- This message was sent by Atlassian Jira (v8.20.10#820010)
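Since the description notes that ARC files can be gzipped, a detector would presumably have to look for the gzip wrapper first. The following is a minimal sketch using only the JDK, not the code from PR #1587 and not jwarc's API. The class name GzipMagicSketch and the sample payload are hypothetical; GZIPInputStream.GZIP_MAGIC is the standard JDK constant for the two-byte gzip header.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipMagicSketch {

    /** Returns true if the first two bytes are the gzip magic (1f 8b). */
    public static boolean looksGzipped(byte[] head) {
        if (head.length < 2) {
            return false;
        }
        // GZIP_MAGIC is stored little-endian: 0x8b1f.
        int magic = (head[0] & 0xFF) | ((head[1] & 0xFF) << 8);
        return magic == GZIPInputStream.GZIP_MAGIC;
    }

    /** Compresses raw bytes into a single gzip member. */
    public static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        // Placeholder payload standing in for one archived record.
        byte[] record = "example record bytes".getBytes(StandardCharsets.US_ASCII);
        System.out.println("plain:   " + looksGzipped(record));
        System.out.println("gzipped: " + looksGzipped(gzip(record)));
    }
}
```

A real detector would still need to decompress the start of the stream and inspect the ARC header itself; this only covers the outer wrapper.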
[jira] [Resolved] (TIKA-4188) Add support for ARC files
[ https://issues.apache.org/jira/browse/TIKA-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison resolved TIKA-4188. --- Fix Version/s: 3.0.0 Resolution: Fixed > Add support for ARC files > - > > Key: TIKA-4188 > URL: https://issues.apache.org/jira/browse/TIKA-4188 > Project: Tika > Issue Type: Improvement >Reporter: Gregory Lepore >Priority: Minor > Fix For: 3.0.0 > > > The original version of the Internet Archive's storage format is the ARC > format (later superseded by WARC and WACZ). > The ARC (Archive) format is a file format used for storing web archives. It > was developed by the Internet Archive to facilitate the mass storage of web > pages, capturing the content as it appeared on the Internet at specific > points in time. An ARC file is a single, large file that contains a sequence > of archived web resources. Each entry in an ARC file includes the URL of the > resource, the date it was captured, the HTTP response headers, and the > content of the resource itself (such as HTML pages, images, and other media > types). > The structure of an ARC file generally consists of a file header followed by > a series of records, each representing an individual web resource. The ARC > file can be gzipped using a two step process where each record in the ARC > file is gzipped, and then the entire file is gzipped. > The original ARC format specification is here: > [https://archive.org/web/researcher/ArcFileFormat.php] > The WARC format is currently supported via jwarc, which also appears to have > support for the ARC format (https://github.com/iipc/jwarc) -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] TIKA-4188 [tika]
tballison merged PR #1587: URL: https://github.com/apache/tika/pull/1587 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (TIKA-3784) Detector returns "application/x-x509-key" when scanning a .p12 file
[ https://issues.apache.org/jira/browse/TIKA-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816115#comment-17816115 ] Tim Allison commented on TIKA-3784: --- Based on this: https://stackoverflow.com/questions/30483489/how-to-decode-asn-1-data-in-java If we parsed the ASN1, what would we look for? Does BouncyCastle have a detector? > Detector returns "application/x-x509-key" when scanning a .p12 file > --- > > Key: TIKA-3784 > URL: https://issues.apache.org/jira/browse/TIKA-3784 > Project: Tika > Issue Type: Bug > Components: detector >Affects Versions: 1.26 >Reporter: Matthias Hofbauer >Priority: Critical > > We are using tika to check if the MIME type of the file extensions matches > with the MIME type of the file content. > After our upgrade from tika-core 1.22 to 1.26 our logic does not work anymore > for certificates of type .p12, .pfx, .cer, .der. > For the .p12 and .pfx extension the MIME type is "application/x-pkcs12" but > the tika detector returns "application/x-x509-key" instead. > After checking the tika-mimetype.xml and comparing it to my .p12 file I found > the following MIME magic which explains why I got these types back. > {code:xml} > > > > > > > mask="0x00FC" offset="0"/> > mask="0xFC" offset="0"/> > > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-4194) tika fails to detect certain pkcs12 keystores types p12 pfx
[ https://issues.apache.org/jira/browse/TIKA-4194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816113#comment-17816113 ] Tim Allison commented on TIKA-4194: --- Looks like "30 82" is the magic for DER X.509 certificates? https://en.wikipedia.org/wiki/List_of_file_signatures Maybe this is useful: https://tls12.xargs.org/certificate.html#server-certificate-detail/annotated ?
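For context on "30 82": in ASN.1 BER/DER, 0x30 is the tag for a SEQUENCE, and the next octet encodes the length. 0x82 means "two length octets follow" (definite form), while 0x80 means indefinite length, which BER allows and DER does not. That would account for the apparent shift between the working and non-working keystores reported on this issue, since the 02 01 03 that follows looks like an INTEGER with value 3 in both. The following is a minimal sketch in plain Java that only reads the first header; the class name Asn1HeaderSketch is hypothetical, and this is neither a full ASN.1 parser nor anything Tika or BouncyCastle provides.

```java
public class Asn1HeaderSketch {

    /** True if the length octet after the tag is 0x80 (BER indefinite length). */
    public static boolean isIndefiniteLength(byte[] data) {
        return data.length >= 2 && (data[1] & 0xFF) == 0x80;
    }

    /**
     * Returns the offset of the first content octet after the tag and
     * length octets, or -1 if the input is too short to tell.
     */
    public static int contentOffset(byte[] data) {
        if (data.length < 2) {
            return -1;
        }
        int first = data[1] & 0xFF;
        if (first <= 0x80) {
            // Short form (length fits in one octet) or indefinite form (0x80).
            return 2;
        }
        // Long form: the low seven bits give the number of length octets that follow.
        int lengthOctets = first & 0x7F;
        return (data.length >= 2 + lengthOctets) ? 2 + lengthOctets : -1;
    }

    /** Formats the three octets at the given offset as hex, e.g. "02 01 03". */
    public static String threeOctetsAt(byte[] data, int offset) {
        return String.format("%02X %02X %02X",
                data[offset] & 0xFF, data[offset + 1] & 0xFF, data[offset + 2] & 0xFF);
    }

    public static void main(String[] args) {
        // Header bytes copied from the hex dumps in the TIKA-4194 comments.
        byte[] working = {0x30, (byte) 0x82, 0x10, 0x29, 0x02, 0x01, 0x03, 0x30, (byte) 0x82};
        byte[] nonWorking = {0x30, (byte) 0x80, 0x02, 0x01, 0x03, 0x30, (byte) 0x80};

        for (byte[] header : new byte[][] {working, nonWorking}) {
            int offset = contentOffset(header);
            System.out.println("indefinite=" + isIndefiniteLength(header)
                    + " contentOffset=" + offset
                    + " first content octets=" + threeOctetsAt(header, offset));
        }
    }
}
```

If that reading is right, a fixed-offset magic that expects 02 01 at offset 4 cannot match the indefinite-length files, where the same octets sit at offset 2. Whether a second magic entry or real ASN.1 parsing is the better fix is left open here.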
[jira] [Commented] (TIKA-4194) tika fails to detect certain pkcs12 keystores types p12 pfx
[ https://issues.apache.org/jira/browse/TIKA-4194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816111#comment-17816111 ] Tim Allison commented on TIKA-4194: --- Thank you for opening this! Are you able to take a look here: https://github.com/apache/tika/blob/main/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L4551 And maybe open a PR to update that? It frankly looks like Tika is not even looking at the bytes. Do pkcs12 have a magic we can use for detection?
[jira] [Commented] (TIKA-4188) Add support for ARC files
[ https://issues.apache.org/jira/browse/TIKA-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816095#comment-17816095 ] ASF GitHub Bot commented on TIKA-4188: -- tballison opened a new pull request, #1587: URL: https://github.com/apache/tika/pull/1587 Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! Your help is appreciated! Before opening the pull request, please verify that * there is an open issue on the [Tika issue tracker](https://issues.apache.org/jira/projects/TIKA) which describes the problem or the improvement. We cannot accept pull requests without an issue because the change wouldn't be listed in the release notes. * the issue ID (`TIKA-`) - is referenced in the title of the pull request - and placed in front of your commit messages surrounded by square brackets (`[TIKA-] Issue or pull request title`) * commits are squashed into a single one (or few commits for larger changes) * Tika is successfully built and unit tests pass by running `mvn clean test` * there should be no conflicts when merging the pull request branch into the *recent* `main` branch. If there are conflicts, please try to rebase the pull request branch on top of a freshly pulled `main` branch * if you add new module that downstream users will depend upon add it to relevant group in `tika-bom/pom.xml`. We will be able to faster integrate your pull request if these conditions are met. If you have any questions how to fix your problem or about using Tika in general, please sign up for the [Tika mailing list](http://tika.apache.org/mail-lists.html). Thanks! > Add support for ARC files > - > > Key: TIKA-4188 > URL: https://issues.apache.org/jira/browse/TIKA-4188 > Project: Tika > Issue Type: Improvement >Reporter: Gregory Lepore >Priority: Minor > > The original version of the Internet Archive's storage format is the ARC > format (later superseded by WARC and WACZ). > The ARC (Archive) format is a file format used for storing web archives. 
It > was developed by the Internet Archive to facilitate the mass storage of web > pages, capturing the content as it appeared on the Internet at specific > points in time. An ARC file is a single, large file that contains a sequence > of archived web resources. Each entry in an ARC file includes the URL of the > resource, the date it was captured, the HTTP response headers, and the > content of the resource itself (such as HTML pages, images, and other media > types). > The structure of an ARC file generally consists of a file header followed by > a series of records, each representing an individual web resource. The ARC > file can be gzipped using a two step process where each record in the ARC > file is gzipped, and then the entire file is gzipped. > The original ARC format specification is here: > [https://archive.org/web/researcher/ArcFileFormat.php] > The WARC format is currently supported via jwarc, which also appears to have > support for the ARC format (https://github.com/iipc/jwarc) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[PR] TIKA-4188 [tika]
tballison opened a new pull request, #1587: URL: https://github.com/apache/tika/pull/1587
[jira] [Updated] (TIKA-3841) An exception occurred when parsing some word documents using tika, tika_exception
[ https://issues.apache.org/jira/browse/TIKA-3841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-3841: -- Summary: An exception occurred when parsing some word documents using tika, tika_exception (was: An exception occurred when parsing some word documents using tikatika_exception) > An exception occurred when parsing some word documents using tika, > tika_exception > - > > Key: TIKA-3841 > URL: https://issues.apache.org/jira/browse/TIKA-3841 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.24, 2.4.1, 1.28.4 > Environment: h3. Java Version > java version "1.8.0_291" > h3. OS Version > Linux localhost.localdomain 3.10.0-957.el7.x86_64 > [#1|https://github.com/elastic/elasticsearch/issues/1] SMP Thu Nov 8 23:39:32 > UTC 2018 x86_64 x86_64 x86_64 GNU/Linux >Reporter: lxz >Priority: Blocker > > { > "error": { > "root_cause": [ > { "type": "parse_exception", "reason": "Error parsing > document in field [content]" } > ], > "type": "parse_exception", > "reason": "Error parsing document in field [content]", > "caused_by": { > "type": "tika_exception", > "reason": "Unexpected RuntimeException from > org.apache.tika.parser.microsoft.OfficeParser@3b5e180a", > "caused_by": > { "type": "array_index_out_of_bounds_exception", > "reason": "351" } > } > }, > "status": 400 > } -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (TIKA-3841) An exception occurred when parsing some word documents using tikatika_exception
[ https://issues.apache.org/jira/browse/TIKA-3841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tilman Hausherr updated TIKA-3841: -- Summary: An exception occurred when parsing some word documents using tikatika_exception (was: 使用tika解析部分word文档出现异常,tika_exception) > An exception occurred when parsing some word documents using > tikatika_exception > --- > > Key: TIKA-3841 > URL: https://issues.apache.org/jira/browse/TIKA-3841 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.24, 2.4.1, 1.28.4 > Environment: h3. Java Version > java version "1.8.0_291" > h3. OS Version > Linux localhost.localdomain 3.10.0-957.el7.x86_64 > [#1|https://github.com/elastic/elasticsearch/issues/1] SMP Thu Nov 8 23:39:32 > UTC 2018 x86_64 x86_64 x86_64 GNU/Linux >Reporter: lxz >Priority: Blocker > > { > "error": { > "root_cause": [ > { "type": "parse_exception", "reason": "Error parsing > document in field [content]" } > ], > "type": "parse_exception", > "reason": "Error parsing document in field [content]", > "caused_by": { > "type": "tika_exception", > "reason": "Unexpected RuntimeException from > org.apache.tika.parser.microsoft.OfficeParser@3b5e180a", > "caused_by": > { "type": "array_index_out_of_bounds_exception", > "reason": "351" } > } > }, > "status": 400 > } -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (TIKA-3841) 使用tika解析部分word文档出现异常,tika_exception
[ https://issues.apache.org/jira/browse/TIKA-3841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816035#comment-17816035 ] Lonzak edited comment on TIKA-3841 at 2/9/24 12:20 PM: --- My Chinese is a bit rusty so can someone change the title to: Exception when using tika to parse some Word documents, tika_exception ? Thanks was (Author: tom_1st): My chinese is a bit rusty so can someone change the title to: Exception when using tika to parse some Word documents, tika_exception ? Thanks > 使用tika解析部分word文档出现异常,tika_exception > --- > > Key: TIKA-3841 > URL: https://issues.apache.org/jira/browse/TIKA-3841 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.24, 2.4.1, 1.28.4 > Environment: h3. Java Version > java version "1.8.0_291" > h3. OS Version > Linux localhost.localdomain 3.10.0-957.el7.x86_64 > [#1|https://github.com/elastic/elasticsearch/issues/1] SMP Thu Nov 8 23:39:32 > UTC 2018 x86_64 x86_64 x86_64 GNU/Linux >Reporter: lxz >Priority: Blocker > > { > "error": { > "root_cause": [ > { "type": "parse_exception", "reason": "Error parsing > document in field [content]" } > ], > "type": "parse_exception", > "reason": "Error parsing document in field [content]", > "caused_by": { > "type": "tika_exception", > "reason": "Unexpected RuntimeException from > org.apache.tika.parser.microsoft.OfficeParser@3b5e180a", > "caused_by": > { "type": "array_index_out_of_bounds_exception", > "reason": "351" } > } > }, > "status": 400 > } -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (TIKA-3841) 使用tika解析部分word文档出现异常,tika_exception
[ https://issues.apache.org/jira/browse/TIKA-3841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816035#comment-17816035 ] Lonzak commented on TIKA-3841: -- My chinese is a bit rusty so can someone change the title to: Exception when using tika to parse some Word documents, tika_exception ? Thanks > 使用tika解析部分word文档出现异常,tika_exception > --- > > Key: TIKA-3841 > URL: https://issues.apache.org/jira/browse/TIKA-3841 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.24, 2.4.1, 1.28.4 > Environment: h3. Java Version > java version "1.8.0_291" > h3. OS Version > Linux localhost.localdomain 3.10.0-957.el7.x86_64 > [#1|https://github.com/elastic/elasticsearch/issues/1] SMP Thu Nov 8 23:39:32 > UTC 2018 x86_64 x86_64 x86_64 GNU/Linux >Reporter: lxz >Priority: Blocker > > { > "error": { > "root_cause": [ > { "type": "parse_exception", "reason": "Error parsing > document in field [content]" } > ], > "type": "parse_exception", > "reason": "Error parsing document in field [content]", > "caused_by": { > "type": "tika_exception", > "reason": "Unexpected RuntimeException from > org.apache.tika.parser.microsoft.OfficeParser@3b5e180a", > "caused_by": > { "type": "array_index_out_of_bounds_exception", > "reason": "351" } > } > }, > "status": 400 > } -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (TIKA-4194) tika fails to detect certain pkcs12 keystores types p12 pfx
Lonzak created TIKA-4194:

Summary: tika fails to detect certain pkcs12 keystores types p12 pfx
Key: TIKA-4194
URL: https://issues.apache.org/jira/browse/TIKA-4194
Project: Tika
Issue Type: Bug
Components: detector
Affects Versions: 2.9.1
Reporter: Lonzak

We use Tika to detect the type of an uploaded file. In most cases this works quite well. However, recently some files were rejected because Tika reported an invalid file type. We get
{code:java}
APPLICATION/OCTET-STREAM{code}
instead of
{code:java}
APPLICATION/X-X509-KEY{code}
I did an analysis and found that Tika doesn't recognize certain types of pkcs12 keystores. The test keystores can be found [here|https://github.com/redhat-qe-security/keyfile-corpus/tree/master].

I created a list to show which ones are affected. Out of 157 keystores, 132 are correctly detected and 25 are not.

||#||correct?||type||filename||
|1|OK|APPLICATION/X-X509-KEY; FORMAT=DER|dsa(1024,sha1),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|2|OK|APPLICATION/X-X509-KEY; FORMAT=DER|dsa(1024,sha1),cert(pbeWithSHAAnd40BitRC2-CBC,salt(8),iter(2048)),key(pbeWithSHAAnd3-KeyTripleDES-CBC,salt(8),iter(2048)),mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|3|OK|APPLICATION/X-X509-KEY; FORMAT=DER|ecdsa(P-256,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|4|OK|APPLICATION/X-X509-KEY; FORMAT=DER|ecdsa(P-256,sha256),cert(none),key(none),mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|5|OK|APPLICATION/X-X509-KEY; FORMAT=DER|ecdsa(P-256,sha256),cert(none),key(none).p12|
|6|OK|APPLICATION/X-X509-KEY; FORMAT=DER|ecdsa(P-256,sha256),cert(pbeWithSHAAnd40BitRC2-CBC,salt(8),iter(2048)),key(pbeWithSHAAnd3-KeyTripleDES-CBC,salt(8),iter(2048)),mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|7|OK|APPLICATION/X-X509-KEY; FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(0),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|8|OK|APPLICATION/X-X509-KEY; FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(0),iter(2048),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|9|OK|APPLICATION/X-X509-KEY; FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(16),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|10|OK|APPLICATION/X-X509-KEY; FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(16),iter(2048),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|11|OK|APPLICATION/X-X509-KEY; FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(64),iter(100),keyLen(default),prf(hmacWithSHA512)),aes-256-cbc(IV(16,mac(sha512,salt(64),iter(100)),pass(ascii).p12|
|12|OK|APPLICATION/X-X509-KEY; FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(1),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|13|OK|APPLICATION/X-X509-KEY; FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(1),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|14|OK|APPLICATION/X-X509-KEY; FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(100),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|15|OK|APPLICATION/X-X509-KEY; FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(100),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|16|OK|APPLICATION/X-X509-KEY; FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(16),prf(default)),rc2-cbc(keyBits(56=128bit),IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|17|OK|APPLICATION/X-X509-KEY; FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(16),prf(hmacWithSHA256)),rc2-cbc(keyBits(56=128bit),IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|18|OK|APPLICATION/X-X509-KEY; FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(5),prf(default)),rc2-cbc(keyBits(160=40bit),IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|19|OK|APPLICATION/X-X509-KEY; FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(5),prf(hmacWithSHA256)),rc2-cbc(keyBits(160=40bit),IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|20|OK|APPLICATION/X-X509-KEY; FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(8),prf(default)),rc2-cbc(keyBits(120=64bit),IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|21|OK|APPLICATION/X-X509-KEY;
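
For reference, a PKCS#12 file is an ASN.1 PFX SEQUENCE whose first element is the INTEGER version 3 (RFC 7292), so its leading bytes can be checked without any library. The following is only a minimal, self-contained sketch of such a check. The class and method names (Pkcs12Sniffer, looksLikePkcs12) are hypothetical, it is not Tika's detection logic, and it is not claimed to explain why 25 of the keystores in the report are missed. Accepting both definite and indefinite (BER) lengths is an assumption about what keystore tools may emit.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Tika.
final class Pkcs12Sniffer {

    /**
     * Returns true if the bytes start with an ASN.1 SEQUENCE (tag 0x30)
     * whose first element is INTEGER 3 (bytes 02 01 03), i.e. the
     * "version" field of a PKCS#12 PFX structure.
     */
    static boolean looksLikePkcs12(byte[] b) {
        // Shortest possible match is 30 LL 02 01 03 (5 bytes).
        if (b == null || b.length < 5 || (b[0] & 0xFF) != 0x30) {
            return false;
        }
        int lengthByte = b[1] & 0xFF;
        int offset;
        if (lengthByte <= 0x80) {
            // Short-form length (< 0x80) or BER indefinite length (0x80):
            // the content starts right after this single byte.
            offset = 2;
        } else {
            // Long-form length: the low 7 bits give the number of length bytes.
            int lengthBytes = lengthByte & 0x7F;
            if (lengthBytes > 4) {
                return false; // implausibly large for a keystore
            }
            offset = 2 + lengthBytes;
        }
        if (offset + 3 > b.length) {
            return false;
        }
        return b[offset] == 0x02 && b[offset + 1] == 0x01 && b[offset + 2] == 0x03;
    }

    public static void main(String[] args) throws IOException {
        for (String name : args) {
            try (InputStream in = Files.newInputStream(Paths.get(name))) {
                byte[] head = new byte[16];
                // A single read may return fewer bytes than requested;
                // good enough for a sketch that only needs the first few.
                int n = in.read(head);
                byte[] prefix = Arrays.copyOf(head, Math.max(n, 0));
                System.out.println(name + " -> " + looksLikePkcs12(prefix));
            }
        }
    }
}
```

Running it with one or more keystore paths as arguments prints one line per file. A check this loose also matches other ASN.1 structures that begin with INTEGER 3, so it can only narrow down candidates, not confirm the type.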
[jira] [Commented] (TIKA-4166) dependency updates for Tika 3.0
[ https://issues.apache.org/jira/browse/TIKA-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816010#comment-17816010 ]

Hudson commented on TIKA-4166:
--
SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1501 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1501/])
TIKA-4166: update jwarc (tilman: [https://github.com/apache/tika/commit/7abd05d99caf10d0752db4f36b0fe87214d25394])
* (edit) tika-parent/pom.xml

> dependency updates for Tika 3.0
> ---
>
> Key: TIKA-4166
> URL: https://issues.apache.org/jira/browse/TIKA-4166
> Project: Tika
> Issue Type: Task
> Components: build
> Reporter: Tilman Hausherr
> Priority: Minor
> Fix For: 3.0.0-BETA
>
>
> Separate ticket for updates for 3.0, especially those not found by dependabot.

--
This message was sent by Atlassian Jira (v8.20.10#820010)