[jira] [Commented] (TIKA-4194) tika fails to detect certain pkcs12 keystores types p12 pfx

2024-02-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816199#comment-17816199
 ] 

Tim Allison commented on TIKA-4194:
---

Right, sorry, misunderstood, here's the magic: 
https://github.com/apache/tika/blob/7d48d00ac1febfb1ac70e4887268b28fb4951b78/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L5236

> tika fails to detect certain pkcs12 keystores types p12 pfx
> ---
>
> Key: TIKA-4194
> URL: https://issues.apache.org/jira/browse/TIKA-4194
> Project: Tika
> Issue Type: Bug
> Components: detector
> Affects Versions: 2.9.1
> Reporter: Lonzak
> Priority: Major
>
> We use Tika to detect the type of uploaded files. In most cases 
> this works quite well. Recently, however, some files were rejected because Tika 
> reported an unexpected file type. We get
> {code:java}
> APPLICATION/OCTET-STREAM{code}
> instead of
> {code:java}
> APPLICATION/X-X509-KEY{code}
> I analysed this and found that Tika doesn't recognize certain kinds of 
> PKCS#12 keystores. The test keystores can be found 
> [here|https://github.com/redhat-qe-security/keyfile-corpus/tree/master].
> I created a list to show which ones are affected. Out of 157 keystores, 132 
> are correctly detected and 25 are not.
>  
> ||#||correct?||type||filename||
> |1|OK|APPLICATION/X-X509-KEY; FORMAT=DER|dsa(1024,sha1),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |2|OK|APPLICATION/X-X509-KEY; FORMAT=DER|dsa(1024,sha1),cert(pbeWithSHAAnd40BitRC2-CBC,salt(8),iter(2048)),key(pbeWithSHAAnd3-KeyTripleDES-CBC,salt(8),iter(2048)),mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |3|OK|APPLICATION/X-X509-KEY; FORMAT=DER|ecdsa(P-256,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |4|OK|APPLICATION/X-X509-KEY; FORMAT=DER|ecdsa(P-256,sha256),cert(none),key(none),mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |5|OK|APPLICATION/X-X509-KEY; FORMAT=DER|ecdsa(P-256,sha256),cert(none),key(none).p12|
> |6|OK|APPLICATION/X-X509-KEY; FORMAT=DER|ecdsa(P-256,sha256),cert(pbeWithSHAAnd40BitRC2-CBC,salt(8),iter(2048)),key(pbeWithSHAAnd3-KeyTripleDES-CBC,salt(8),iter(2048)),mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |7|OK|APPLICATION/X-X509-KEY; FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(0),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |8|OK|APPLICATION/X-X509-KEY; FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(0),iter(2048),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |9|OK|APPLICATION/X-X509-KEY; FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(16),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |10|OK|APPLICATION/X-X509-KEY; FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(16),iter(2048),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |11|OK|APPLICATION/X-X509-KEY; FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(64),iter(100),keyLen(default),prf(hmacWithSHA512)),aes-256-cbc(IV(16,mac(sha512,salt(64),iter(100)),pass(ascii).p12|
> |12|OK|APPLICATION/X-X509-KEY; FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(1),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |13|OK|APPLICATION/X-X509-KEY; FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(1),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |14|OK|APPLICATION/X-X509-KEY; FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(100),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |15|OK|APPLICATION/X-X509-KEY; FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(100),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |16|OK|APPLICATION/X-X509-KEY; FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(16),prf(default)),rc2-cbc(keyBits(56=128bit),IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |17|OK|APPLICATION/X-X509-KEY; FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(16),prf(hmacWithSHA256)),rc2-cbc(keyBits(56=128bit),IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |18|OK|APPLICATION/X-X509-KEY; FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(5),prf(default)),rc2-cbc(keyBits(160=40bit),IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|

[jira] [Commented] (TIKA-3784) Detector returns "application/x-x509-key" when scanning a .p12 file

2024-02-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816198#comment-17816198
 ] 

Tim Allison commented on TIKA-3784:
---

I've just attached a dump of running that on all the *.p12 files in the repo 
mentioned on TIKA-4194

> Detector returns "application/x-x509-key" when scanning a .p12 file
> ---
>
> Key: TIKA-3784
> URL: https://issues.apache.org/jira/browse/TIKA-3784
> Project: Tika
> Issue Type: Bug
> Components: detector
> Affects Versions: 1.26
> Reporter: Matthias Hofbauer
> Priority: Critical
> Attachments: dump_p12s.txt
>
>
> We use Tika to check whether the MIME type implied by a file's extension 
> matches the MIME type of the file's content.
> After upgrading from tika-core 1.22 to 1.26, this check no longer works 
> for certificate files of type .p12, .pfx, .cer, and .der.
> For the .p12 and .pfx extensions the MIME type is "application/x-pkcs12", but 
> the Tika detector returns "application/x-x509-key" instead.
> After checking tika-mimetypes.xml and comparing it to my .p12 file, I found 
> the following MIME magic, which explains why I got these types back.
> {code:xml}
> 
>     
>     
>     
>     
>     
>                      mask="0x00FC" offset="0"/>
>                      mask="0xFC" offset="0"/>
>     
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (TIKA-3784) Detector returns "application/x-x509-key" when scanning a .p12 file

2024-02-09 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-3784:
--
Attachment: dump_p12s.txt


[jira] [Comment Edited] (TIKA-3784) Detector returns "application/x-x509-key" when scanning a .p12 file

2024-02-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816193#comment-17816193
 ] 

Tim Allison edited comment on TIKA-3784 at 2/9/24 7:27 PM:
---

ObjectIdentifiers? 1.2.840.113549.1.7.1 
(https://oidref.com/1.2.840.113549.1.7.1), but that's just pkcs7 data type, 
nothing specific to pkcs12?

Possibly useful? 
https://learn.microsoft.com/en-us/windows/win32/api/wincrypt/ns-wincrypt-crypt_algorithm_identifier




[jira] [Commented] (TIKA-3784) Detector returns "application/x-x509-key" when scanning a .p12 file

2024-02-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816193#comment-17816193
 ] 

Tim Allison commented on TIKA-3784:
---

ObjectIdentifiers? 1.2.840.113549.1.7.1 
(https://oidref.com/1.2.840.113549.1.7.1), but that's just pkcs7 data type, 
nothing specific to pkcs12?


[jira] [Commented] (TIKA-3784) Detector returns "application/x-x509-key" when scanning a .p12 file

2024-02-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816192#comment-17816192
 ] 

Tim Allison commented on TIKA-3784:
---

Or verbose mode:

{noformat}
Sequence
    Integer(3)
    Sequence
        ObjectIdentifier(1.2.840.113549.1.7.1)
        Tagged [CONTEXT 0]
            DER Octet String[2603]
                30820a273082043d06092a864886f70d010706a082042e3082042a0201003082
                042306092a864886f70d01070130819106092a864886f70d01050d3081833062
                06092a864886f70d01050c305504409230fd49789d0b306756c36540fc618f09
                05fc94d32f934c12312e92b80a737058f3919dc41b024b14c9a3eb45922628b5
                14f013842a44af174b26b7f30dd88402030f4240300c06082a864886f70d020b
                0500301d060960864801650304012a0410121c481cab9a38375a29a8eb4d4703
                f380820380b4abd22fed0e82635ae69b52d94ea261ce899cb7ceaa833706d6e6
                6c8a20fccd910e314306f737506a41a83b87db4ada1aec0bf274eccf7743e4b3
                f4e939f34d5b61a87521e5ac5fd7a558791d0de123097ccd020cf33e82767403
                7058be6836a97220209c99c18997dd4aaaf6376e8daea39e92baacec67b89a14
                2b4ac3693ec47b51d8cc61b991da01f35628fdae4d1009dd3c8dbc01c062e706
                1c6aa5834cf8dc0879f2cb1168be6449b53fbd6dc9e62ca8256e6da3a2d19906
                5cdbdeaec0999743c9020f154155b717b6b1678bf987794e60608fefd0571ae1
                6f8123f30c45d02d19b2d9e854343ab37cd5798d94f63898601a37fd019c9bc8
                881f6c6efebf91df9666a5e1b177e3deb8eb1ce4ed42b6a73ec771540d9dd419
                f46d420c1ef3128a550a5cc5f36ef402f72710e8ac6a7661fb7d3387a5f84de5
                3c1424db71d44382a19b7a70f94e06aff9d2c88b213daaf437f6d524d22524c5
                cb7d9d6ba2cea9ebcb99dffeb8fb0d9138075e0ef8f12035dabfe289c2d73448
                e815bfb34eecd8d7d77a6beda1b9db71ff083e92afa92e32937c54c492c8847e
                b1244ee5ee0ccbb2fbe0f7bf277cd3ea6c53123e96feea6866d6d8d9af9c4b94
                bc07ee4a285459b6a221bf04f7bef730efb108db1a157e9a4d622122f440bf9f
                248109387f9362ed59ed44686828f3ff060888bd34a76350aa680ac8b1b7e0c1
                d0c05d2a3ee91a346413511b48362e58dffbeae55789ae179307032e9fabc033
                8e2d1a16af308205e206092a864886f70d010701a08205d3048205cf308205cb
                308205c7060b2a864886f70d010c0a0102a082056c3082056830819106092a86
                4886f70d01050d308183306206092a864886f70d01050c305504407e483ed9a8
                2f508286cc2b702fce42d3f49e1212899c2306da11428d0a2d0c5753842afea9
                318e126f5a9210a35eef41201ade1621342275bbdc1e1203c5c3ac02030f4240
                300c06082a864886f70d020b0500301d060960864801650304012a04105e90cb
                77a055e0be074a4e8f80f84aee048204d00e38277e485269d64f955a49ed5247
                e5a24f6717b07a91d15574d46bd2b442d0a13ab34fec55be179bfdf6eae15eee
                01aa261854275d2d93be49898b759880b8f47249b04d735803b6f16102da8909
                cf74c73172a798a209943e65efec491a3bf09ccf022acdfb1e0bb6fd9a50a1d9
                bc01c848dbbb7c8e66fa0349ef29445ada763b62427b00fa87bc2846d1b9f690
                0066d4a6d2ead07f7b1b83f8a8f7ca69200a45a5f39ecb0476f9cf09d57b2e63
                da8a0faf0faf1fb0e924716695d9f207f8fc977ba7793e1aa65efe7652fdf83d
                750738ce443365ae96b94f275ce0e13af5f2cb400722053a9f618d56f108314b
                b7183342c26f9fde3a075d92597c509fd570bd9917d773921445e03eaf00fe24
                48227e6d4492dc51e975dfde2fac15e8c9b1ee812958bd00200177228c39a0c1
                b8d1b92545153cfd3064dc8d464cfada545d7ed8b96b8cbeb58d4ecb19572633
                82d86f727c4a5cbebf127c9de46e6566e61ef779922903c3df79d0c84743cf25
                e9287c48c8535bf71cf6b0104160a762f403ce7da791e3c38781bcffc1537f16
                bd6d9d9a55ac3475d0a82e8aa9d98e10525fc2795ebeef45c9bf50924059b01d
                f715839f37e44969f3ec248f19c030cdd0a693d5749f7686b704d27f1a52ca00
                98d6e856d4efece298383aecfa82aa19bc98eef9e9f333a656479ffa0dd85c6e
                df0fa6cd82d1c5115a507b9abebfb716d03729cc196d41382890482e0259b9f5
                4016a1b718ac074ba88aa9a06d2dd6134578caeb7513b64a5e93c4970b8f9460
 

[jira] [Commented] (TIKA-3784) Detector returns "application/x-x509-key" when scanning a .p12 file

2024-02-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816191#comment-17816191
 ] 

Tim Allison commented on TIKA-3784:
---

[~nick] (and cc [~tom_1st] from TIKA-4194), I agree that parsing these things 
would probably be best done in a container detector. When I run ASN1Dump on one 
of the p12 files, I get this:

{noformat}
Sequence
    Integer(3)
    Sequence
        ObjectIdentifier(1.2.840.113549.1.7.1)
        Tagged [CONTEXT 0]
            DER Octet String[2603]
    Sequence
        Sequence
            Sequence
                ObjectIdentifier(2.16.840.1.101.3.4.2.3)
                NULL
            DER Octet String[64]
        DER Octet String[64]
        Integer(100)
{noformat}

Is there anything in there I can use to detect p12?
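
One possible answer, sketched in Python purely for illustration (this is not Tika or BouncyCastle code, and `looks_like_pkcs12` and `read_header` are made-up names): RFC 7292 defines the PFX structure as a SEQUENCE whose first element is the version INTEGER 3, followed by a ContentInfo whose content type is PKCS#7 data or signedData. The dump above fits that shape, so a detector could check those leading fields. Whether this is selective enough in practice is an assumption that would need testing against a corpus.

```python
# OID content bytes (without tag and length).
PKCS7_DATA = bytes.fromhex("2a864886f70d010701")         # 1.2.840.113549.1.7.1
PKCS7_SIGNED_DATA = bytes.fromhex("2a864886f70d010702")  # 1.2.840.113549.1.7.2


def read_header(buf, pos):
    """Return (tag, length, content_offset); length is None for indefinite form."""
    tag = buf[pos]
    first = buf[pos + 1]
    if first < 0x80:            # short form
        return tag, first, pos + 2
    if first == 0x80:           # indefinite form
        return tag, None, pos + 2
    count = first & 0x7F        # long form: count of length bytes
    length = int.from_bytes(buf[pos + 2:pos + 2 + count], "big")
    return tag, length, pos + 2 + count


def looks_like_pkcs12(buf):
    """Heuristic: SEQUENCE { INTEGER 3, SEQUENCE { OID data|signedData ... } ... }."""
    try:
        tag, _, pos = read_header(buf, 0)
        if tag != 0x30:                              # outer SEQUENCE
            return False
        tag, length, pos = read_header(buf, pos)
        if tag != 0x02 or length != 1 or buf[pos] != 3:   # version INTEGER 3
            return False
        pos += 1
        tag, _, pos = read_header(buf, pos)
        if tag != 0x30:                              # ContentInfo SEQUENCE
            return False
        tag, length, pos = read_header(buf, pos)
        if tag != 0x06 or length is None:            # contentType OID
            return False
        return buf[pos:pos + length] in (PKCS7_DATA, PKCS7_SIGNED_DATA)
    except IndexError:
        return False                                 # buffer too short
```

Because only the headers are read, the same check covers both the `30 82` and the `30 80` variants discussed on TIKA-4194; it never needs to find the end of the outer SEQUENCE.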


[jira] [Comment Edited] (TIKA-4194) tika fails to detect certain pkcs12 keystores types p12 pfx

2024-02-09 Thread Lonzak (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816132#comment-17816132
 ] 

Lonzak edited comment on TIKA-4194 at 2/9/24 5:52 PM:
--

I read a bit [more|https://stackoverflow.com/a/31451808/2311528]. The whole 
context is ASN.1 DER encoding. So it is not magic bytes but ASN.1 encoding...

"30 82" is followed by two further bytes that give the length of the 
SEQUENCE explicitly. This allows objects of up to 65535 (0xFFFF) bytes 
to be encoded.

"30 80", on the other hand, signals the start of a SEQUENCE with an indefinite 
length (a BER form that strict DER does not allow). The length of the SEQUENCE 
is not stated in advance. Instead, the end of the SEQUENCE is marked by the 
end-of-contents (EOC) octets "00 00". This encoding is typically used when 
the total length of the SEQUENCE is not known at the time of encoding or when 
it is practical to treat the data as a stream.
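
As a rough illustration of the length forms described above, here is a minimal Python sketch (not Tika code; the function name `decode_length` is made up for this example) that classifies the length octets following an ASN.1 tag byte:

```python
def decode_length(buf, pos=1):
    """Classify the BER/DER length octets starting at buf[pos].

    Returns (form, length, octets_used); length is None for the
    indefinite form, whose content ends at an EOC marker (00 00).
    """
    first = buf[pos]
    if first < 0x80:
        # Short form: the byte itself is the length (0..127).
        return ("short", first, 1)
    if first == 0x80:
        # Indefinite form: no length is given up front.
        return ("indefinite", None, 1)
    if first == 0xFF:
        raise ValueError("0xFF is reserved as a first length octet")
    # Long form: the low 7 bits give the number of length bytes that follow.
    count = first & 0x7F
    octets = buf[pos + 1:pos + 1 + count]
    if len(octets) != count:
        raise ValueError("truncated length octets")
    return ("long", int.from_bytes(octets, "big"), 1 + count)


print(decode_length(bytes.fromhex("30821029")))  # "30 82" followed by two length bytes
print(decode_length(bytes.fromhex("3080")))      # "30 80", indefinite length
```

The third element of the result shows why the content starts at a different offset in the two cases: three length octets after the tag for `30 82 LL LL`, but only one for `30 80`.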

 

To cover both cases, one could add a rule or make the existing rule more 
flexible. Adapting the current rule directly to include {{0x3080}} could be 
difficult, because the length indication and the layout of the subsequent 
content differ. Instead, we might need a new rule that specifically targets 
keystores starting with {{0x3080}}. Note, however, that detecting content 
with indefinite length is harder, as one may not be able to simply check for 
a specific byte sequence after {{0x3080}}.
{code:java}
[40/application/x-x509-key; format=der string 0 0x3080??]{code}
In this hypothetical rule, {{??}} is a placeholder: the handling of 
content with indefinite length still needs to be worked out, possibly with 
logic that recognizes the end of the stream instead of relying on fixed 
byte patterns.
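
One way to picture an additional rule that does not track the end of the stream is to test only the leading bytes at fixed offsets. The Python sketch below is hypothetical: the rule table, its names, and the `02 01 03` version check (taken from the hex dumps elsewhere in this thread) are assumptions for illustration, not Tika's rule syntax or behaviour.

```python
# Each hypothetical rule is a list of (offset, expected bytes) checks.
RULES = {
    "definite (30 82 LL LL)": [
        (0, bytes.fromhex("3082")),    # SEQUENCE, two length bytes follow
        (4, bytes.fromhex("020103")),  # INTEGER 3 right after the length
    ],
    "indefinite (30 80)": [
        (0, bytes.fromhex("3080")),    # SEQUENCE, indefinite length
        (2, bytes.fromhex("020103")),  # INTEGER 3 two bytes earlier
    ],
}


def matching_rule(data):
    """Return the name of the first rule whose checks all pass, else None."""
    for name, checks in RULES.items():
        if all(data[off:off + len(exp)] == exp for off, exp in checks):
            return name
    return None
```

This only inspects a prefix and says nothing about whether the EOC marker is present, so it trades strictness for simplicity.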



[jira] [Commented] (TIKA-4188) Add support for ARC files

2024-02-09 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816165#comment-17816165
 ] 

Hudson commented on TIKA-4188:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1502 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1502/])
TIKA-4188 (#1587) (github: 
[https://github.com/apache/tika/commit/7d48d00ac1febfb1ac70e4887268b28fb4951b78])
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pkg-module/src/main/java/org/apache/tika/detect/gzip/GZipSpecializationDetector.java
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-webarchive-module/src/test/resources/test-documents/testARC.arc
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-webarchive-module/src/test/resources/test-documents/example.arc.gz
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-webarchive-module/src/main/java/org/apache/tika/parser/warc/WARCParser.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-webarchive-module/src/test/java/org/apache/tika/parser/warc/WARCParserTest.java


> Add support for ARC files
> -
>
> Key: TIKA-4188
> URL: https://issues.apache.org/jira/browse/TIKA-4188
> Project: Tika
> Issue Type: Improvement
> Reporter: Gregory Lepore
> Priority: Minor
> Fix For: 3.0.0
>
>
> The original version of the Internet Archive's storage format is the ARC 
> format (later superseded by WARC and WACZ). 
> The ARC (Archive) format is a file format used for storing web archives. It 
> was developed by the Internet Archive to facilitate the mass storage of web 
> pages, capturing the content as it appeared on the Internet at specific 
> points in time. An ARC file is a single, large file that contains a sequence 
> of archived web resources. Each entry in an ARC file includes the URL of the 
> resource, the date it was captured, the HTTP response headers, and the 
> content of the resource itself (such as HTML pages, images, and other media 
> types).
> The structure of an ARC file generally consists of a file header followed by 
> a series of records, each representing an individual web resource. The ARC 
> file can be gzipped using a two step process where each record in the ARC 
> file is gzipped, and then the entire file is gzipped.
> The original ARC format specification is here:
> [https://archive.org/web/researcher/ArcFileFormat.php]
> The WARC format is currently supported via jwarc, which also appears to have 
> support for the ARC format (https://github.com/iipc/jwarc)





[jira] [Commented] (TIKA-4194) tika fails to detect certain pkcs12 keystores types p12 pfx

2024-02-09 Thread Lonzak (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816132#comment-17816132
 ] 

Lonzak commented on TIKA-4194:
--

I read a bit more. The whole context is ASN.1 DER encoding.

"30 82" is followed by two further bytes that give the length of the 
SEQUENCE explicitly. This allows objects of up to 65535 (0xFFFF) bytes 
to be encoded.

"30 80", on the other hand, signals the start of a SEQUENCE with an indefinite 
length (a BER form that strict DER does not allow). The length of the SEQUENCE 
is not stated in advance. Instead, the end of the SEQUENCE is marked by the 
end-of-contents (EOC) octets "00 00". This encoding is typically used when 
the total length of the SEQUENCE is not known at the time of encoding or when 
it is practical to treat the data as a stream.

 

To cover both cases, one could add a rule or make the existing rule more 
flexible. Adapting the current rule directly to include {{0x3080}} could be 
difficult, because the length indication and the layout of the subsequent 
content differ. Instead, we might need a new rule that specifically targets 
keystores starting with {{0x3080}}. Note, however, that detecting content 
with indefinite length is harder, as one may not be able to simply check for 
a specific byte sequence after {{0x3080}}.
{code:java}
[40/application/x-x509-key; format=der string 0 0x3080??]{code}
In this hypothetical rule, {{??}} is a placeholder: the handling of 
content with indefinite length still needs to be worked out, possibly with 
logic that recognizes the end of the stream instead of relying on fixed 
byte patterns.

[jira] [Comment Edited] (TIKA-4194) tika fails to detect certain pkcs12 keystores types p12 pfx

2024-02-09 Thread Lonzak (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816122#comment-17816122
 ] 

Lonzak edited comment on TIKA-4194 at 2/9/24 4:15 PM:
--

I investigated a bit further (though my knowledge in this area is quite 
limited):

Tika does indeed look at the bytes. A working keystore is matched by the 
following "Magic" matcher:

[40/application/x-x509-key; format=der string 0 0x3082020100 
0xFC]

If I open that file in a hex editor I can see:

 
{code:java}
0x 30 82 10 29 02 01 03 30 82 (Bits from a working keystore)
0x 30 82 FF FF 02 01 00   (magic bytes from the Magic class)
{code}
This seems to match except for the FF and last 00 values. (Maybe these bytes 
are ignored?)

 

If I open a non working one I get:
{code:java}
0x 30 80 02 01 03 30 80 (Bits from a non working keystore)
0x 30 82 FF FF 02 01 00 (magic bytes from the Magic class){code}
The second byte differs (80 instead of 82), so I would guess that it is not a 
match. But the remaining bytes also seem to be shifted?
{code:java}
0x 30 80   02 01 03 30 80 (Bits from a non working keystore)
0x 30 82 10 29 02 01 03 30 82 (Bits from a working keystore)
0x 30 82 FF FF 02 01 00   (magic bytes from the Magic class){code}
So an approach could be to add the missing magic bytes to an existing/new Magic 
class?

 

So maybe a matcher:

{{magic=0x3080FF3080}}

would work?
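
To see which byte stops the match, the comparison can be sketched with a pattern and a mask. This is only an illustration in Python: the pattern is the `30 82 FF FF 02 01 00` sequence quoted above, while the mask value and the rule `(data & mask) == (pattern & mask)` are assumptions about how masked magic matching typically works, not a description of Tika's Magic class.

```python
PATTERN = bytes.fromhex("3082ffff020100")  # magic bytes quoted in this comment
MASK = bytes.fromhex("ffff0000fffffc")     # assumed mask: 00 ignores a byte,
                                           # fc ignores the two lowest bits


def first_mismatch(data, pattern=PATTERN, mask=MASK):
    """Index of the first byte failing the masked comparison, or None if all match.

    If data is shorter than the pattern, the index where data ran out is returned.
    """
    if len(data) < len(pattern):
        return len(data)
    for i, (d, p, m) in enumerate(zip(data, pattern, mask)):
        if (d & m) != (p & m):
            return i
    return None


working = bytes.fromhex("308210290201033082")
non_working = bytes.fromhex("30800201033080")
print(first_mismatch(working))      # None: every masked byte matches
print(first_mismatch(non_working))  # 1: 0x80 where the pattern expects 0x82
```

Under that assumed mask, the two length bytes are ignored and the final byte accepts 00 through 03, which would explain why the working keystore matches despite the FF FF and 00 differences.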


was (Author: tom_1st):
I did investigate a bit further - (however my knowledge in this area is quite 
limited):

Tika is indeed looking at the bytes - a working keystore has the following 
"Magic" matcher:

[40/application/x-x509-key; format=der string 0 0x3082020100 
0xFC]

If I open that file in a hex editor I can see:

 
{code:java}
0x 30 82 10 29 02 01 03 30 82 (Bits from a working keystore)
0x 30 82 FF FF 02 01 00   (magic bytes from the Magic class)
{code}
This seems to match except for the FF and last 00 values. (Maybe these bytes 
are ignored?)

 

If I open a non-working one I get:
{code:java}
0x 30 80 02 01 03 30 80 (Bytes from a non-working keystore)
0x 30 82 FF FF 02 01 00 (magic bytes from the Magic class){code}
The second byte differs, so I would guess that this is why it does not match. But 
the bytes also seem to be shifted?
{code:java}
0x 30 80       02 01 03 30 80 (Bytes from a non-working keystore)
0x 30 82 10 29 02 01 03 30 82 (Bytes from a working keystore)
0x 30 82 FF FF 02 01 00   (magic bytes from the Magic class){code}
So an approach could be to add the missing magic bytes to an existing/new Magic 
class?

> tika fails to detect certain pkcs12 keystores types p12 pfx
> ---
>
> Key: TIKA-4194
> URL: https://issues.apache.org/jira/browse/TIKA-4194
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 2.9.1
>Reporter: Lonzak
>Priority: Major
>
> We use tika to detect the type of a file which is uploaded. In most cases 
> this works quite well. However recently some files were rejected because tika 
> reports an invalid file type. We'll get
> {code:java}
> APPLICATION/OCTET-STREAM{code}
> instead of
> {code:java}
> APPLICATION/X-X509-KEY{code}
> I did an analysis and found that tika doesn't recognize certain types of 
> pkcs12 keystores. The test keystores can be found 
> [here|https://github.com/redhat-qe-security/keyfile-corpus/tree/master].
> I created a list to show which ones are affected. Out of 157 keystores, 132 
> are correctly detected and 25 are not.
>  
> ||#||correct?||type||filename||
> |1|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|dsa(1024,sha1),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |2|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|dsa(1024,sha1),cert(pbeWithSHAAnd40BitRC2-CBC,salt(8),iter(2048)),key(pbeWithSHAAnd3-KeyTripleDES-CBC,salt(8),iter(2048)),mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |3|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|ecdsa(P-256,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |4|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|ecdsa(P-256,sha256),cert(none),key(none),mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |5|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|ecdsa(P-256,sha256),cert(none),key(none).p12|
> |6|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|ecdsa(P-256,sha256),cert(pbeWithSHAAnd40BitRC2-CBC,salt(8),iter(2048)),key(pbeWithSHAAnd3-KeyTripleDES-CBC,salt(8),iter(2048)),mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |7|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(0),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> 

[jira] [Commented] (TIKA-4194) tika fails to detect certain pkcs12 keystores types p12 pfx

2024-02-09 Thread Lonzak (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816122#comment-17816122
 ] 

Lonzak commented on TIKA-4194:
--

I investigated a bit further (although my knowledge in this area is quite 
limited):

Tika is indeed looking at the bytes. A working keystore is matched by the following 
"Magic" matcher:

[40/application/x-x509-key; format=der string 0 0x3082020100 
0xFC]

If I open that file in a hex editor I can see:

 
{code:java}
0x 30 82 10 29 02 01 03 30 82 (Bytes from a working keystore)
0x 30 82 FF FF 02 01 00       (magic bytes from the Magic class)
{code}
This seems to match except for the FF and last 00 values. (Maybe these bytes 
are ignored?)

 

If I open a non-working one I get:
{code:java}
0x 30 80 02 01 03 30 80 (Bytes from a non-working keystore)
0x 30 82 FF FF 02 01 00 (magic bytes from the Magic class){code}
The second byte differs, so I would guess that this is why it does not match. But 
the bytes also seem to be shifted?
{code:java}
0x 30 80       02 01 03 30 80 (Bytes from a non-working keystore)
0x 30 82 10 29 02 01 03 30 82 (Bytes from a working keystore)
0x 30 82 FF FF 02 01 00   (magic bytes from the Magic class){code}
So an approach could be to add the missing magic bytes to an existing/new Magic 
class?

> tika fails to detect certain pkcs12 keystores types p12 pfx
> ---
>
> Key: TIKA-4194
> URL: https://issues.apache.org/jira/browse/TIKA-4194
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 2.9.1
>Reporter: Lonzak
>Priority: Major
>
> We use tika to detect the type of a file which is uploaded. In most cases 
> this works quite well. However recently some files were rejected because tika 
> reports an invalid file type. We'll get
> {code:java}
> APPLICATION/OCTET-STREAM{code}
> instead of
> {code:java}
> APPLICATION/X-X509-KEY{code}
> I did an analysis and found that tika doesn't recognize certain types of 
> pkcs12 keystores. The test keystores can be found 
> [here|https://github.com/redhat-qe-security/keyfile-corpus/tree/master].
> I created a list to show which ones are affected. Out of 157 keystores, 132 
> are correctly detected and 25 are not.
>  
> ||#||correct?||type||filename||
> |1|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|dsa(1024,sha1),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |2|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|dsa(1024,sha1),cert(pbeWithSHAAnd40BitRC2-CBC,salt(8),iter(2048)),key(pbeWithSHAAnd3-KeyTripleDES-CBC,salt(8),iter(2048)),mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |3|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|ecdsa(P-256,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |4|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|ecdsa(P-256,sha256),cert(none),key(none),mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |5|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|ecdsa(P-256,sha256),cert(none),key(none).p12|
> |6|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|ecdsa(P-256,sha256),cert(pbeWithSHAAnd40BitRC2-CBC,salt(8),iter(2048)),key(pbeWithSHAAnd3-KeyTripleDES-CBC,salt(8),iter(2048)),mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |7|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(0),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |8|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(0),iter(2048),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |9|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(16),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |10|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(16),iter(2048),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |11|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(64),iter(100),keyLen(default),prf(hmacWithSHA512)),aes-256-cbc(IV(16,mac(sha512,salt(64),iter(100)),pass(ascii).p12|
> |12|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(1),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |13|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(1),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |14|OK|APPLICATION/X-X509-KEY; 
> 

[jira] [Commented] (TIKA-4188) Add support for ARC files

2024-02-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816121#comment-17816121
 ] 

ASF GitHub Bot commented on TIKA-4188:
--

tballison merged PR #1587:
URL: https://github.com/apache/tika/pull/1587




> Add support for ARC files
> -
>
> Key: TIKA-4188
> URL: https://issues.apache.org/jira/browse/TIKA-4188
> Project: Tika
>  Issue Type: Improvement
>Reporter: Gregory Lepore
>Priority: Minor
>
> The original version of the Internet Archive's storage format is the ARC 
> format (later superseded by WARC and WACZ). 
> The ARC (Archive) format is a file format used for storing web archives. It 
> was developed by the Internet Archive to facilitate the mass storage of web 
> pages, capturing the content as it appeared on the Internet at specific 
> points in time. An ARC file is a single, large file that contains a sequence 
> of archived web resources. Each entry in an ARC file includes the URL of the 
> resource, the date it was captured, the HTTP response headers, and the 
> content of the resource itself (such as HTML pages, images, and other media 
> types).
> The structure of an ARC file generally consists of a file header followed by 
> a series of records, each representing an individual web resource. The ARC 
> file can be gzipped using a two step process where each record in the ARC 
> file is gzipped, and then the entire file is gzipped.
> The original ARC format specification is here:
> [https://archive.org/web/researcher/ArcFileFormat.php]
> The WARC format is currently supported via jwarc, which also appears to have 
> support for the ARC format (https://github.com/iipc/jwarc)
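
The per-record compression step described above can be pictured as a file made of concatenated gzip members. The Python sketch below shows only that step, with made-up record payloads standing in for real ARC records; it does not implement the ARC record layout or the second whole-file pass mentioned in the description.

```python
import gzip

# Hypothetical payloads standing in for ARC records.
records = [b"record one\n", b"record two\n", b"record three\n"]

# Compress each record as its own gzip member and concatenate the members.
archive = b"".join(gzip.compress(r) for r in records)

# gzip.decompress handles multi-member data and returns the joined content.
restored = gzip.decompress(archive)
print(restored == b"".join(records))
```

Because every member is a complete gzip stream, a reader that knows a record's byte offset can decompress that record without touching the rest of the file.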



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (TIKA-4188) Add support for ARC files

2024-02-09 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-4188.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

> Add support for ARC files
> -
>
> Key: TIKA-4188
> URL: https://issues.apache.org/jira/browse/TIKA-4188
> Project: Tika
>  Issue Type: Improvement
>Reporter: Gregory Lepore
>Priority: Minor
> Fix For: 3.0.0
>
>
> The original version of the Internet Archive's storage format is the ARC 
> format (later superseded by WARC and WACZ). 
> The ARC (Archive) format is a file format used for storing web archives. It 
> was developed by the Internet Archive to facilitate the mass storage of web 
> pages, capturing the content as it appeared on the Internet at specific 
> points in time. An ARC file is a single, large file that contains a sequence 
> of archived web resources. Each entry in an ARC file includes the URL of the 
> resource, the date it was captured, the HTTP response headers, and the 
> content of the resource itself (such as HTML pages, images, and other media 
> types).
> The structure of an ARC file generally consists of a file header followed by 
> a series of records, each representing an individual web resource. The ARC 
> file can be gzipped using a two step process where each record in the ARC 
> file is gzipped, and then the entire file is gzipped.
> The original ARC format specification is here:
> [https://archive.org/web/researcher/ArcFileFormat.php]
> The WARC format is currently supported via jwarc, which also appears to have 
> support for the ARC format (https://github.com/iipc/jwarc)





Re: [PR] TIKA-4188 [tika]

2024-02-09 Thread via GitHub


tballison merged PR #1587:
URL: https://github.com/apache/tika/pull/1587


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (TIKA-3784) Detector returns "application/x-x509-key" when scanning a .p12 file

2024-02-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816115#comment-17816115
 ] 

Tim Allison commented on TIKA-3784:
---

Based on this: 
https://stackoverflow.com/questions/30483489/how-to-decode-asn-1-data-in-java

If we parsed the ASN.1, what would we look for? Does BouncyCastle have a 
detector?
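
As one possible answer to the first question, and only as a sketch: RFC 7292 defines a PKCS#12 file (PFX) as a SEQUENCE that starts with an INTEGER version of 3, followed by a ContentInfo whose content type is the PKCS#7 `data` or `signedData` object identifier. The Python heuristic below checks just those first elements. `looks_like_pkcs12` and `_header` are hypothetical names, the sample inputs are synthetic prefixes built for illustration, and a real detector would need testing against actual keystores.

```python
PKCS7_DATA = bytes.fromhex("2A 86 48 86 F7 0D 01 07 01")         # 1.2.840.113549.1.7.1
PKCS7_SIGNED_DATA = bytes.fromhex("2A 86 48 86 F7 0D 01 07 02")  # 1.2.840.113549.1.7.2


def _header(data: bytes, pos: int):
    """Return (tag, length_or_None, content_start) for one BER header."""
    tag, first = data[pos], data[pos + 1]
    if first < 0x80:                      # short form
        return tag, first, pos + 2
    if first == 0x80:                     # indefinite form
        return tag, None, pos + 2
    count = first & 0x7F                  # long form
    length = int.from_bytes(data[pos + 2:pos + 2 + count], "big")
    return tag, length, pos + 2 + count


def looks_like_pkcs12(data: bytes) -> bool:
    """Heuristic check of the first few ASN.1 elements of a PFX structure."""
    try:
        tag, _, pos = _header(data, 0)           # PFX ::= SEQUENCE
        if tag != 0x30:
            return False
        tag, length, pos = _header(data, pos)    # version INTEGER, expected 3
        if tag != 0x02 or length != 1 or data[pos] != 3:
            return False
        tag, _, pos = _header(data, pos + 1)     # authSafe ContentInfo ::= SEQUENCE
        if tag != 0x30:
            return False
        tag, length, pos = _header(data, pos)    # contentType OBJECT IDENTIFIER
        if tag != 0x06 or length is None:
            return False
        return data[pos:pos + length] in (PKCS7_DATA, PKCS7_SIGNED_DATA)
    except IndexError:
        return False                             # ran out of bytes


# Synthetic prefixes: definite-length and indefinite-length outer SEQUENCE.
definite = bytes.fromhex("30 82 10 29 02 01 03 30 82 0F E0 06 09 2A 86 48 86 F7 0D 01 07 01")
indefinite = bytes.fromhex("30 80 02 01 03 30 80 06 09 2A 86 48 86 F7 0D 01 07 01")
# A SEQUENCE that starts with another SEQUENCE, as a DER certificate would.
other = bytes.fromhex("30 82 03 10 30 82 01 F8")

print(looks_like_pkcs12(definite), looks_like_pkcs12(indefinite), looks_like_pkcs12(other))
```

The check reads only a couple of dozen bytes, so it would separate PKCS#12 from other DER structures that share the 30 82 prefix without parsing the whole file.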

> Detector returns "application/x-x509-key" when scanning a .p12 file
> ---
>
> Key: TIKA-3784
> URL: https://issues.apache.org/jira/browse/TIKA-3784
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 1.26
>Reporter: Matthias Hofbauer
>Priority: Critical
>
> We are using tika to check if the MIME type of the file extensions matches 
> with the MIME type of the file content.
> After our upgrade from tika-core 1.22 to 1.26 our logic does not work anymore 
> for certificates of type .p12, .pfx, .cer, .der.
> For the .p12 and .pfx extension the MIME type is "application/x-pkcs12" but 
> the tika detector returns "application/x-x509-key" instead.
> After checking tika-mimetypes.xml and comparing it to my .p12 file, I found 
> the following MIME magic, which explains why I got these types back.
> {code:xml}
> 
>     
>     
>     
>     
>     
>                      mask="0x00FC" offset="0"/>
>                      mask="0xFC" offset="0"/>
>     
>  {code}





[jira] [Commented] (TIKA-4194) tika fails to detect certain pkcs12 keystores types p12 pfx

2024-02-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816113#comment-17816113
 ] 

Tim Allison commented on TIKA-4194:
---

Looks like "30 82" is the magic for DER X.509 certificates? 
https://en.wikipedia.org/wiki/List_of_file_signatures

Maybe this is useful: 
https://tls12.xargs.org/certificate.html#server-certificate-detail/annotated ?

> tika fails to detect certain pkcs12 keystores types p12 pfx
> ---
>
> Key: TIKA-4194
> URL: https://issues.apache.org/jira/browse/TIKA-4194
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 2.9.1
>Reporter: Lonzak
>Priority: Major
>
> We use tika to detect the type of a file which is uploaded. In most cases 
> this works quite well. However recently some files were rejected because tika 
> reports an invalid file type. We'll get
> {code:java}
> APPLICATION/OCTET-STREAM{code}
> instead of
> {code:java}
> APPLICATION/X-X509-KEY{code}
> I did an analysis and found that tika doesn't recognize certain types of 
> pkcs12 keystores. The test keystores can be found 
> [here|https://github.com/redhat-qe-security/keyfile-corpus/tree/master].
> I created a list to show which ones are affected. Out of 157 keystores, 132 
> are correctly detected and 25 are not.
>  
> ||#||correct?||type||filename||
> |1|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|dsa(1024,sha1),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |2|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|dsa(1024,sha1),cert(pbeWithSHAAnd40BitRC2-CBC,salt(8),iter(2048)),key(pbeWithSHAAnd3-KeyTripleDES-CBC,salt(8),iter(2048)),mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |3|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|ecdsa(P-256,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |4|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|ecdsa(P-256,sha256),cert(none),key(none),mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |5|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|ecdsa(P-256,sha256),cert(none),key(none).p12|
> |6|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|ecdsa(P-256,sha256),cert(pbeWithSHAAnd40BitRC2-CBC,salt(8),iter(2048)),key(pbeWithSHAAnd3-KeyTripleDES-CBC,salt(8),iter(2048)),mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |7|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(0),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |8|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(0),iter(2048),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |9|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(16),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |10|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(16),iter(2048),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |11|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(64),iter(100),keyLen(default),prf(hmacWithSHA512)),aes-256-cbc(IV(16,mac(sha512,salt(64),iter(100)),pass(ascii).p12|
> |12|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(1),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |13|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(1),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |14|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(100),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |15|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(100),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |16|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(16),prf(default)),rc2-cbc(keyBits(56=128bit),IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |17|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(16),prf(hmacWithSHA256)),rc2-cbc(keyBits(56=128bit),IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |18|OK|APPLICATION/X-X509-KEY; 
> 

[jira] [Commented] (TIKA-4194) tika fails to detect certain pkcs12 keystores types p12 pfx

2024-02-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816111#comment-17816111
 ] 

Tim Allison commented on TIKA-4194:
---

Thank you for opening this!

Are you able to take a look here: 
https://github.com/apache/tika/blob/main/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L4551
 

And maybe open a PR to update that?

Frankly, it looks like Tika is not even looking at the bytes. Do PKCS#12 files have 
a magic we can use for detection?

> tika fails to detect certain pkcs12 keystores types p12 pfx
> ---
>
> Key: TIKA-4194
> URL: https://issues.apache.org/jira/browse/TIKA-4194
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 2.9.1
>Reporter: Lonzak
>Priority: Major
>
> We use tika to detect the type of a file which is uploaded. In most cases 
> this works quite well. However recently some files were rejected because tika 
> reports an invalid file type. We'll get
> {code:java}
> APPLICATION/OCTET-STREAM{code}
> instead of
> {code:java}
> APPLICATION/X-X509-KEY{code}
> I did an analysis and found that tika doesn't recognize certain types of 
> pkcs12 keystores. The test keystores can be found 
> [here|https://github.com/redhat-qe-security/keyfile-corpus/tree/master].
> I created a list to show which ones are affected. Out of 157 keystores, 132 
> are correctly detected and 25 are not.
>  
> ||#||correct?||type||filename||
> |1|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|dsa(1024,sha1),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |2|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|dsa(1024,sha1),cert(pbeWithSHAAnd40BitRC2-CBC,salt(8),iter(2048)),key(pbeWithSHAAnd3-KeyTripleDES-CBC,salt(8),iter(2048)),mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |3|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|ecdsa(P-256,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |4|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|ecdsa(P-256,sha256),cert(none),key(none),mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |5|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|ecdsa(P-256,sha256),cert(none),key(none).p12|
> |6|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|ecdsa(P-256,sha256),cert(pbeWithSHAAnd40BitRC2-CBC,salt(8),iter(2048)),key(pbeWithSHAAnd3-KeyTripleDES-CBC,salt(8),iter(2048)),mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |7|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(0),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |8|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(0),iter(2048),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |9|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(16),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |10|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(16),iter(2048),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |11|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(64),iter(100),keyLen(default),prf(hmacWithSHA512)),aes-256-cbc(IV(16,mac(sha512,salt(64),iter(100)),pass(ascii).p12|
> |12|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(1),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |13|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(1),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |14|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(100),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |15|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(100),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |16|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(16),prf(default)),rc2-cbc(keyBits(56=128bit),IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |17|OK|APPLICATION/X-X509-KEY; 
> FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(16),prf(hmacWithSHA256)),rc2-cbc(keyBits(56=128bit),IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
> |18|OK|APPLICATION/X-X509-KEY; 
> 

[jira] [Commented] (TIKA-4188) Add support for ARC files

2024-02-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816095#comment-17816095
 ] 

ASF GitHub Bot commented on TIKA-4188:
--

tballison opened a new pull request, #1587:
URL: https://github.com/apache/tika/pull/1587

   
   
   Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! 
Your help is appreciated!
   
   Before opening the pull request, please verify that
   * there is an open issue on the [Tika issue 
tracker](https://issues.apache.org/jira/projects/TIKA) which describes the 
problem or the improvement. We cannot accept pull requests without an issue 
because the change wouldn't be listed in the release notes.
   * the issue ID (`TIKA-`)
 - is referenced in the title of the pull request
 - and placed in front of your commit messages surrounded by square 
brackets (`[TIKA-] Issue or pull request title`)
   * commits are squashed into a single one (or few commits for larger changes)
   * Tika is successfully built and unit tests pass by running `mvn clean test`
   * there should be no conflicts when merging the pull request branch into the 
*recent* `main` branch. If there are conflicts, please try to rebase the pull 
request branch on top of a freshly pulled `main` branch
   * if you add a new module that downstream users will depend upon, add it to 
the relevant group in `tika-bom/pom.xml`.
   
   We will be able to integrate your pull request faster if these conditions 
are met. If you have any questions about how to fix your problem or about using Tika 
in general, please sign up for the [Tika mailing 
list](http://tika.apache.org/mail-lists.html). Thanks!
   




> Add support for ARC files
> -
>
> Key: TIKA-4188
> URL: https://issues.apache.org/jira/browse/TIKA-4188
> Project: Tika
>  Issue Type: Improvement
>Reporter: Gregory Lepore
>Priority: Minor
>
> The original version of the Internet Archive's storage format is the ARC 
> format (later superseded by WARC and WACZ). 
> The ARC (Archive) format is a file format used for storing web archives. It 
> was developed by the Internet Archive to facilitate the mass storage of web 
> pages, capturing the content as it appeared on the Internet at specific 
> points in time. An ARC file is a single, large file that contains a sequence 
> of archived web resources. Each entry in an ARC file includes the URL of the 
> resource, the date it was captured, the HTTP response headers, and the 
> content of the resource itself (such as HTML pages, images, and other media 
> types).
> The structure of an ARC file generally consists of a file header followed by 
> a series of records, each representing an individual web resource. The ARC 
> file can be gzipped using a two step process where each record in the ARC 
> file is gzipped, and then the entire file is gzipped.
> The original ARC format specification is here:
> [https://archive.org/web/researcher/ArcFileFormat.php]
> The WARC format is currently supported via jwarc, which also appears to have 
> support for the ARC format (https://github.com/iipc/jwarc)





[PR] TIKA-4188 [tika]

2024-02-09 Thread via GitHub


tballison opened a new pull request, #1587:
URL: https://github.com/apache/tika/pull/1587

   
   
   





[jira] [Updated] (TIKA-3841) An exception occurred when parsing some word documents using tika, tika_exception

2024-02-09 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-3841:
--
Summary: An exception occurred when parsing some word documents using tika, 
tika_exception  (was: An exception occurred when parsing some word documents 
using tikatika_exception)

> An exception occurred when parsing some word documents using tika, 
> tika_exception
> -
>
> Key: TIKA-3841
> URL: https://issues.apache.org/jira/browse/TIKA-3841
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.24, 2.4.1, 1.28.4
> Environment: h3. Java Version
> java version "1.8.0_291"
> h3. OS Version
> Linux localhost.localdomain 3.10.0-957.el7.x86_64 #1 SMP Thu Nov 8 23:39:32 
> UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: lxz
>Priority: Blocker
>
> {
>     "error": {
>         "root_cause": [
>             {
>                 "type": "parse_exception",
>                 "reason": "Error parsing document in field [content]"
>             }
>         ],
>         "type": "parse_exception",
>         "reason": "Error parsing document in field [content]",
>         "caused_by": {
>             "type": "tika_exception",
>             "reason": "Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@3b5e180a",
>             "caused_by": {
>                 "type": "array_index_out_of_bounds_exception",
>                 "reason": "351"
>             }
>         }
>     },
>     "status": 400
> }





[jira] [Updated] (TIKA-3841) An exception occurred when parsing some word documents using tikatika_exception

2024-02-09 Thread Tilman Hausherr (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-3841:
--
Summary: An exception occurred when parsing some word documents using 
tikatika_exception  (was: 使用tika解析部分word文档出现异常,tika_exception)

> An exception occurred when parsing some word documents using 
> tikatika_exception
> ---
>
> Key: TIKA-3841
> URL: https://issues.apache.org/jira/browse/TIKA-3841
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.24, 2.4.1, 1.28.4
> Environment: h3. Java Version
> java version "1.8.0_291"
> h3. OS Version
> Linux localhost.localdomain 3.10.0-957.el7.x86_64 #1 SMP Thu Nov 8 23:39:32 
> UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: lxz
>Priority: Blocker
>
> {
>     "error": {
>         "root_cause": [
>             {
>                 "type": "parse_exception",
>                 "reason": "Error parsing document in field [content]"
>             }
>         ],
>         "type": "parse_exception",
>         "reason": "Error parsing document in field [content]",
>         "caused_by": {
>             "type": "tika_exception",
>             "reason": "Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@3b5e180a",
>             "caused_by": {
>                 "type": "array_index_out_of_bounds_exception",
>                 "reason": "351"
>             }
>         }
>     },
>     "status": 400
> }





[jira] [Comment Edited] (TIKA-3841) 使用tika解析部分word文档出现异常,tika_exception

2024-02-09 Thread Lonzak (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816035#comment-17816035
 ] 

Lonzak edited comment on TIKA-3841 at 2/9/24 12:20 PM:
---

My Chinese is a bit rusty so can someone change the title to: Exception when 
using tika to parse some Word documents, tika_exception ? Thanks


was (Author: tom_1st):
My chinese is a bit rusty so can someone change the title to: Exception when 
using tika to parse some Word documents, tika_exception ? Thanks

> 使用tika解析部分word文档出现异常,tika_exception
> ---
>
> Key: TIKA-3841
> URL: https://issues.apache.org/jira/browse/TIKA-3841
> Project: Tika
>  Issue Type: Bug
>  Components: parser
> Affects Versions: 1.24, 2.4.1, 1.28.4
> Environment: h3. Java Version
> java version "1.8.0_291"
> h3. OS Version
> Linux localhost.localdomain 3.10.0-957.el7.x86_64 #1 SMP Thu Nov 8 23:39:32 
> UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
> Reporter: lxz
> Priority: Blocker
>
> {
>     "error": {
>         "root_cause": [
>             {
>                 "type": "parse_exception",
>                 "reason": "Error parsing document in field [content]"
>             }
>         ],
>         "type": "parse_exception",
>         "reason": "Error parsing document in field [content]",
>         "caused_by": {
>             "type": "tika_exception",
>             "reason": "Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@3b5e180a",
>             "caused_by": {
>                 "type": "array_index_out_of_bounds_exception",
>                 "reason": "351"
>             }
>         }
>     },
>     "status": 400
> }





[jira] [Commented] (TIKA-3841) 使用tika解析部分word文档出现异常,tika_exception

2024-02-09 Thread Lonzak (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816035#comment-17816035
 ] 

Lonzak commented on TIKA-3841:
--

My Chinese is a bit rusty, so could someone change the title to "Exception when 
using tika to parse some Word documents, tika_exception"? Thanks.

> 使用tika解析部分word文档出现异常,tika_exception
> ---
>
> Key: TIKA-3841
> URL: https://issues.apache.org/jira/browse/TIKA-3841
> Project: Tika
>  Issue Type: Bug
>  Components: parser
> Affects Versions: 1.24, 2.4.1, 1.28.4
> Environment: h3. Java Version
> java version "1.8.0_291"
> h3. OS Version
> Linux localhost.localdomain 3.10.0-957.el7.x86_64 #1 SMP Thu Nov 8 23:39:32 
> UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
> Reporter: lxz
> Priority: Blocker
>
> {
>     "error": {
>         "root_cause": [
>             {
>                 "type": "parse_exception",
>                 "reason": "Error parsing document in field [content]"
>             }
>         ],
>         "type": "parse_exception",
>         "reason": "Error parsing document in field [content]",
>         "caused_by": {
>             "type": "tika_exception",
>             "reason": "Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@3b5e180a",
>             "caused_by": {
>                 "type": "array_index_out_of_bounds_exception",
>                 "reason": "351"
>             }
>         }
>     },
>     "status": 400
> }





[jira] [Created] (TIKA-4194) tika fails to detect certain pkcs12 keystores types p12 pfx

2024-02-09 Thread Lonzak (Jira)
Lonzak created TIKA-4194:


 Summary: tika fails to detect certain pkcs12 keystores types p12 pfx
 Key: TIKA-4194
 URL: https://issues.apache.org/jira/browse/TIKA-4194
 Project: Tika
  Issue Type: Bug
  Components: detector
Affects Versions: 2.9.1
Reporter: Lonzak


We use tika to detect the type of each uploaded file. In most cases this 
works quite well. Recently, however, some files were rejected because tika 
reported the wrong file type. We get
{code:java}
APPLICATION/OCTET-STREAM{code}
instead of
{code:java}
APPLICATION/X-X509-KEY{code}
I did an analysis and found that tika doesn't recognize certain types of pkcs12 
keystores. The test keystores can be found 
[here|https://github.com/redhat-qe-security/keyfile-corpus/tree/master].
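
To cross-check such files independently of tika, a rough header check could look like the sketch below. It relies only on the PFX definition in RFC 7292 (an ASN.1 SEQUENCE whose first element is the INTEGER version 3) and is NOT tika's actual magic definition; the class and method names are hypothetical and chosen for illustration. It makes no claim about why the 25 files are missed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

public class Pkcs12HeaderSketch {

    /**
     * Heuristic: true if the bytes start like a PKCS#12 PFX structure,
     * i.e. SEQUENCE tag (0x30), a length field, then INTEGER 3 (02 01 03).
     * Assumption: this is an illustration, not tika's detection rule.
     */
    static boolean looksLikePkcs12(byte[] head) {
        if (head.length < 5 || (head[0] & 0xFF) != 0x30) {
            return false;
        }
        int lenByte = head[1] & 0xFF;
        int offset; // index of the first byte after the length field
        if (lenByte <= 0x80) {
            // short form (one length byte) or indefinite length (0x80, BER)
            offset = 2;
        } else {
            // long form: the low 7 bits give the number of length bytes
            int extra = lenByte & 0x7F;
            if (extra > 4) {
                return false;
            }
            offset = 2 + extra;
        }
        if (head.length < offset + 3) {
            return false;
        }
        return (head[offset] & 0xFF) == 0x02      // INTEGER tag
            && (head[offset + 1] & 0xFF) == 0x01  // one content byte
            && (head[offset + 2] & 0xFF) == 0x03; // version v3
    }

    public static void main(String[] args) throws IOException {
        // Prints the heuristic result for each file path given on the command line.
        for (String arg : args) {
            byte[] buf = new byte[16];
            int n;
            try (InputStream in = Files.newInputStream(Paths.get(arg))) {
                n = in.readNBytes(buf, 0, buf.length);
            }
            System.out.println(arg + ": " + looksLikePkcs12(Arrays.copyOf(buf, n)));
        }
    }
}
```

Running it over the corpus and comparing the result with tika's output per file might help narrow down which header variants are not matched.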

I created a list to show which ones are affected. Out of 157 keystores, 132 are 
correctly detected and 25 are not.

 
||#||correct?||type||filename||
|1|OK|APPLICATION/X-X509-KEY; FORMAT=DER|dsa(1024,sha1),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|2|OK|APPLICATION/X-X509-KEY; FORMAT=DER|dsa(1024,sha1),cert(pbeWithSHAAnd40BitRC2-CBC,salt(8),iter(2048)),key(pbeWithSHAAnd3-KeyTripleDES-CBC,salt(8),iter(2048)),mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|3|OK|APPLICATION/X-X509-KEY; FORMAT=DER|ecdsa(P-256,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|4|OK|APPLICATION/X-X509-KEY; FORMAT=DER|ecdsa(P-256,sha256),cert(none),key(none),mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|5|OK|APPLICATION/X-X509-KEY; FORMAT=DER|ecdsa(P-256,sha256),cert(none),key(none).p12|
|6|OK|APPLICATION/X-X509-KEY; FORMAT=DER|ecdsa(P-256,sha256),cert(pbeWithSHAAnd40BitRC2-CBC,salt(8),iter(2048)),key(pbeWithSHAAnd3-KeyTripleDES-CBC,salt(8),iter(2048)),mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|7|OK|APPLICATION/X-X509-KEY; FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(0),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|8|OK|APPLICATION/X-X509-KEY; FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(0),iter(2048),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|9|OK|APPLICATION/X-X509-KEY; FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(16),iter(2048),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|10|OK|APPLICATION/X-X509-KEY; FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(16),iter(2048),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|11|OK|APPLICATION/X-X509-KEY; FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(64),iter(100),keyLen(default),prf(hmacWithSHA512)),aes-256-cbc(IV(16,mac(sha512,salt(64),iter(100)),pass(ascii).p12|
|12|OK|APPLICATION/X-X509-KEY; FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(1),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|13|OK|APPLICATION/X-X509-KEY; FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(1),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|14|OK|APPLICATION/X-X509-KEY; FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(100),keyLen(default),prf(default)),aes-128-cbc(IV(16,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|15|OK|APPLICATION/X-X509-KEY; FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(100),keyLen(default),prf(default)),des-ede3-cbc(IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|16|OK|APPLICATION/X-X509-KEY; FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(16),prf(default)),rc2-cbc(keyBits(56=128bit),IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|17|OK|APPLICATION/X-X509-KEY; FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(16),prf(hmacWithSHA256)),rc2-cbc(keyBits(56=128bit),IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|18|OK|APPLICATION/X-X509-KEY; FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(5),prf(default)),rc2-cbc(keyBits(160=40bit),IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|19|OK|APPLICATION/X-X509-KEY; FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(5),prf(hmacWithSHA256)),rc2-cbc(keyBits(160=40bit),IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|20|OK|APPLICATION/X-X509-KEY; FORMAT=DER|rsa(2048,sha256),cert(PBES2(PBKDF2(salt(8),iter(2048),keyLen(8),prf(default)),rc2-cbc(keyBits(120=64bit),IV(8,mac(sha1,salt(8),iter(2048)),pass(ascii).p12|
|21|OK|APPLICATION/X-X509-KEY; 

[jira] [Commented] (TIKA-4166) dependency updates for Tika 3.0

2024-02-09 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816010#comment-17816010
 ] 

Hudson commented on TIKA-4166:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1501 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1501/])
TIKA-4166: update jwarc (tilman: 
[https://github.com/apache/tika/commit/7abd05d99caf10d0752db4f36b0fe87214d25394])
* (edit) tika-parent/pom.xml


> dependency updates for Tika 3.0
> ---
>
> Key: TIKA-4166
> URL: https://issues.apache.org/jira/browse/TIKA-4166
> Project: Tika
>  Issue Type: Task
>  Components: build
> Reporter: Tilman Hausherr
> Priority: Minor
> Fix For: 3.0.0-BETA
>
>
> Separate ticket for updates for 3.0, especially those not found by dependabot.


