[jira] [Commented] (TIKA-3768) message/rfc822 does not include Headers in extracted text

2022-06-05 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17550216#comment-17550216
 ] 

Nick Burch commented on TIKA-3768:
--

I wouldn't expect to find those in the textual content after parsing; those 
fields should end up in the Metadata object instead.

We have a bunch of unit tests for mail parsing which show that, for our test 
files at least, subject + from + to are all coming through. See 
[https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/parser/mail/RFC822ParserTest.java]

Are you able to compare your code with that in the unit test, and spot any 
differences between the working test and yours? Bonus marks if you can write a 
small failing JUnit test that shows the issue with your file.

> message/rfc822 does not include Headers in extracted text
> -
>
> Key: TIKA-3768
> URL: https://issues.apache.org/jira/browse/TIKA-3768
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.4.0
> Reporter: Sam Stephens
> Priority: Major
> Attachments: email.txt
>
>
> When running AutoDetectParser on message/rfc822 structured text documents, 
> such as the attached [^email.txt], the extracted text does not include any of 
> the headers, such as the Subject and From and To lines.
> However these lines contain useful text I'd like to be able to extract. I'm 
> surprised it's not there based on the include everything bias I saw on 
> https://issues.apache.org/jira/browse/TIKA-3710.
> Interestingly, if I exclude org.apache.tika.parser.mail.RFC822Parser as a 
> parser, my debugging appears to show 
> org.apache.tika.parser.csv.TextAndCSVParser being used for parsing, and we 
> get the full text, but the returned content type is 'message/rfc822; 
> charset=windows-1252'.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (TIKA-3784) Detector returns "application/x-x509-key" when scanning a .p12 file

2022-06-05 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17550223#comment-17550223
 ] 

Nick Burch commented on TIKA-3784:
--

We don't currently have any Mime Magic for PKCS12 files

Based on 
[https://stackoverflow.com/questions/33239875/jks-bks-and-pkcs12-file-formats] 
it won't be an easy one to cope with, since we don't currently have an ASN.1 
container detector

I think we can potentially get away with a slightly hacky approach similar to 
what we do for the PKCS7 signature, where we look for a few variants and hope 
the right entry comes first... "openssl asn1parse" should help with working out 
what to look for

(Assuming no-one has a bit of time to knock up an ASN.1 container detector 
based on the BouncyCastle ASN.1 support, using an approach similar to 
[https://stackoverflow.com/questions/10190795/parsing-asn-1-binary-data-with-java]
 )
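
For illustration only, here is a rough sketch of the "look for a few variants" 
idea in plain Java. It is not Tika's detector and not existing Tika magic. It 
assumes the PFX structure from RFC 7292, which is an ASN.1 SEQUENCE whose first 
element is the INTEGER version 3, so the bytes after the SEQUENCE header should 
be 02 01 03. The class name Pkcs12Sniffer and the method looksLikePkcs12 are 
hypothetical names made up for this example.

```java
import java.util.Arrays;

/** Hypothetical heuristic sniffer; a sketch, not Tika's actual detection logic. */
final class Pkcs12Sniffer {

    // DER encoding of INTEGER 3, the PFX version field (assumption: RFC 7292 layout).
    private static final byte[] VERSION_3 = {0x02, 0x01, 0x03};

    private Pkcs12Sniffer() {
    }

    /** Returns true if the leading bytes look like SEQUENCE { INTEGER 3, ... }. */
    static boolean looksLikePkcs12(byte[] head) {
        if (head == null || head.length < 2 || (head[0] & 0xFF) != 0x30) {
            return false; // does not start with an ASN.1 SEQUENCE tag
        }
        int lengthByte = head[1] & 0xFF;
        int contentStart;
        if (lengthByte <= 0x80) {
            // Short-form length (< 0x80) or BER indefinite length (0x80):
            // the content starts right after this single length byte.
            contentStart = 2;
        } else {
            // Long-form length: the low 7 bits give the number of length bytes.
            int lengthBytes = lengthByte & 0x7F;
            if (lengthBytes > 4) {
                return false; // implausibly large for a key store
            }
            contentStart = 2 + lengthBytes;
        }
        int end = contentStart + VERSION_3.length;
        if (head.length < end) {
            return false; // not enough bytes to decide
        }
        return Arrays.equals(Arrays.copyOfRange(head, contentStart, end), VERSION_3);
    }

    public static void main(String[] args) {
        // Typical start of a .p12 file: 30 82 <len> <len> 02 01 03 ...
        byte[] sample = {0x30, (byte) 0x82, 0x09, 0x51, 0x02, 0x01, 0x03, 0x30};
        System.out.println(looksLikePkcs12(sample));
    }
}
```

A DER-encoded RSA private key also starts with a SEQUENCE followed by an 
INTEGER, but its version is 0 rather than 3, so the check above would not match 
it. Other ASN.1 structures could still begin the same way, so this is only a 
heuristic; a proper ASN.1 container detector would be more robust.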

> Detector returns "application/x-x509-key" when scanning a .p12 file
> ---
>
> Key: TIKA-3784
> URL: https://issues.apache.org/jira/browse/TIKA-3784
> Project: Tika
> Issue Type: Bug
> Components: detector
> Affects Versions: 1.26
> Reporter: Matthias Hofbauer
> Priority: Critical
>
> We are using Tika to check whether the MIME type implied by a file's 
> extension matches the MIME type of the file content.
> After our upgrade from tika-core 1.22 to 1.26 our logic does not work anymore 
> for certificates of type .p12, .pfx, .cer, .der.
> For the .p12 and .pfx extensions the MIME type is "application/x-pkcs12" but 
> the Tika detector returns "application/x-x509-key" instead.
> After checking tika-mimetypes.xml and comparing it to my .p12 file I found 
> the following MIME magic, which explains why I got these types back.
> {code:xml}
> 
>     
>     
>     
>     
>     
>                      mask="0x00FC" offset="0"/>
>                      mask="0xFC" offset="0"/>
>     
>  {code}





[jira] [Commented] (TIKA-3784) Detector returns "application/x-x509-key" when scanning a .p12 file

2022-06-05 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17550231#comment-17550231
 ] 

Hudson commented on TIKA-3784:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #629 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/629/])
Tests for encrypted RSA keys in PEM and DER, plus a disabled PKCS12 test 
pending TIKA-3784 (nick: 
[https://github.com/apache/tika/commit/6bf9ee120c2845ccdf61207322dcea2373388e75])
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/mime/TestMimeTypes.java

