[jira] [Commented] (TIKA-4167) CONTENT_TYPE_USER_OVERRIDE doesn't force content type for application/illustrator files

2023-11-06 Thread Sam Stephens (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-4167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17783431#comment-17783431
 ] 

Sam Stephens commented on TIKA-4167:


Thanks Tim. I don't have a use case where I need this behavior. But I had a 
test for the invariant that when CONTENT_TYPE_USER_OVERRIDE is provided, the 
"Content-Type" in the returned metadata reflects that, and I was surprised to 
see that invariant violated.

I think this is probably a documentation issue. Reading 
https://tika.apache.org/2.9.1/detection.html#The_Detector_Interface, I didn't 
understand that CONTENT_TYPE_USER_OVERRIDE only applies to detection, and that 
the parser the override selects is then free to choose the content type it 
thinks is most appropriate. You could consider making this a little clearer in 
that documentation.

For my specific use case, regardless of whether my incoming file has an 
explicit content type to use for CONTENT_TYPE_USER_OVERRIDE or whether I'm 
using auto-detection, I always want to detect PDFs as application/pdf; I have 
no interest in or use for the subtypes. But I think it's reasonable for that 
behavior to be something I implement by post-processing the returned Tika 
metadata (basically, if Content-Type is application/illustrator and dc:format 
starts with application/pdf, I use application/pdf as the content type).
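
A minimal sketch of that post-processing rule, assuming the Tika metadata has 
already been copied into a plain Map<String, String>. The class and method 
names are hypothetical and not part of Tika; only the "Content-Type" and 
"dc:format" keys come from the discussion above.

```java
import java.util.Map;

// Hypothetical helper (not a Tika class): collapses the Illustrator subtype
// back to application/pdf when dc:format shows the file is a PDF underneath.
final class ContentTypeNormalizer {
    private static final String PDF = "application/pdf";
    private static final String ILLUSTRATOR = "application/illustrator";

    private ContentTypeNormalizer() {}

    // Returns application/pdf when Content-Type is application/illustrator and
    // dc:format starts with application/pdf; otherwise returns the reported
    // Content-Type unchanged (which may be null if the key is absent).
    static String normalize(Map<String, String> metadata) {
        String contentType = metadata.get("Content-Type");
        String format = metadata.get("dc:format");
        if (ILLUSTRATOR.equals(contentType) && format != null && format.startsWith(PDF)) {
            return PDF;
        }
        return contentType;
    }
}
```

Keeping the rule outside Tika means it applies the same way whether the 
content type came from the override or from auto-detection.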

> CONTENT_TYPE_USER_OVERRIDE doesn't force content type for 
> application/illustrator files
> ---
>
> Key: TIKA-4167
> URL: https://issues.apache.org/jira/browse/TIKA-4167
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.9.1
>Reporter: Sam Stephens
>Priority: Minor
>
> When I use AutoDetectParser, with Metadata set to
> {TikaCoreProperties.CONTENT_TYPE_USER_OVERRIDE: "application/pdf"},
> to parse a PDF-like Illustrator file
> (https://github.com/apache/tika/blob/78be82565df4cc3bbc88308be8d686019a10b899/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/test/resources/test-documents/testPDF_AdobeIllustrator.pdf),
> the "Content-Type" in the returned metadata is "application/illustrator", 
> not "application/pdf".
> I think this is happening because "application/illustrator" is a subtype of 
> "application/pdf".



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (TIKA-4167) CONTENT_TYPE_USER_OVERRIDE doesn't force content type for application/illustrator files

2023-11-06 Thread Sam Stephens (Jira)
Sam Stephens created TIKA-4167:
--

 Summary: CONTENT_TYPE_USER_OVERRIDE doesn't force content type for 
application/illustrator files
 Key: TIKA-4167
 URL: https://issues.apache.org/jira/browse/TIKA-4167
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 2.9.1
Reporter: Sam Stephens


When I use AutoDetectParser, with Metadata set to
{TikaCoreProperties.CONTENT_TYPE_USER_OVERRIDE: "application/pdf"},
to parse a PDF-like Illustrator file
(https://github.com/apache/tika/blob/78be82565df4cc3bbc88308be8d686019a10b899/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/src/test/resources/test-documents/testPDF_AdobeIllustrator.pdf),
the "Content-Type" in the returned metadata is "application/illustrator", not 
"application/pdf".

I think this is happening because "application/illustrator" is a subtype of 
"application/pdf".





[jira] [Commented] (TIKA-3768) message/rfc822 does not include Headers in extracted text

2022-06-08 Thread Sam Stephens (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551871#comment-17551871
 ] 

Sam Stephens commented on TIKA-3768:


Ah, interesting, this is a case of me misunderstanding the product then.

This means that in order to actually get all the text possible out of a file, I 
need to examine both the actual text and the metadata (I'm using this for 
building a search over documents of many types).

The challenge then is that some fields in the metadata object are sourced from 
text in the document (such as {{dc:subject}} and {{Message-From}}) and should 
be searchable, while others are not (such as {{Content-Type}} and 
{{X-TIKA:Parsed-By}}) and should not be searchable.

Is there any documentation of the set of possible metadata fields? The 
constants inherited by 
[https://tika.apache.org/2.4.0/api/org/apache/tika/metadata/Metadata.html] 
don't appear to be a complete set, as I don't see {{dc:subject}} amongst them.

It looks to me like I could strip out fields like {{Content-Type}} as listed in 
[https://tika.apache.org/2.4.0/api/org/apache/tika/metadata/Metadata.html] and 
any fields with names prefixed by {{X-TIKA:}}, and all remaining fields 
would be sourced from document text.
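
A rough sketch of that filtering idea, assuming the metadata has been copied 
into a plain Map<String, String>. The class name is hypothetical, and the 
exclusion list holds only {{Content-Type}} as a stand-in; a real list would 
have to be built from the constants documented on the Metadata page.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper (not a Tika class): keeps only the metadata fields that
// are assumed to come from document text and are therefore worth indexing.
final class SearchableMetadataFilter {
    // Assumption: an illustrative, incomplete list of non-document fields.
    private static final Set<String> EXCLUDED_NAMES = Set.of("Content-Type");
    // Fields Tika adds about its own processing share this prefix.
    private static final String TIKA_PREFIX = "X-TIKA:";

    private SearchableMetadataFilter() {}

    // Returns a copy of the metadata without excluded names or X-TIKA: fields,
    // preserving the iteration order of the input map.
    static Map<String, String> searchableFields(Map<String, String> metadata) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            String name = entry.getKey();
            if (name.startsWith(TIKA_PREFIX) || EXCLUDED_NAMES.contains(name)) {
                continue;
            }
            result.put(name, entry.getValue());
        }
        return result;
    }
}
```

Whether every remaining field really is document text is exactly the open 
question above, so the exclusion list is the part most likely to need tuning.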

> message/rfc822 does not include Headers in extracted text
> -
>
> Key: TIKA-3768
> URL: https://issues.apache.org/jira/browse/TIKA-3768
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.4.0
>Reporter: Sam Stephens
>Priority: Major
> Attachments: email.txt
>
>
> When running AutoDetectParser on message/rfc822 structured text documents, 
> such as the attached [^email.txt], the extracted text does not include any of 
> the headers, such as the Subject and From and To lines.
> However, these lines contain useful text I'd like to be able to extract. I'm 
> surprised it's not there, based on the include-everything bias I saw on 
> https://issues.apache.org/jira/browse/TIKA-3710.
> Interestingly, if I exclude org.apache.tika.parser.mail.RFC822Parser as a 
> parser, my debugging appears to show 
> org.apache.tika.parser.csv.TextAndCSVParser being used for parsing, and we 
> get the full text, but the returned content type is 'message/rfc822; 
> charset=windows-1252'.





[jira] [Commented] (TIKA-3769) md5 incorrectly detected as application/marc

2022-05-18 Thread Sam Stephens (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539138#comment-17539138
 ] 

Sam Stephens commented on TIKA-3769:


Thanks for the prompt fix!

> md5 incorrectly detected as application/marc
> 
>
> Key: TIKA-3769
> URL: https://issues.apache.org/jira/browse/TIKA-3769
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.4.0
>Reporter: Sam Stephens
>Priority: Major
> Fix For: 2.4.1
>
> Attachments: md5.txt
>
>
> When I parse the attached text document using AutoDetectParser, it's 
> incorrectly detected as application/marc with no text. As other md5s I 
> generated randomly were correctly detected as text, I'm guessing that the Marc 
> parser is using some kind of magic bytes to detect Marc files, and that this 
> file matches them as a false positive.





[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-18 Thread Sam Stephens (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539054#comment-17539054
 ] 

Sam Stephens commented on TIKA-3710:


{quote}The h1 isn't quite as unique as we might like, and maybe not as good as 
some of the other ones
{quote}
Honestly, I'm not so worried about the HTML fragment detection, because that's 
never going to be perfect. A bare text string without any HTML tags is 
technically an HTML fragment. In the modern world where people can and do 
define their own HTML tags, you _could_ say that any file that opens with a 
valid tag as defined by the W3C is HTML, but that feels open to false positives.

> HTML document detected incorrect as message/rfc822
> --
>
> Key: TIKA-3710
> URL: https://issues.apache.org/jira/browse/TIKA-3710
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 2.3.0
>Reporter: Sam Stephens
>Priority: Major
> Attachments: html-that-looks-like-rfc822.html
>
>
> I'm detecting content types and extracting text from documents using the 
> AutoDetectParser.
> I've received some documents that are HTML fragments generated from emails. 
> The documents are clearly HTML, not emails, but the AutoDetectParser gives me 
> the MIME type message/rfc822 and no text. I've attached an example.
> It looks like the presence of From:, Sent:, and Subject: at the beginning of 
> lines is why the documents are matching RFC822. However, I believe the 
> presence of HTML before these headers means the document is not valid RFC822.





[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-18 Thread Sam Stephens (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539051#comment-17539051
 ] 

Sam Stephens commented on TIKA-3710:


Is it valid for a message/rfc822 message to have a bunch of preamble like the 
HTML tags in my document before the headers? Is the answer that the RFC822 
detection here is too loose, and the non-header material at the beginning of my 
file should result in the message/rfc822 parser rejecting it?
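
To make the question concrete, here is a rough sketch of what a stricter rule 
could look like: treat the input as a candidate message only if its first 
non-blank line is a header field. This is not how Tika's detector works; the 
class name is made up, and the field-name pattern follows the printable-ASCII 
definition in RFC 5322 section 2.2.

```java
import java.util.regex.Pattern;

// Hypothetical check (not Tika's detection logic): a message must begin with
// a header field, so any HTML preamble before "From:" disqualifies the input.
final class StrictRfc822Check {
    // field-name ":" where field-name is printable US-ASCII without ':' or space.
    private static final Pattern HEADER_LINE =
            Pattern.compile("^[\\x21-\\x39\\x3B-\\x7E]+:.*$");

    private StrictRfc822Check() {}

    // Returns true only if the first non-blank line looks like a header field.
    // Leading blank lines are skipped, which is more lenient than the RFC.
    static boolean startsWithHeader(String text) {
        for (String line : text.split("\\r?\\n")) {
            if (line.trim().isEmpty()) {
                continue;
            }
            return HEADER_LINE.matcher(line).matches();
        }
        return false;
    }
}
```

A check like this would still accept odd first lines such as a namespaced tag 
with a colon and no spaces, so it is a heuristic rather than a validator.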






[jira] [Created] (TIKA-3769) md5 incorrectly detected as application/marc

2022-05-17 Thread Sam Stephens (Jira)
Sam Stephens created TIKA-3769:
--

 Summary: md5 incorrectly detected as application/marc
 Key: TIKA-3769
 URL: https://issues.apache.org/jira/browse/TIKA-3769
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 2.4.0
Reporter: Sam Stephens
 Attachments: md5.txt

When I parse the attached text document using AutoDetectParser, it's incorrectly 
detected as application/marc with no text. As other md5s I generated randomly 
were correctly detected as text, I'm guessing that the Marc parser is using some 
kind of magic bytes to detect Marc files, and that this file matches them as a 
false positive.





[jira] [Commented] (TIKA-3711) Image file names included in parsed Word Document text

2022-05-17 Thread Sam Stephens (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538528#comment-17538528
 ] 

Sam Stephens commented on TIKA-3711:


Thanks [~tallison], confirmed this is working for me as expected.

> Image file names included in parsed Word Document text
> --
>
> Key: TIKA-3711
> URL: https://issues.apache.org/jira/browse/TIKA-3711
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.3.0
>Reporter: Sam Stephens
>Priority: Major
> Fix For: 2.4.0
>
> Attachments: word-doc-with-image-from-word-365.docx, 
> word-doc-with-image.docx
>
>
> The attached Word document includes nothing but a single image. Running it 
> through the Tika 2.2.0 AutoDetectParser correctly returns null. Running it 
> through the Tika 2.3.0 AutoDetectParser returns the text:
> {{image1.png}}
>  





[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-17 Thread Sam Stephens (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538524#comment-17538524
 ] 

Sam Stephens commented on TIKA-3710:


Note that if I exclude org.apache.tika.parser.mail.RFC822Parser as a parser, my 
debugging appears to show org.apache.tika.parser.csv.TextAndCSVParser being 
used for parsing; we get the full raw text of the document, including HTML 
tags, and the returned content type is 'message/rfc822; charset=ISO-8859-1'.






[jira] [Updated] (TIKA-3768) message/rfc822 does not include Headers in extracted text

2022-05-17 Thread Sam Stephens (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sam Stephens updated TIKA-3768:
---
Description: 
When running AutoDetectParser on message/rfc822 structured text documents, such 
as the attached [^email.txt], the extracted text does not include any of the 
headers, such as the Subject and From and To lines.

However, these lines contain useful text I'd like to be able to extract. I'm 
surprised it's not there, based on the include-everything bias I saw on 
https://issues.apache.org/jira/browse/TIKA-3710.

Interestingly, if I exclude org.apache.tika.parser.mail.RFC822Parser as a 
parser, my debugging appears to show 
org.apache.tika.parser.csv.TextAndCSVParser being used for parsing, and we get 
the full text, but the returned content type is 'message/rfc822; 
charset=windows-1252'.

  was:
When running AutoDetectParser on message/rfc822 structured text documents, such 
as the attached [^email.txt], the extracted text does not include any of the 
headers, such as the Subject and From and To lines.

However, these lines contain useful text I'd like to be able to extract. I'm 
surprised it's not there, based on the include-everything bias I saw on 
https://issues.apache.org/jira/browse/TIKA-3710.







[jira] [Created] (TIKA-3768) message/rfc822 does not include Headers in extracted text

2022-05-17 Thread Sam Stephens (Jira)
Sam Stephens created TIKA-3768:
--

 Summary: message/rfc822 does not include Headers in extracted text
 Key: TIKA-3768
 URL: https://issues.apache.org/jira/browse/TIKA-3768
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 2.4.0
Reporter: Sam Stephens
 Attachments: email.txt

When running AutoDetectParser on message/rfc822 structured text documents, such 
as the attached [^email.txt], the extracted text does not include any of the 
headers, such as the Subject and From and To lines.

However, these lines contain useful text I'd like to be able to extract. I'm 
surprised it's not there, based on the include-everything bias I saw on 
https://issues.apache.org/jira/browse/TIKA-3710.





[jira] [Commented] (TIKA-3666) Detect and indicate file encrypted with Rights Management Service RMS/IRM

2022-04-12 Thread Sam Stephens (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521349#comment-17521349
 ] 

Sam Stephens commented on TIKA-3666:


It looks like [~4U6U57] and I have both provided POIFSViewer outputs for you. 
[~tallison] I hope this is useful for you. Thanks!

> Detect and indicate file encrypted with Rights Management Service RMS/IRM
> -
>
> Key: TIKA-3666
> URL: https://issues.apache.org/jira/browse/TIKA-3666
> Project: Tika
>  Issue Type: Improvement
>  Components: metadata
>Reporter: August Valera
>Priority: Major
> Attachments: poifsviewer.txt, sam-poifsviewer.txt
>
>
> Rights Management Service (RMS), implemented in MS Office as Information 
> Rights Management (IRM), allows organizations to set file permissions that 
> are stored within the file. In most cases, this will result in the file 
> getting a new extension (with a prefix p, such as {{.txt}} becoming 
> {{.ptxt}}), but in the case of MS Office and PDF files, which support 
> this natively, the implementation results in the file contents being 
> encrypted without any extension change. 
> h4. Current behavior
> Running such files through Tika produces results as if it were an empty file 
> run through {{DefaultParser}} and {{OfficeParser}}.
> h4. Expected behavior
> Extract more metadata about necessary permissions to view (if possible), and 
> throw {{EncryptedDocumentException}} as is the case with Office files 
> encrypted in the more traditional manner.
> Reference: 
> [https://docs.microsoft.com/en-us/azure/information-protection/rms-client/clientv2-admin-guide-file-types#supported-file-types-for-classification-and-protection]





[jira] [Updated] (TIKA-3666) Detect and indicate file encrypted with Rights Management Service RMS/IRM

2022-04-12 Thread Sam Stephens (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sam Stephens updated TIKA-3666:
---
Attachment: sam-poifsviewer.txt






[jira] [Commented] (TIKA-3666) Detect and indicate file encrypted with Rights Management Service RMS/IRM

2022-04-11 Thread Sam Stephens (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17520887#comment-17520887
 ] 

Sam Stephens commented on TIKA-3666:


[~4U6U57] did you have any luck sourcing sample files? I also want a fix for 
this, but unfortunately I have no way to provide a sample document, as I don't 
have access to the Microsoft products needed to create IRM files (the files I'm 
having issues with are sourced externally). Thanks for reporting this issue!

> Detect and indicate file encrypted with Rights Management Service RMS/IRM
> -
>
> Key: TIKA-3666
> URL: https://issues.apache.org/jira/browse/TIKA-3666
> Project: Tika
>  Issue Type: Improvement
>  Components: metadata
>Reporter: August Valera
>Priority: Major
>
> Rights Management Service (RMS), implemented in MS Office as Information 
> Rights Management (IRM), allows organizations to set file permissions that 
> are stored within the file. In most cases, this will result in the file 
> getting a new extension (with a prefix p, such as {{.txt}} becoming 
> {{.ptxt}}), but in the case of MS Office and PDF files, which support 
> this natively, the implementation results in the file contents being 
> encrypted without any extension change. 
> h4. Current behavior
> Running such files through Tika produces results as if it were an empty file 
> run through {{DefaultParser}} and {{OfficeParser}}.
> h4. Expected behavior
> Extract more metadata about necessary permissions to view (if possible), and 
> throw {{EncryptedDocumentException}} as is the case with Office files 
> encrypted in the more traditional manner.
> Reference: 
> [https://docs.microsoft.com/en-us/azure/information-protection/rms-client/clientv2-admin-guide-file-types#supported-file-types-for-classification-and-protection]





[jira] [Comment Edited] (TIKA-3711) Image file names included in parsed Word Document text

2022-04-05 Thread Sam Stephens (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517167#comment-17517167
 ] 

Sam Stephens edited comment on TIKA-3711 at 4/5/22 10:00 PM:
-

Regarding filenames, I don't think they will ever be semantically meaningful. I 
just created a document with Word 365 (uploaded as 
word-doc-with-image-from-word-365.docx), added a picture with the filename 
test-image.png, and the extracted filename is image.png. I think Word is 
creating non-interesting filenames.

As far as breaking other users, I'm raising this bug because this *is* a change 
in behavior that's broken me. I'm relying on the Tika 2.2.0 behavior where 
images are not part of the text.

Using the ToXMLContentHandler to look at the actual generated HTML, I think 
that's also going to be surprising behavior for most users.

{{}}
{{image.png}}
{{}}

My image actually has alt text; that's not included. And I think including the 
image file name as a header in the markup is going to be surprising to almost 
every user. It certainly doesn't match the source document (which has no 
headers, or visible text of any kind).

As an end user, what I'd like is the XHTML to be

{{}}

And the text from BodyContentHandler to not include the image at all. That way 
the text is the text, and if I have an interest in Image alt tags, I can 
operate on the XHTML.

If you wanted to include an option to provide text for the image, I don't think 
image filenames will ever be useful from Word; alt text is the right place 
semantically to be looking for a textual representation of an image.



> Image file names included in parsed Word Document text
> --
>
> Key: TIKA-3711
> URL: https://issues.apache.org/jira/browse/TIKA-3711
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 2.3.0
>Reporter: Sam Stephens
>Priority: Minor
> Attachments: word-doc-with-image-from-word-365.docx, 
> word-doc-with-image.docx
>
>
> The attached Word document includes nothing but a single image. Running it 
> through the Tika 2.2.0 AutoDetectParser correctly returns null. Running it 
> through the Tika 2.3.0 AutoDetectParser returns the text:
> {{image1.png}}
>  





[jira] [Comment Edited] (TIKA-3711) Image file names included in parsed Word Document text

2022-04-04 Thread Sam Stephens (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517167#comment-17517167
 ] 

Sam Stephens edited comment on TIKA-3711 at 4/5/22 12:25 AM:
-

Regarding filenames, I don't think they will ever be semantically meaningful. I 
just created a document with Word 365 (uploaded as 
word-doc-with-image-from-word-365.docx), added a picture with the filename 
test-image.png, and the extracted filename is image.png. I think Word is 
creating non-interesting filenames.

As far as breaking other users, I'm raising this bug because this *is* a change 
in behavior that's broken me. I'm relying on the Tika 2.2.0 behavior where 
images are not part of the text.

Using the ToXMLContentHandler to look at the actual generated HTML, I think 
that's also going to be surprising behavior for most users.

{{}}
{{image.png}}
{{}}

My image actually has alt text; that's not included. And I think including the 
image file name as a header in the markup is going to be surprising to almost 
every user. It certainly doesn't match the source document (which has no 
headers, or visible text of any kind).

As an end user, what I'd like is the XHTML to be

{{}}
{{}}

And the text from BodyContentHandler to not include the image at all. That way 
the text is the text, and if I have an interest in Image alt tags, I can 
operate on the XHTML.

If you wanted to include an option to provide text for the image, I don't think 
image filenames will ever be useful from Word; alt text is the right place 
semantically to be looking for a textual representation of an image.








[jira] [Commented] (TIKA-3711) Image file names included in parsed Word Document text

2022-04-04 Thread Sam Stephens (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517167#comment-17517167
 ] 

Sam Stephens commented on TIKA-3711:


Regarding filenames, I don't think they will ever be semantically meaningful. I 
just created a document with Word 365 (uploaded as 
word-doc-with-image-from-word-365.docx), added a picture with the filename 
test-image.png, and the extracted filename is still image1.png. I think Word is 
creating non-interesting filenames.

As far as breaking other users, I'm raising this bug because this *is* a change 
in behavior that's broken me. I'm relying on the Tika 2.2.0 behavior where 
images are not part of the text.

Using the ToXMLContentHandler to look at the actual generated HTML, I think 
that's also going to be surprising behavior for most users.

{{}}
{{image.png}}
{{}}

My image actually has alt text; that's not included. And I think including the 
image file name as a header in the markup is going to be surprising to almost 
every user. It certainly doesn't match the source document (which has no 
headers, or visible text of any kind).

As an end user, what I'd like is the XHTML to be

{{}}
{{}}

I'd also like the text from BodyContentHandler not to include the image at all. 
That way the text is the text, and if I have an interest in image alt text, I 
can operate on the XHTML.

If you wanted to include an option to provide text for the image, I don't think 
image filenames will ever be useful from Word; alt text is the right place 
semantically to be looking for a textual representation of an image.
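
To illustrate the "operate on the XHTML" idea, here is a minimal sketch that 
pulls alt text out of an XHTML string using only the JDK's XML parser. It 
assumes the XHTML carries images as img elements with an alt attribute, which 
is what I'm asking for rather than what Tika currently emits. The class and 
method names (AltTextExtractor, altTexts) are hypothetical and not part of Tika.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not a Tika class.
class AltTextExtractor {

    // Returns the non-empty alt attributes of every <img> element, in document order.
    // Assumes the input is well-formed XHTML without a DOCTYPE declaration.
    static List<String> altTexts(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Refuse DOCTYPE declarations so no external entities are resolved.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            // "*" matches img elements regardless of namespace.
            NodeList images = doc.getElementsByTagNameNS("*", "img");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < images.getLength(); i++) {
                String alt = ((Element) images.item(i)).getAttribute("alt");
                if (!alt.isEmpty()) {
                    result.add(alt);
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XHTML", e);
        }
    }
}
```

With this split, the plain text stays free of image artifacts, and a caller who 
cares about images can still recover their descriptions from the markup.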






[jira] [Updated] (TIKA-3711) Image file names included in parsed Word Document text

2022-04-04 Thread Sam Stephens (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sam Stephens updated TIKA-3711:
---
Attachment: word-doc-with-image-from-word-365.docx






[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-04-01 Thread Sam Stephens (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516142#comment-17516142
 ] 

Sam Stephens commented on TIKA-3710:


The HTML document is exactly what you see there; these documents are fragments, 
not full HTML documents. However, I did try wrapping the fragment in the outer 
tags needed to make it a full document, and it was still detected as 
message/rfc822.

> HTML document detected incorrect as message/rfc822
> --
>
> Key: TIKA-3710
> URL: https://issues.apache.org/jira/browse/TIKA-3710
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 2.3.0
>Reporter: Sam Stephens
>Priority: Major
> Attachments: html-that-looks-like-rfc822.html
>
>
> I'm detecting content types and extracting text from documents using the 
> AutoDetectParser.
> I've received some documents that are HTML fragments generated from emails. 
> The documents are clearly HTML, not emails, but the AutoDetectParser gives me 
> the MIME type message/rfc822 and no text. I've attached an example.
> It looks like the presence of From:, Sent:, and Subject: at the beginning of 
> lines is why the documents are matching RFC822. However, I believe the 
> presence of HTML before these headers means the document is not valid RFC822.





[jira] [Commented] (TIKA-3711) Image file names included in parsed Word Document text

2022-04-01 Thread Sam Stephens (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516141#comment-17516141
 ] 

Sam Stephens commented on TIKA-3711:


I guess the question is what are the semantics of this operation? When I ask 
for the text of a document, what does that actually mean?

As an end user, I'd argue the most useful semantics are that getting the text 
of a document provides the closest possible representation of the text a user 
would read when viewing the document in its native format.

By this argument, the image filenames should not be there, because I wouldn't 
see image filenames if I were reading the Word document from within Word.

 

I don't think more information for its own sake is necessarily good. Taken to 
its logical extreme (a reductio ad absurdum), that argument says adding text 
describing all document formatting would be useful: the words "Heading 1" each 
time there's a heading, "Bold" and "Unbold" around each bolded section. 
This is clearly more information, but it's also clear that adding it would 
rapidly make the text you extracted from a Word document unusable.

 

From an end-user perspective, I'm using this text extraction so I can put 
documents in a search index. Having the terms "image1", "image2", etc. show up 
in my index for documents that contain images is not useful behavior, unless 
those terms actually occur in the real text of the document.

The image filenames are metadata. If I wanted that metadata, I can engage with 
the full XHTML representation of the document to get it. But my take is that 
BodyContentHandler should give me text, not metadata.
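
As a stopgap on my side, the extracted text could be post-processed before 
indexing. The sketch below is only a workaround under stated assumptions: it 
assumes the unwanted names always sit on their own line and follow the 
imageN.ext pattern seen so far (image1.png), and the class name, method name, 
and extension list are my own guesses, not anything Tika defines.

```java
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical post-processing helper, not a Tika class.
class ImageNameFilter {

    // Assumed shape of Word-generated image names, e.g. image1.png.
    // The extension list is a guess and may need extending.
    private static final Pattern GENERATED_IMAGE_NAME = Pattern.compile(
            "image\\d+\\.(png|jpe?g|gif|bmp|emf|wmf|tiff?)",
            Pattern.CASE_INSENSITIVE);

    // Drops lines that consist solely of a generated image name.
    // Lines that merely mention such a name inside other text are kept.
    static String stripGeneratedImageNames(String text) {
        return text.lines()
                .filter(line -> !GENERATED_IMAGE_NAME.matcher(line.trim()).matches())
                .collect(Collectors.joining("\n"));
    }
}
```

The obvious weakness is that a document whose real text contains a line reading 
only "image1.png" would lose that line, which is part of why I'd rather the 
names never entered the text in the first place.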

 

> Image file names included in parsed Word Document text
> --
>
> Key: TIKA-3711
> URL: https://issues.apache.org/jira/browse/TIKA-3711
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 2.3.0
>Reporter: Sam Stephens
>Priority: Minor
> Attachments: word-doc-with-image.docx
>
>
> The attached Word document includes nothing but a single image. Running it 
> through the Tika 2.2.0 AutoDetectParser correctly returns null. Running it 
> through the Tika 2.3.0 AutoDetectParser returns the text:
> {{image1.png}}
>  





[jira] [Updated] (TIKA-3711) Image file names included in parsed Word Document text

2022-03-31 Thread Sam Stephens (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sam Stephens updated TIKA-3711:
---
Description: 
The attached Word document includes nothing but a single image. Running it 
through the Tika 2.2.0 AutoDetectParser correctly returns null. Running it 
through the Tika 2.3.0 AutoDetectParser returns the text:

{{image1.png}}

 

  was:
The attached Word document includes nothing but a single image. Running it 
through the Tika 2.2.0 AutoDetectParser correctly returns no text. Running it 
through the Tika 2.3.0 AutoDetectParser returns the text:


{{image1.png}}

 


> Image file names included in parsed Word Document text
> --
>
> Key: TIKA-3711
> URL: https://issues.apache.org/jira/browse/TIKA-3711
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.3.0
>Reporter: Sam Stephens
>Priority: Major
> Attachments: word-doc-with-image.docx
>
>
> The attached Word document includes nothing but a single image. Running it 
> through the Tika 2.2.0 AutoDetectParser correctly returns null. Running it 
> through the Tika 2.3.0 AutoDetectParser returns the text:
> {{image1.png}}
>  





[jira] [Created] (TIKA-3711) Image file names included in parsed Word Document text

2022-03-31 Thread Sam Stephens (Jira)
Sam Stephens created TIKA-3711:
--

 Summary: Image file names included in parsed Word Document text
 Key: TIKA-3711
 URL: https://issues.apache.org/jira/browse/TIKA-3711
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 2.3.0
Reporter: Sam Stephens
 Attachments: word-doc-with-image.docx

The attached Word document includes nothing but a single image. Running it 
through the Tika 2.2.0 AutoDetectParser correctly returns no text. Running it 
through the Tika 2.3.0 AutoDetectParser returns the text:


{{image1.png}}

 





[jira] [Created] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-03-31 Thread Sam Stephens (Jira)
Sam Stephens created TIKA-3710:
--

 Summary: HTML document detected incorrect as message/rfc822
 Key: TIKA-3710
 URL: https://issues.apache.org/jira/browse/TIKA-3710
 Project: Tika
  Issue Type: Bug
  Components: detector
Affects Versions: 2.3.0
Reporter: Sam Stephens
 Attachments: html-that-looks-like-rfc822.html

I'm detecting content types and extracting text from documents using the 
AutoDetectParser.

I've received some documents that are HTML fragments generated from emails. The 
documents are clearly HTML, not emails, but the AutoDetectParser gives me the 
MIME type message/rfc822 and no text. I've attached an example.

It looks like the presence of From:, Sent:, and Subject: at the beginning of 
lines is why the documents are matching RFC822. However, I believe the presence 
of HTML before these headers means the document is not valid RFC822.
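
To make the observation concrete, here is a rough sketch of the check I have in 
mind: find the first line that starts with one of those header names and see 
whether anything that looks like an HTML tag comes before it. This is not how 
Tika's detector works, and the class name, method name, and regular expressions 
are my own assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical heuristic, not part of Tika's detection logic.
class MarkupBeforeHeaders {

    // A line beginning with one of the header names mentioned above.
    private static final Pattern HEADER_LINE =
            Pattern.compile("^(From|Sent|Subject):", Pattern.MULTILINE);

    // Crude approximation of an HTML tag, comment, or doctype.
    private static final Pattern HTML_TAG = Pattern.compile("<[A-Za-z!/][^>]*>");

    // True when an HTML-looking tag appears before the first header-like line.
    // Returns false when there is no header-like line at all.
    static boolean hasMarkupBeforeHeaders(String content) {
        Matcher header = HEADER_LINE.matcher(content);
        if (!header.find()) {
            return false;
        }
        String prefix = content.substring(0, header.start());
        return HTML_TAG.matcher(prefix).find();
    }
}
```

A true result here would suggest the header-like lines are part of an HTML body 
rather than the start of an email message.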


