[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-04-01 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17515921#comment-17515921
 ] 

Tim Allison commented on TIKA-3710:
---

Did the original html file actually have an html header?  Or did it literally 
start at ?

> HTML document detected incorrect as message/rfc822
> --------------------------------------------------
>
>                 Key: TIKA-3710
>                 URL: https://issues.apache.org/jira/browse/TIKA-3710
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>    Affects Versions: 2.3.0
>            Reporter: Sam Stephens
>            Priority: Major
>         Attachments: html-that-looks-like-rfc822.html
>
>
> I'm detecting content types and extracting text from documents using the 
> AutoDetectParser.
> I've received some documents that are HTML fragments generated from emails. 
> The documents are clearly HTML, not emails, but the AutoDetectParser gives me 
> the MIME type message/rfc822 and no text. I've attached an example.
> It looks like the presence of From:, Sent:, and Subject: at the beginning of 
> lines is why the documents are matching RFC822. However, I believe the 
> presence of HTML before these headers means the document is not valid RFC822.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-04-01 Thread Sam Stephens (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516142#comment-17516142
 ] 

Sam Stephens commented on TIKA-3710:


The HTML document is exactly what you see there; these documents are fragments, 
not full HTML documents. However I did try wrapping the fragment in  and 
 tags to make a full document, and it was still detected as 
message/rfc822.



[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-17 Thread Sam Stephens (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538524#comment-17538524
 ] 

Sam Stephens commented on TIKA-3710:


Note that I exclude org.apache.tika.parser.mail.RFC822Parser as a parser. My 
debugging appears to show org.apache.tika.parser.csv.TextAndCSVParser being 
used for parsing: we get the full raw text of the document, including HTML 
tags, and the returned content type is 'message/rfc822; charset=ISO-8859-1'.



[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-18 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538885#comment-17538885
 ] 

Tim Allison commented on TIKA-3710:
---

As I look at our mime type for html, we do include {{h1}} at offset 0 as 
sufficient to identify html.  Here you have {{h2}}... hmmm... [~nick]?



[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-18 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538896#comment-17538896
 ] 

Nick Burch commented on TIKA-3710:
--

The h1 isn't quite as unique as we might like, and maybe not as good as some of 
the other ones.

How about changing that to {{<h1>}} or {{<h1 }} (that is, with a trailing 
space)?


[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-18 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538963#comment-17538963
 ] 

Tim Allison commented on TIKA-3710:
---

Thank you, [~nick]. I was being imprecise on {{h1}}; we actually do require 
{{<h1}} [...]


[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-18 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538974#comment-17538974
 ] 

Tim Allison commented on TIKA-3710:
---

The hiccup is this point in the mimetypes.xml file.

{noformat}



{noformat}

Even if the file starts with {{}}, we're identifying the file as 
{{message/rfc822}}.



[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-18 Thread Sam Stephens (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539051#comment-17539051
 ] 

Sam Stephens commented on TIKA-3710:


Is it valid for a message/rfc822 message to have a bunch of preamble like the 
HTML tags in my document before the headers? Is the answer that the RFC822 
detection here is too loose, and the non-header material at the beginning of my 
file should result in the message/rfc822 parser rejecting it?
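
For illustration, the stricter check being asked about can be pictured with a minimal Java sketch. This is not Tika's detector. It treats input as mail-like only when the first non-blank line is itself a header field. The class name, the method name, and the simplified header pattern are assumptions made here, and the sample strings are invented (they are not the attached file).

```java
import java.util.regex.Pattern;

public class HeaderStartCheck {
    // Simplified header field: a letter, then letters/digits/hyphens, then a colon.
    // This is an assumption for illustration, narrower than what RFC 5322
    // allows for field names.
    private static final Pattern HEADER_FIELD =
            Pattern.compile("^[A-Za-z][A-Za-z0-9-]*:");

    /** Returns true only if the first non-blank line looks like a header field. */
    public static boolean startsWithHeader(String text) {
        for (String line : text.split("\\R")) {
            if (line.trim().isEmpty()) {
                continue; // skip leading blank lines
            }
            return HEADER_FIELD.matcher(line).find();
        }
        return false; // empty input has no header
    }

    public static void main(String[] args) {
        // Hypothetical samples: a plain message and an HTML fragment with
        // header-like lines after a tag.
        String mail = "From: a@example.com\nSubject: hello\n\nbody";
        String fragment = "<h2>Forwarded message</h2>\nFrom: a@example.com\nSubject: hello";
        System.out.println(startsWithHeader(mail));
        System.out.println(startsWithHeader(fragment));
    }
}
```

A check this strict would also reject any file that carries unusual material before the usual headers, so it trades false positives for false negatives.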



[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-18 Thread Sam Stephens (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539054#comment-17539054
 ] 

Sam Stephens commented on TIKA-3710:


{quote}The h1 isn't quite as unique as we might like, and maybe not as good as 
some of the other ones
{quote}
Honestly, I'm not so worried about the HTML fragment detection, because that's 
never going to be perfect. A bare text string without any HTML tags is 
technically an HTML fragment. In the modern world where people can and do 
define their own HTML tags, you _could_ say that any file that opens with a 
valid tag as defined by the W3C is HTML, but that feels open to false positives.



[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-19 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539574#comment-17539574
 ] 

Tim Allison commented on TIKA-3710:
---

Sorry, that comment must have referred to the patterns in that block that 
allowed content before the html tags.  The patterns currently require the 
{{ [...]

>Is it valid for a message/rfc822 message to have a bunch of preamble like the 
>HTML tags in my document before the headers? 
My memory is that we've seen some crazy headers before the usual rfc822 
headers.  I do not think we've seen html tags in those.



[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-19 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539580#comment-17539580
 ] 

Tim Allison commented on TIKA-3710:
---

This works on the test file:

{noformat}

  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  




  
  
  
  
  
  
  
  
  
  

{noformat}



[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-19 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539582#comment-17539582
 ] 

Nick Burch commented on TIKA-3710:
--

I was thinking we'd do (open)h1(close) or (open)h1(space) to cover both HTML 
cases but reduce the chances of a false positive match (+h2/h3).
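
As a rough illustration of that proposal, here is a minimal Java sketch of the pattern: an opening bracket, h1, h2, or h3, and then either a closing bracket or whitespace, anchored at offset 0. The class and method names are hypothetical, and the regular expression is only one way to express the idea. It is not the entry that was committed to tika-mimetypes.xml.

```java
import java.util.regex.Pattern;

public class HeadingStart {
    // "<h1>", "<h2 ", "<h3\t", ... at the very start of the input.
    // Requiring '>' or whitespace after the digit avoids matching
    // longer names that merely begin with "h1".
    private static final Pattern HEADING =
            Pattern.compile("^<h[1-3][>\\s]", Pattern.CASE_INSENSITIVE);

    /** Returns true if the text begins with an h1, h2, or h3 start tag. */
    public static boolean startsWithHeading(String text) {
        return HEADING.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(startsWithHeading("<h2>Title</h2>"));
        System.out.println(startsWithHeading("<h1 class=\"x\">Title</h1>"));
        System.out.println(startsWithHeading("<h1x>not a heading"));
    }
}
```

Because the pattern is anchored at the start, a heading that appears later in the file does not count, which keeps the match narrow.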



[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-19 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539590#comment-17539590
 ] 

Tim Allison commented on TIKA-3710:
---

Sounds good.  What do you think of breaking those out into a higher priority 
block as above?  Obv, we'll need to run this on a bunch of docs to see if this 
is overall a good change...



[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-19 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539594#comment-17539594
 ] 

Nick Burch commented on TIKA-3710:
--

As a "normal" html file wouldn't start with these snippets, and they're already 
at a pretty high priority, I think just leave them in the 60 block along with 
the more typical starting tags we have there now



[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-05-19 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539607#comment-17539607
 ] 

Tim Allison commented on TIKA-3710:
---

The current main block is 40, which is intentionally below RFC822.

How's this look:

{noformat}

  
  




  
  
...
{noformat}
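
To make the priority argument concrete, here is a simplified Java model of priority-based magic matching. It is a sketch under stated assumptions, not Tika's implementation: every rule that matches is a candidate, and the one with the highest priority wins. The class, the rule patterns, the fallback type, and the priority of 50 for message/rfc822 are all assumptions for illustration. The thread only says that the main HTML block is at 40, below RFC822, and that there is a 60 block.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;

public class MagicPriority {
    /** One magic rule: a media type, a priority, and a content test. */
    static final class Rule {
        final String type;
        final int priority;
        final Predicate<String> matches;

        Rule(String type, int priority, Predicate<String> matches) {
            this.type = type;
            this.priority = priority;
            this.matches = matches;
        }
    }

    // Heading tag at offset 0 (hypothetical HTML fragment rule).
    private static final Pattern HEADING =
            Pattern.compile("^<h[1-3][>\\s]", Pattern.CASE_INSENSITIVE);
    // A mail-like header at the start of any line (hypothetical RFC822 rule).
    private static final Pattern MAIL_HEADER =
            Pattern.compile("^(From|Sent|Subject):", Pattern.MULTILINE);

    /** Picks the matching rule with the highest priority. */
    static String detect(String text, int headingPriority) {
        List<Rule> rules = Arrays.asList(
                new Rule("message/rfc822", 50, t -> MAIL_HEADER.matcher(t).find()),
                new Rule("text/html", headingPriority, t -> HEADING.matcher(t).find()));
        Rule best = null;
        for (Rule rule : rules) {
            if (rule.matches.test(text)
                    && (best == null || rule.priority > best.priority)) {
                best = rule;
            }
        }
        return best == null ? "application/octet-stream" : best.type;
    }

    public static void main(String[] args) {
        // Invented sample, not the attached test file.
        String sample = "<h2>Forwarded message</h2>\nFrom: a@example.com\nSubject: hello\n";
        System.out.println(detect(sample, 40)); // heading rule below the mail rule
        System.out.println(detect(sample, 60)); // heading rule above the mail rule
    }
}
```

In this model the same file flips from message/rfc822 to text/html once the heading rule sits above the mail rule, which is the effect of placing the new patterns in a higher-priority block.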



[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-06-01 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17544964#comment-17544964
 ] 

Tim Allison commented on TIKA-3710:
---

I just committed and pushed this.  Please let me know if there are any 
objections.  [~lfcnassif]?



[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-06-01 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545044#comment-17545044
 ] 

Hudson commented on TIKA-3710:
--

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #624 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/624/])
TIKA-3710 - update mime detection for rfc822, again. (tallison: 
[https://github.com/apache/tika/commit/a52e4d153e950077f7fdedadcc5d75604fe2563d])
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/resources/test-documents/testBrokenHTMLContainingRFC822.html
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/mime/TestMimeTypes.java
* (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml




[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-06-01 Thread Luís Filipe Nassif (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545156#comment-17545156
 ] 

Luís Filipe Nassif commented on TIKA-3710:
--

Seems good to me, [~tallison]!



[jira] [Commented] (TIKA-3710) HTML document detected incorrect as message/rfc822

2022-06-02 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545383#comment-17545383
 ] 

Tim Allison commented on TIKA-3710:
---

Thank you, [~lfcnassif]!
