[ https://issues.apache.org/jira/browse/TIKA-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539574#comment-17539574 ]
Tim Allison edited comment on TIKA-3710 at 5/19/22 2:25 PM:
------------------------------------------------------------

Sorry, that comment must have referred to the patterns in that block that allowed content before the html tags. The patterns currently require the {{<h1}} etc as the first character. We could move the patterns that require a match as the first character to a different block with a higher priority?

bq. Is it valid for a message/rfc822 message to have a bunch of preamble like the HTML tags in my document before the headers?

My memory is that we've seen some crazy headers before the usual rfc822 headers. I do not think we've seen html tags in those.


was (Author: talli...@mitre.org):
Sorry, that comment must have referred to the patterns in that block that allowed content before the html tags. The patterns currently require the {{<h1}} etc as the first character. We could move the patterns that require a match as the first character to a different block with a higher priority?

> Is it valid for a message/rfc822 message to have a bunch of preamble like the HTML tags in my document before the headers?

My memory is that we've seen some crazy headers before the usual rfc822 headers. I do not think we've seen html tags in those.

> HTML document detected incorrect as message/rfc822
> --------------------------------------------------
>
>                 Key: TIKA-3710
>                 URL: https://issues.apache.org/jira/browse/TIKA-3710
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>    Affects Versions: 2.3.0
>            Reporter: Sam Stephens
>            Priority: Major
>         Attachments: html-that-looks-like-rfc822.html
>
>
> I'm detecting content types and extracting text from documents using the
> AutoDetectParser.
> I've received some documents that are HTML fragments generated from emails.
> The documents are clearly HTML, not emails, but the AutoDetectParser gives me
> the MIME type message/rfc822 and no text. I've attached an example.
> It looks like the presence of From:, Sent:, and Subject: at the beginning of
> lines is why the documents are matching RFC822. However, I believe the
> presence of HTML before these headers means the document is not valid RFC822.
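
For illustration only, here is a rough sketch of the "separate block with a higher priority" idea from the comment above. It is not Tika code: the class, the helper names, the tag list and the priority numbers (60 and 50) are all made up for the example, and the real rules live in Tika's MIME magic definitions. The point is just that a rule anchored at the first character can outrank the looser header-line rule when both match.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Predicate;

// Hypothetical sketch; none of these names or numbers come from Tika.
final class PriorityDetectSketch {

    // One detection rule: a MIME type, a priority, and a test on the content.
    static final class Rule {
        final String mimeType;
        final int priority;
        final Predicate<String> matches;

        Rule(String mimeType, int priority, Predicate<String> matches) {
            this.mimeType = mimeType;
            this.priority = priority;
            this.matches = matches;
        }
    }

    // True if the content starts, at offset 0, with one of the tags (case-insensitive).
    static boolean startsWithTag(String content, String... tags) {
        String head = content.toLowerCase(Locale.ROOT);
        for (String tag : tags) {
            if (head.startsWith(tag)) {
                return true;
            }
        }
        return false;
    }

    // True if any line begins with one of the mail-style header names from the report.
    static boolean hasMailHeaderLine(String content) {
        for (String line : content.split("\\R")) {
            if (line.startsWith("From:") || line.startsWith("Sent:")
                    || line.startsWith("Subject:")) {
                return true;
            }
        }
        return false;
    }

    // Assumed priorities: the first-character HTML rule sits above the rfc822 rule.
    static final List<Rule> RULES = Arrays.asList(
            new Rule("text/html", 60, c -> startsWithTag(c, "<html", "<head", "<body", "<h1")),
            new Rule("message/rfc822", 50, PriorityDetectSketch::hasMailHeaderLine));

    // Returns the MIME type of the highest-priority matching rule, or a generic fallback.
    static String detect(String content) {
        String best = "application/octet-stream";
        int bestPriority = Integer.MIN_VALUE;
        for (Rule rule : RULES) {
            if (rule.priority > bestPriority && rule.matches.test(content)) {
                best = rule.mimeType;
                bestPriority = rule.priority;
            }
        }
        return best;
    }
}
```

With these assumed rules, a fragment that opens with `<h1` and later has `From:` and `Subject:` lines comes back as text/html, while a message that begins directly with `From:` still comes back as message/rfc822. Content with leading junk before the tag would not hit the first-character rule, which is the trade-off the comment is weighing.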