[ https://issues.apache.org/jira/browse/TIKA-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539993#comment-17539993 ]
Nick Burch commented on TIKA-3771:
----------------------------------

The PNG magic is priority 50, which is also the priority of our EML min-match 2 clause. That's probably fine for most of the patterns, but \nX- seems too general.

I think we probably need to lower the priority on the 0:1024 cases, though I'm not sure we can do that without moving that whole block down?

FWIW your PNG matches because it has a URL followed by a bunch of HTTP response headers at the end of it!

> Regression from TIKA-3687: Files wrongly detected as EML
> ---------------------------------------------------------
>
>                 Key: TIKA-3771
>                 URL: https://issues.apache.org/jira/browse/TIKA-3771
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 2.4.0
>            Reporter: Luís Filipe Nassif
>            Priority: Major
>         Attachments: BEA498353ECFA1C440365BB434BBC228269917D7.png
>
>
> While running regression tests as part of upgrading from Tika 1.x to
> Tika 2.4.0, I found that some hundreds of samples, out of 1M files of
> different types, are now being detected as EML. This is caused by the
> <match value="\nX-" type="string" offset="0:1024"/> rule added in
> TIKA-3687 to the minShouldMatch="2" clause. Attached is a sample PNG file
> that triggers this (it also has a \nDate: value in the first 1024 bytes).
>
> Another, unrelated thing: I tried to override the message/rfc822 mime
> definition with a custom-tika-mimetypes.xml on the classpath, but it had
> no effect. This used to work in Tika 1.x. Was that change intentional? I
> think user definitions should take precedence over Tika definitions, since
> they can change depending on domain or context (e.g. the same extension
> may be used by different applications). If it wasn't intentional, I'll
> open another issue.

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
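
For illustration only, here is a minimal Python sketch of why a minShouldMatch="2" clause scanning offset 0:1024 can fire on a PNG like the one described above. It is a simplified stand-in, not Tika's actual detector code: the function names, the pattern list beyond \nX- and \nDate:, and the synthetic sample bytes are all assumptions made for the example.

```python
# Simplified stand-in for a minShouldMatch clause; NOT Tika's real implementation.
WINDOW = 1024  # mirrors offset="0:1024": a match may start at any offset 0..1024

# Hypothetical pattern list. Only "\nX-" and "\nDate:" are named in the report;
# the other two are assumed here purely for illustration.
EML_PATTERNS = [b"\nX-", b"\nDate:", b"\nFrom:", b"\nSubject:"]


def count_matches(data: bytes, patterns=EML_PATTERNS, window=WINDOW) -> int:
    """Count how many patterns start somewhere within the first `window` bytes."""
    hits = 0
    for pattern in patterns:
        idx = data.find(pattern)  # first occurrence, or -1 if absent
        if idx != -1 and idx <= window:
            hits += 1
    return hits


def looks_like_eml(data: bytes, min_should_match: int = 2) -> bool:
    """True when at least `min_should_match` header patterns are found."""
    return count_matches(data) >= min_should_match


PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

# Synthetic sample (not the attached file): a PNG signature followed by a URL
# and a couple of HTTP response headers, all inside the first 1024 bytes.
fake_png = (
    PNG_SIGNATURE
    + b"http://example.com/image.png\r\n"
    + b"HTTP/1.1 200 OK\r\n"
    + b"Date: Sat, 01 Jan 2022 00:00:00 GMT\r\n"
    + b"X-Cache: HIT\r\n"
)

# A PNG signature followed by binary padding, with no header-like text.
plain_png = PNG_SIGNATURE + b"\x00" * 64

if __name__ == "__main__":
    print(looks_like_eml(fake_png))   # HTTP headers satisfy the two-pattern rule
    print(looks_like_eml(plain_png))  # no header-like lines present
```

In this sketch the synthetic sample satisfies the rule through \nDate: and \nX- alone. When both candidate types sit at the same priority, a match this generic is enough to compete with the PNG magic.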