[ https://issues.apache.org/jira/browse/TIKA-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Luís Filipe Nassif updated TIKA-3771:
-------------------------------------
    Description:
While running regression tests as part of upgrading from Tika 1.x to Tika-2.4.0, I found that hundreds of samples of various file types are now being detected as EML. This is caused by the <match value="\nX-" type="string" offset="0:1024"/> rule added in TIKA-3687 to the minShouldMatch="2" clause. Attached is a sample PNG file that triggers this (it also contains a \nDate: value in the first 1024 bytes).

On an unrelated note, I tried to override the message/rfc822 mime definition with a custom-tika-mimetypes.xml on the classpath, but it had no effect; this used to work in Tika-1.x. Was that change intentional? I think user definitions should take precedence over Tika definitions, since they can change depending on domain or context (e.g. the same extension may be used by different applications). If it wasn't intentional, I'll open another issue.

  was:
While running regression tests as part of upgrading from Tika 1.x to Tika-2.4.0, I found that hundreds of samples of various file types are now being detected as EML. This is caused by the <match value="\nX-" type="string" offset="0:1024"/> rule added in TIKA-3687 to the minShouldMatch="2" clause. Attached is a sample PNG file that triggers this (it also contains a \nDate: value in the first 1024 bytes).

On an unrelated note, I tried to override the message/rfc822 mime definition with a custom-tika-mimetypes.xml on the classpath, but it had no effect; this used to work in Tika-1.x. Was that change intentional? I think user definitions should take precedence over Tika definitions, since they can change depending on domain or context (e.g. the same extension may be used by different applications).

> Regression from TIKA-3687: Files wrongly detected as EML
> ---------------------------------------------------------
>
>                 Key: TIKA-3771
>                 URL: https://issues.apache.org/jira/browse/TIKA-3771
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 2.4.0
>            Reporter: Luís Filipe Nassif
>            Priority: Major
>         Attachments: BEA498353ECFA1C440365BB434BBC228269917D7.png
>
>
> While running regression tests as part of upgrading from Tika 1.x to
> Tika-2.4.0, I found that hundreds of samples of various file types are now
> being detected as EML. This is caused by the <match value="\nX-" type="string"
> offset="0:1024"/> rule added in TIKA-3687 to the minShouldMatch="2" clause.
> Attached is a sample PNG file that triggers this (it also contains a \nDate:
> value in the first 1024 bytes).
> On an unrelated note, I tried to override the message/rfc822 mime
> definition with a custom-tika-mimetypes.xml on the classpath, but it had no
> effect; this used to work in Tika-1.x. Was that change intentional? I think
> user definitions should take precedence over Tika definitions, since they can
> change depending on domain or context (e.g. the same extension may be used by
> different applications). If it wasn't intentional, I'll open another issue.

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
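
For illustration, the false positive reported above can be reproduced in miniature without Tika. The sketch below is not Tika's detector: the class MinShouldMatchSketch, its methods, and the synthetic PNG-like bytes are hypothetical, and the pattern list contains only the two values named in the report ("\nX-" and "\nDate:"). It approximates offset="0:1024" by searching the first 1024 bytes, and shows how two short header-like strings inside a binary file are enough to reach a threshold of 2.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

public class MinShouldMatchSketch {

    // Assumption: only the two patterns mentioned in the report are modeled here.
    static final List<String> PATTERNS = List.of("\nX-", "\nDate:");

    // Simplified stand-ins for offset="0:1024" and minShouldMatch="2".
    static final int WINDOW = 1024;
    static final int MIN_SHOULD_MATCH = 2;

    // Counts how many distinct patterns occur in the first WINDOW bytes.
    static int countMatches(byte[] data) {
        int len = Math.min(data.length, WINDOW);
        // ISO-8859-1 maps every byte to exactly one char, so binary data is searchable.
        String head = new String(data, 0, len, StandardCharsets.ISO_8859_1);
        int hits = 0;
        for (String pattern : PATTERNS) {
            if (head.contains(pattern)) {
                hits++;
            }
        }
        return hits;
    }

    // True when enough header-like patterns are present to satisfy the threshold.
    static boolean looksLikeEml(byte[] data) {
        return countMatches(data) >= MIN_SHOULD_MATCH;
    }

    // Hypothetical sample: a PNG signature followed by text that happens to
    // contain both header-like strings (not the file attached to the issue).
    static byte[] fakePng() {
        byte[] signature = {(byte) 0x89, 'P', 'N', 'G', '\r', '\n', 0x1A, '\n'};
        byte[] text = "tEXtComment\nX-Resolution: 72\nDate: 2022-05-01\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] out = new byte[signature.length + text.length];
        System.arraycopy(signature, 0, out, 0, signature.length);
        System.arraycopy(text, 0, out, signature.length, text.length);
        return out;
    }

    public static void main(String[] args) {
        // The PNG-like bytes satisfy the threshold even though they are not an email.
        System.out.println(looksLikeEml(fakePng()));
    }
}
```

The point of the sketch is only that very short, generic patterns combined with a low threshold match easily inside unrelated binary content; how Tika actually weighs this clause against other magic (such as the PNG signature) is not modeled here.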