[ 
https://issues.apache.org/jira/browse/TIKA-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17540595#comment-17540595
 ] 

Luís Filipe Nassif commented on TIKA-3771:
------------------------------------------

Great, I'll commit with a unit test later today or early tomorrow, thank you!

> Regression from TIKA-3687: Files wrongly detected as EML 
> ---------------------------------------------------------
>
>                 Key: TIKA-3771
>                 URL: https://issues.apache.org/jira/browse/TIKA-3771
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 2.4.0
>            Reporter: Luís Filipe Nassif
>            Priority: Major
>         Attachments: BEA498353ECFA1C440365BB434BBC228269917D7.png
>
>
> While running regression tests as part of upgrading from Tika 1.x to 2.4.0, 
> I found that several hundred samples, out of a corpus of 1M files of 
> different types, are now detected as EML. This is caused by the 
> <match value="\nX-" type="string" offset="0:1024"/> rule that TIKA-3687 
> added to the minShouldMatch="2" clause. The attached sample PNG file 
> triggers it because it also contains a \nDate: value in its first 1024 bytes.
> On an unrelated note, I tried to override the message/rfc822 mime 
> definition with a custom-tika-mimetypes.xml on the classpath, but it had no 
> effect. This used to work in Tika 1.x. Was that change intentional? I think 
> user definitions should take precedence over Tika's definitions, since they 
> can vary by domain or context (e.g. the same extension may be used by 
> different applications). If it wasn't intentional, I'll open another issue.
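
For illustration only, here is a minimal plain-Java sketch of why a
threshold of two short header-like patterns can fire on a binary file. It
does not use Tika or reproduce its detector. The class and method names are
hypothetical, only the two patterns mentioned in the report are checked, the
window is approximated as the first 1024 bytes, and the PNG metadata is made up.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EmlFalsePositiveSketch {

    // The two header-like patterns named in the report. The real clause in
    // tika-mimetypes.xml may list more; only these two are assumed here.
    static final String[] PATTERNS = {"\nX-", "\nDate:"};

    // Simplification: look only at the first 1024 bytes, as the report does.
    static final int WINDOW = 1024;

    /** Counts how many of the patterns occur at least once in the window. */
    static int countMatches(byte[] data) {
        byte[] head = Arrays.copyOf(data, Math.min(data.length, WINDOW));
        // ISO-8859-1 maps every byte to exactly one char, so binary data survives.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        int matches = 0;
        for (String pattern : PATTERNS) {
            if (text.contains(pattern)) {
                matches++;
            }
        }
        return matches;
    }

    /** Mimics a minShouldMatch-style threshold over the patterns. */
    static boolean looksLikeEml(byte[] data, int minShouldMatch) {
        return countMatches(data) >= minShouldMatch;
    }

    public static void main(String[] args) {
        // PNG signature followed by made-up text metadata (hypothetical content).
        byte[] signature = {(byte) 0x89, 'P', 'N', 'G', '\r', '\n', 0x1A, '\n'};
        byte[] metadata = "tEXt\nDate: 2022-05-01\nX-Generator: example"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] png = Arrays.copyOf(signature, signature.length + metadata.length);
        System.arraycopy(metadata, 0, png, signature.length, metadata.length);

        System.out.println("matches = " + countMatches(png));
        System.out.println("looks like EML = " + looksLikeEml(png, 2));
    }
}
```

With both patterns present in the window, the sketch reports two matches and
the threshold of 2 is met, even though the data starts with a PNG signature.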



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
