[jira] [Updated] (TIKA-3771) Regression from TIKA-3687: Files wrongly detected as EML

2022-05-20 Thread Jira


 [ 
https://issues.apache.org/jira/browse/TIKA-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luís Filipe Nassif updated TIKA-3771:
-
Description: 
Running regression tests while upgrading to Tika-2.4.0 from 1.x, I found that 
several hundred samples, out of 1M files of different types, are now being 
detected as EML. This is caused by the <match … type="string" offset="0:1024"/> 
rule added in TIKA-3687 in the minShouldMatch="2" clause. Attached is a sample 
PNG file that triggers this (it also has another \nDate: value in the first 
1024 bytes).

On an unrelated note, I tried to override the message/rfc822 MIME definition 
with a custom-tika-mimetypes.xml on the classpath, but it had no effect. It 
used to work in Tika-1.x. Was that change intentional? I think user 
definitions should take precedence over Tika definitions, since they can change 
depending on domain or context (e.g. the same extension may be used by 
different applications). If it wasn't intentional, I'll open another issue.

  was:
Running regression tests while upgrading to Tika-2.4.0 from 1.x, I found that 
several hundred samples, out of 1M files of different types, are now being 
detected as EML. This is caused by the <match … type="string" offset="0:1024"/> 
rule added in TIKA-3687 in the minShouldMatch="2" clause. Attached is a sample 
PNG file that triggers this (it also has another \nDate: value in the first 
1024 bytes).

On an unrelated note, I tried to override the message/rfc822 MIME definition 
with a custom-tika-mimetypes.xml on the classpath, but it had no effect, it 
used to work in Tika-1.x. Was that change intentional? I think user 
definitions should take precedence over Tika definitions, since they can change 
depending on domain or context (e.g. the same extension may be used by 
different applications). If it wasn't intentional, I'll open another issue.


> Regression from TIKA-3687: Files wrongly detected as EML 
> -
>
> Key: TIKA-3771
> URL: https://issues.apache.org/jira/browse/TIKA-3771
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.4.0
>Reporter: Luís Filipe Nassif
>Priority: Major
> Attachments: BEA498353ECFA1C440365BB434BBC228269917D7.png
>
>
> Running regression tests while upgrading to Tika-2.4.0 from 1.x, I found 
> that several hundred samples, out of 1M files of different types, are now 
> being detected as EML. This is caused by the 
> <match … type="string" offset="0:1024"/> rule added in TIKA-3687 in the 
> minShouldMatch="2" clause. Attached is a sample PNG file that triggers this 
> (it also has another \nDate: value in the first 1024 bytes).
> On an unrelated note, I tried to override the message/rfc822 MIME 
> definition with a custom-tika-mimetypes.xml on the classpath, but it had no 
> effect. It used to work in Tika-1.x. Was that change intentional? I think 
> user definitions should take precedence over Tika definitions, since they can 
> change depending on domain or context (e.g. the same extension may be used by 
> different applications). If it wasn't intentional, I'll open another issue.
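
To illustrate the kind of false positive described above, here is a minimal 
Python sketch of an "at least N of these strings within the first 1024 bytes" 
check. It is not Tika's implementation: the marker list, the function names, 
and the sample bytes are all assumptions made up for the example, and the 
sample is only a PNG signature followed by text, not a valid image.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

# Hypothetical header markers, chosen for illustration only; they are not
# copied from Tika's tika-mimetypes.xml.
HEADER_MARKERS = [b"\nDate:", b"\nSubject:", b"\nMessage-ID:"]


def count_markers(data, markers=HEADER_MARKERS, max_offset=1024):
    """Count markers that start at any offset in [0, max_offset]."""
    return sum(
        1 for m in markers if data.find(m, 0, max_offset + len(m)) != -1
    )


def naive_is_eml(data, min_should_match=2):
    """Mimic an 'at least N strings match' rule with no other checks."""
    return count_markers(data) >= min_should_match


# Not a valid PNG: just the 8-byte signature followed by embedded text,
# standing in for a metadata chunk that happens to contain header-like lines.
sample = PNG_SIGNATURE + b"tEXtComment\x00note\nDate: 2022-05-19\nSubject: scan\n"

print(naive_is_eml(sample))               # the naive rule fires on this sample
print(sample.startswith(PNG_SIGNATURE))   # yet the file starts with PNG magic
```

The point of the sketch is only that generic strings searched over a wide 
offset range can match inside binary files that carry embedded text, so two 
such hits are weak evidence on their own.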



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (TIKA-3771) Regression from TIKA-3687: Files wrongly detected as EML

2022-05-19 Thread Jira


 [ 
https://issues.apache.org/jira/browse/TIKA-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luís Filipe Nassif updated TIKA-3771:
-
Description: 
Running regression tests while upgrading to Tika-2.4.0 from 1.x, I found that 
several hundred samples, out of 1M files of different types, are now being 
detected as EML. This is caused by the <match … type="string" offset="0:1024"/> 
rule added in TIKA-3687 in the minShouldMatch="2" clause. Attached is a sample 
PNG file that triggers this (it also has another \nDate: value in the first 
1024 bytes).

On an unrelated note, I tried to override the message/rfc822 MIME definition 
with a custom-tika-mimetypes.xml on the classpath, but it had no effect, it 
used to work in Tika-1.x. Was that change intentional? I think user 
definitions should take precedence over Tika definitions, since they can change 
depending on domain or context (e.g. the same extension may be used by 
different applications). If it wasn't intentional, I'll open another issue.

  was:
Running regression tests while upgrading to Tika-2.4.0 from 1.x, I found that 
several hundred samples of different file types are now being detected as EML. 
This is caused by the <match … type="string" offset="0:1024"/> rule added in 
TIKA-3687 in the minShouldMatch="2" clause. Attached is a sample PNG file that 
triggers this (it also has another \nDate: value in the first 1024 bytes).

On an unrelated note, I tried to override the message/rfc822 MIME definition 
with a custom-tika-mimetypes.xml on the classpath, but it had no effect, it 
used to work in Tika-1.x. Was that change intentional? I think user 
definitions should take precedence over Tika definitions, since they can change 
depending on domain or context (e.g. the same extension may be used by 
different applications). If it wasn't intentional, I'll open another issue.


> Regression from TIKA-3687: Files wrongly detected as EML 
> -
>
> Key: TIKA-3771
> URL: https://issues.apache.org/jira/browse/TIKA-3771
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.4.0
>Reporter: Luís Filipe Nassif
>Priority: Major
> Attachments: BEA498353ECFA1C440365BB434BBC228269917D7.png
>
>
> Running regression tests while upgrading to Tika-2.4.0 from 1.x, I found 
> that several hundred samples, out of 1M files of different types, are now 
> being detected as EML. This is caused by the 
> <match … type="string" offset="0:1024"/> rule added in TIKA-3687 in the 
> minShouldMatch="2" clause. Attached is a sample PNG file that triggers this 
> (it also has another \nDate: value in the first 1024 bytes).
> On an unrelated note, I tried to override the message/rfc822 MIME 
> definition with a custom-tika-mimetypes.xml on the classpath, but it had no 
> effect, it used to work in Tika-1.x. Was that change intentional? I think 
> user definitions should take precedence over Tika definitions, since they can 
> change depending on domain or context (e.g. the same extension may be used by 
> different applications). If it wasn't intentional, I'll open another issue.





[jira] [Updated] (TIKA-3771) Regression from TIKA-3687: Files wrongly detected as EML

2022-05-19 Thread Jira


 [ 
https://issues.apache.org/jira/browse/TIKA-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luís Filipe Nassif updated TIKA-3771:
-
Description: 
Running regression tests while upgrading to Tika-2.4.0 from 1.x, I found that 
several hundred samples of different file types are now being detected as EML. 
This is caused by the <match … type="string" offset="0:1024"/> rule added in 
TIKA-3687 in the minShouldMatch="2" clause. Attached is a sample PNG file that 
triggers this (it also has another \nDate: value in the first 1024 bytes).

On an unrelated note, I tried to override the message/rfc822 MIME definition 
with a custom-tika-mimetypes.xml on the classpath, but it had no effect, it 
used to work in Tika-1.x. Was that change intentional? I think user 
definitions should take precedence over Tika definitions, since they can change 
depending on domain or context (e.g. the same extension may be used by 
different applications). If it wasn't intentional, I'll open another issue.

  was:
Running regression tests while upgrading to Tika-2.4.0 from 1.x, I found that 
several hundred samples of different file types are now being detected as EML. 
This is caused by the <match … type="string" offset="0:1024"/> rule added in 
TIKA-3687 in the minShouldMatch="2" clause. Attached is a sample PNG file that 
triggers this (it also has another \nDate: value in the first 1024 bytes).

On an unrelated note, I tried to override the message/rfc822 MIME definition 
with a custom-tika-mimetypes.xml on the classpath, but it had no effect, it 
used to work in Tika-1.x. Was that change intentional? I think user 
definitions should take precedence over Tika definitions, since they can change 
depending on domain or context (e.g. the same extension may be used by 
different applications). 


> Regression from TIKA-3687: Files wrongly detected as EML 
> -
>
> Key: TIKA-3771
> URL: https://issues.apache.org/jira/browse/TIKA-3771
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 2.4.0
>Reporter: Luís Filipe Nassif
>Priority: Major
> Attachments: BEA498353ECFA1C440365BB434BBC228269917D7.png
>
>
> Running regression tests while upgrading to Tika-2.4.0 from 1.x, I found 
> that several hundred samples of different file types are now being detected 
> as EML. This is caused by the <match … type="string" offset="0:1024"/> rule 
> added in TIKA-3687 in the minShouldMatch="2" clause. Attached is a sample 
> PNG file that triggers this (it also has another \nDate: value in the first 
> 1024 bytes).
> On an unrelated note, I tried to override the message/rfc822 MIME 
> definition with a custom-tika-mimetypes.xml on the classpath, but it had no 
> effect, it used to work in Tika-1.x. Was that change intentional? I think 
> user definitions should take precedence over Tika definitions, since they can 
> change depending on domain or context (e.g. the same extension may be used by 
> different applications). If it wasn't intentional, I'll open another issue.


