Viorica Visan created TIKA-2443:
-----------------------------------

             Summary: Plain text file identified as rfc822 and which can cause 
StackOverflowError
                 Key: TIKA-2443
                 URL: https://issues.apache.org/jira/browse/TIKA-2443
             Project: Tika
          Issue Type: Bug
            Reporter: Viorica Visan


I have a file called test.txt, containing only:
Date:           06/25/2014 15:54:19
And some more text I am writing. This will
be detected as rfc822

Tika detects and parses this file as message/rfc822.
I think the magic rule on "Date: " is too strong; the file should have been
detected as plain text (text/plain). It looks to me like the reverse of
https://issues.apache.org/jira/browse/TIKA-879
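
To illustrate what I mean by "too strong", here is a minimal sketch in plain Java. It is not Tika's detection code: the class name, the method names, the list of header fields and the "at least two known headers" threshold are all assumptions made up for this example. It only contrasts a prefix-only check with a stricter check that looks at the whole leading header block.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class Rfc822Heuristic {
    // Hypothetical list of header names, chosen only for this illustration.
    private static final List<String> KNOWN_HEADERS =
            Arrays.asList("date", "from", "to", "subject", "message-id", "received");

    // Weak check: the content merely starts with "Date:".
    public static boolean startsWithDate(String content) {
        return content.startsWith("Date:");
    }

    // Stricter check (an assumption, not Tika's rule): require at least two
    // distinct known header fields in the block before the first blank line.
    public static boolean looksLikeMessage(String content) {
        Set<String> seen = new HashSet<>();
        for (String line : content.split("\r?\n")) {
            if (line.isEmpty()) {
                break; // a blank line ends the header block
            }
            if (line.startsWith(" ") || line.startsWith("\t")) {
                continue; // folded continuation of the previous header
            }
            int colon = line.indexOf(':');
            if (colon <= 0) {
                break; // not a "Name: value" line, so the header block is over
            }
            String name = line.substring(0, colon).toLowerCase(Locale.ROOT);
            if (KNOWN_HEADERS.contains(name)) {
                seen.add(name);
            }
        }
        return seen.size() >= 2;
    }

    public static void main(String[] args) {
        String testTxt = "Date:           06/25/2014 15:54:19\n"
                + "And some more text I am writing. This will\n"
                + "be detected as rfc822\n";
        System.out.println(startsWithDate(testTxt));   // prefix-only check matches
        System.out.println(looksLikeMessage(testTxt)); // stricter check does not
    }
}
```

With the test.txt content from above, the prefix-only check matches while the stricter check rejects the file, because the second line is not a header field.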

We noticed this issue because we have a large log file in which many lines
contain a Date, a Log level and a Message. It is parsed as message/rfc822 (only
because it starts with "Date:"), and parsing eventually throws a
StackOverflowError.


Is there a workaround to make this rule weaker, for example through configuration?
We use DefaultParser with all default settings. We are on Tika 1.11, but
we also tried Tika 1.16 and saw the same StackOverflowError (probably
again because the file was parsed as rfc822).
The only workaround that I found was to add a
custom-mimetypes.xml like this:
<?xml version="1.0" encoding="UTF-8"?>
<mime-info>
  <mime-type type="text/plain">
    <magic priority="70">
      <match value="Date:" type="string" offset="0"/>
    </magic>
  </mime-type>
</mime-info>
Would you recommend some other workaround to make sure the file does not get
parsed as rfc822?
I have another question: can this custom-mimetypes.xml be loaded from an
external location?

Many thanks.



