Luís Filipe Nassif created TIKA-3596:
----------------------------------------

             Summary: Detect corrupted XML files as application/xml instead of 
text/plain
                 Key: TIKA-3596
                 URL: https://issues.apache.org/jira/browse/TIKA-3596
             Project: Tika
          Issue Type: Improvement
          Components: detector
    Affects Versions: 2.1.0, 1.27
            Reporter: Luís Filipe Nassif


There is logic in the MimeTypes class that returns text/plain for corrupted 
XML files that are not detected as text/html: 
https://github.com/apache/tika/blob/324f2f2ccff21c608969e2e79da88e71379a58dc/tika-core/src/main/java/org/apache/tika/mime/MimeTypes.java#L281

I think this should be changed to return application/xml even if the file is 
corrupted, as is done for all other MIME types, which would be more consistent 
across file formats. Even if a jpg or doc file is corrupted, image/jpeg or 
application/msword is still returned.
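
To make the difference concrete, here is a minimal sketch of the two 
behaviours. It is not the Tika code: it uses only the JDK's StAX parser, the 
"<?xml" prefix check merely stands in for Tika's magic matching, and the names 
XmlFallbackSketch, rootElement, detect and keepXmlWhenCorrupted are 
hypothetical, chosen for illustration only.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration only; not part of Tika.
class XmlFallbackSketch {

    /** Returns the root element's local name, or null if it cannot be parsed. */
    static String rootElement(byte[] data) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Avoid resolving DTDs or external entities on untrusted input.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);
        try {
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new ByteArrayInputStream(data));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        return reader.getLocalName();
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException | RuntimeException e) {
            // Malformed before the root element: treated as "no root found".
        }
        return null;
    }

    /**
     * Assumption: a leading "<?xml" stands in for a magic match on
     * application/xml. keepXmlWhenCorrupted=false mimics the downgrade to
     * text/plain described above; true mimics the proposed behaviour.
     */
    static String detect(byte[] data, boolean keepXmlWhenCorrupted) {
        String head = new String(data, StandardCharsets.ISO_8859_1);
        if (!head.startsWith("<?xml")) {
            return "text/plain";
        }
        if (rootElement(data) != null) {
            return "application/xml";
        }
        return keepXmlWhenCorrupted ? "application/xml" : "text/plain";
    }

    public static void main(String[] args) {
        // An element name starting with a digit is not well-formed XML.
        byte[] bad = "<?xml version=\"1.0\"?><1bad>".getBytes(StandardCharsets.UTF_8);
        System.out.println("current:  " + detect(bad, false));
        System.out.println("proposed: " + detect(bad, true));
    }
}
```

In this sketch only the last line of detect differs between the two 
behaviours, which is roughly the size of change I have in mind; the real 
patch would of course be made against MimeTypes itself.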

In an internal corpus, about 2k of ~90k XML files trigger this.

If other fellow devs agree, I can submit a patch and unit test.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
