[
https://issues.apache.org/jira/browse/TIKA-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17447624#comment-17447624
]
Tim Allison commented on TIKA-3596:
-----------------------------------
I'm +1. Thank you!
> Detect corrupted XML files as application/xml instead of text/plain
> -------------------------------------------------------------------
>
> Key: TIKA-3596
> URL: https://issues.apache.org/jira/browse/TIKA-3596
> Project: Tika
> Issue Type: Improvement
> Components: detector
> Affects Versions: 1.27, 2.1.0
> Reporter: Luís Filipe Nassif
> Assignee: Luís Filipe Nassif
> Priority: Minor
> Attachments: test.xyz
>
>
> The MimeTypes class contains logic that returns text/plain for corrupted XML
> files that are not detected as text/html:
> https://github.com/apache/tika/blob/324f2f2ccff21c608969e2e79da88e71379a58dc/tika-core/src/main/java/org/apache/tika/mime/MimeTypes.java#L281
> I think this should be changed to return application/xml even if the file is
> corrupted, as is done for all other MIME types, which would be more consistent
> across file formats. Even if a jpg or doc file is corrupted, image/jpeg or
> application/msword is still returned.
> About 2k of the ~90k XML files in an internal corpus trigger this.
> If other devs agree, I can submit a patch and a unit test.
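
For illustration only, here is a minimal sketch of the behaviour under discussion. It is not the Tika code: the class XmlFallbackSketch, the method names, the keepXmlWhenCorrupted flag and the bare "<?xml" prefix check are all assumptions made for this example, and it uses only the JDK's StAX parser rather than Tika's own detector classes.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch, not Tika's implementation.
final class XmlFallbackSketch {

    // Returns the local name of the root element, or null when the parser
    // fails before it reaches a start element.
    static String rootElementOrNull(byte[] data) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Do not load DTDs or external entities from untrusted input.
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
            factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new ByteArrayInputStream(data));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    return reader.getLocalName();
                }
            }
            return null; // document ended without any element
        } catch (XMLStreamException | RuntimeException e) {
            return null; // any parser failure counts as "no root element"
        }
    }

    // Assumed stand-in for magic detection: the data starts with "<?xml".
    static boolean looksLikeXml(byte[] data) {
        int n = Math.min(data.length, 5);
        return new String(data, 0, n, StandardCharsets.US_ASCII).equals("<?xml");
    }

    // keepXmlWhenCorrupted = false models the downgrade to text/plain
    // described in the issue; true models the proposed change.
    static String detect(byte[] data, boolean keepXmlWhenCorrupted) {
        if (!looksLikeXml(data)) {
            return "text/plain";
        }
        if (rootElementOrNull(data) != null) {
            return "application/xml";
        }
        return keepXmlWhenCorrupted ? "application/xml" : "text/plain";
    }
}
```

With this sketch, a file that starts with an XML declaration but breaks before the root element yields text/plain when keepXmlWhenCorrupted is false and application/xml when it is true. A well-formed file yields application/xml in both cases.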