Tim Allison created TIKA-3992:
---------------------------------

             Summary: Add common missing mimes based on Common Crawl data
                 Key: TIKA-3992
                 URL: https://issues.apache.org/jira/browse/TIKA-3992
             Project: Tika
          Issue Type: Task
            Reporter: Tim Allison


In the latest Common Crawl crawl, Tika detected ~600k files as 
'octet-stream'.  It would be useful to extract those files (even if truncated) 
and run 'file' and 'siegfried' against them.  We can then prioritize the most 
common formats that file and siegfried identify, but Tika does not, for 
addition to our mime-types.xml.
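
As a rough illustration of the tallying step, here is a minimal Python sketch. 
It assumes the octet-stream files have already been extracted to local disk and 
that the 'file' utility is on the PATH.  The function names are hypothetical, 
not part of Tika.  A siegfried-based detector could be passed in through the 
same `detect` parameter.

```python
import subprocess
from collections import Counter


def detect_with_file(path):
    """Return the MIME type that the `file` utility reports for one path."""
    result = subprocess.run(
        ["file", "--brief", "--mime-type", str(path)],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def tally_types(paths, detect=detect_with_file):
    """Count how often each detected type occurs, most common first.

    `detect` is any callable mapping a path to a type string, so another
    tool can be swapped in without changing the tally logic.
    """
    counts = Counter(detect(p) for p in paths)
    return counts.most_common()
```

Calling `tally_types` on the extracted files would give a ranked list of 
(type, count) pairs to use when deciding which formats to add first.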

Separately, we might also want to do the same thing for 
`application/zip`: there are likely zip-based file types that we could detect 
more specifically.
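
One possible way to look for zip-based formats, sketched in Python under the 
assumption that the `application/zip` files are available locally: count which 
entry names recur across archives, since container formats often carry a 
characteristic marker entry.  The function name is hypothetical.  Note that 
Python's zipfile module needs the central directory at the end of the archive, 
so truncated files are only counted as unreadable here.

```python
import zipfile
from collections import Counter


def tally_entry_names(sources):
    """Count in how many archives each entry name appears.

    `sources` is an iterable of paths or binary file objects. Returns a
    (Counter, unreadable_count) pair; archives that cannot be opened,
    such as truncated ones, only increment the unreadable count.
    """
    counts = Counter()
    unreadable = 0
    for source in sources:
        try:
            with zipfile.ZipFile(source) as zf:
                # Use a set so each name counts once per archive.
                counts.update(set(zf.namelist()))
        except zipfile.BadZipFile:
            unreadable += 1
    return counts, unreadable
```

Entry names that show up in many archives would be candidates for closer 
inspection.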

Thanks to [~snagel] for a dump of stats on the most recent crawl.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
