Tim Allison created TIKA-3992:
---------------------------------

             Summary: Add common missing mimes based on Common Crawl data
                 Key: TIKA-3992
                 URL: https://issues.apache.org/jira/browse/TIKA-3992
             Project: Tika
          Issue Type: Task
            Reporter: Tim Allison

In the latest Common Crawl crawl, there are ~600k files that Tika detects as 'octet-stream'. It would be useful to extract those files (even if truncated) and run 'file' and 'siegfried' against them to identify the file types that are unknown to Tika. We can then prioritize the most common file formats, as identified by file and siegfried, for addition to our mime-types.xml.

Separately, we might also want to do the same thing for `application/zip`. There are likely zip-based file types that we could do a better job of detecting.

Thanks to [~snagel] for a dump of stats on the most recent crawl.
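A rough sketch of the tallying step, assuming the 'octet-stream' files have already been extracted to a local directory. The function names and the directory name in the usage note are hypothetical, not part of Tika. The sketch shells out to `file --brief --mime-type` for each file and ranks the reported types by frequency; results from siegfried could be passed through the same `rank_types` helper, but parsing its output is not shown here.

```python
import subprocess
from collections import Counter
from pathlib import Path


def detect_with_file(path):
    """Return the MIME type that the `file` utility reports for one path."""
    result = subprocess.run(
        ["file", "--brief", "--mime-type", str(path)],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def rank_types(labels):
    """Rank detected types from most common to least common."""
    return Counter(labels).most_common()


def rank_directory(directory, detect=detect_with_file):
    """Run `detect` on every regular file under `directory` and rank the results.

    `detect` is pluggable so that another identifier can be swapped in.
    """
    paths = sorted(p for p in Path(directory).rglob("*") if p.is_file())
    return rank_types(detect(p) for p in paths)
```

Calling `rank_directory("octet-stream-samples")` (a hypothetical directory) would return a list of `(mime_type, count)` pairs, with the most frequent types first as candidates for new mime definitions.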