[ https://issues.apache.org/jira/browse/TIKA-3992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison updated TIKA-3992:
------------------------------
Attachment: extracted-urls-all.csv.zip
> Add common missing mimes based on Common Crawl data
> ---------------------------------------------------
>
> Key: TIKA-3992
> URL: https://issues.apache.org/jira/browse/TIKA-3992
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
> Attachments: extracted-urls-all.csv.zip
>
>
> In the latest Common Crawl crawl, there are ~600k files that Tika detects
> as 'octet-stream'. It would be useful to extract those files (even if
> truncated) and run 'file' and 'siegfried' against them, since their types
> are unknown to Tika. We can then prioritize the most common file formats
> identified by file and siegfried for addition to our mime-types.xml.
> Separately, we might also want to do the same thing for
> `application/zip`: there are likely zip-based file types that we could do a
> better job on.
> Thanks to [~snagel] for a dump of stats on the most recent crawl.
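
The triage described above could be sketched roughly as follows. This is only an illustration, not the workflow actually used for this issue: it assumes the octet-stream files have already been extracted to local disk, that the Unix `file` utility is on the PATH, and the names `detect_with_file` and `rank_mime_types` are hypothetical. A siegfried-based detector could presumably be passed in through the same `detect` parameter.

```python
import subprocess
from collections import Counter
from pathlib import Path
from typing import Callable, Iterable, List, Tuple

OCTET_STREAM = "application/octet-stream"


def detect_with_file(path: Path) -> str:
    """Ask the Unix `file` utility for a MIME type (assumes `file` is installed)."""
    result = subprocess.run(
        ["file", "--brief", "--mime-type", str(path)],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def rank_mime_types(
    paths: Iterable[Path],
    detect: Callable[[Path], str] = detect_with_file,
) -> List[Tuple[str, int]]:
    """Count the types a second detector reports, most common first.

    Files the second detector also reports as octet-stream are dropped,
    since they give no hint about which format to add.
    """
    counts = Counter(detect(p) for p in paths)
    counts.pop(OCTET_STREAM, None)
    return counts.most_common()
```

Calling `rank_mime_types(Path("extracted").iterdir())` on a hypothetical `extracted` directory would return (type, count) pairs, with the most frequent types as the first candidates for mime-types.xml.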
--
This message was sent by Atlassian Jira
(v8.20.10#820010)