[ https://issues.apache.org/jira/browse/TIKA-3992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17706400#comment-17706400 ]
Andrew Jackson commented on TIKA-3992:
--------------------------------------

Sounds interesting! Just wanted to note that Siegfried (and DROID/etc) signatures often require end-of-file matches as well as beginning-of-file, so if you do truncate the files you'll get the best results by chopping out the middle. I'd imagine the first and last few KB should do it.

> Add common missing mimes based on Common Crawl data
> ---------------------------------------------------
>
>                 Key: TIKA-3992
>                 URL: https://issues.apache.org/jira/browse/TIKA-3992
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> In the latest Common Crawl crawl, there are ~600k 'octet-stream' files as
> detected by Tika. It would be useful to extract those (even if truncated)
> and run 'file' and 'siegfried' against those file types that are unknown to
> Tika. We can prioritize the most common file formats as identified by file
> and siegfried for addition to our mime-types.xml.
> Separately, we might also want to do the same thing for
> `application/zip`...there are likely zip-based file types that we could do a
> better job on.
> Thanks to [~snagel] for a dump of stats on the most recent crawl.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
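
For illustration, the "chop out the middle" truncation suggested in the comment could look roughly like the sketch below. This is only a sketch under stated assumptions: the function name `truncate_middle` is hypothetical, and the 4 KiB defaults are one reading of "the first and last few KB", not values taken from the issue or from Tika, Siegfried, or DROID.

```python
def truncate_middle(data: bytes, head_bytes: int = 4096, tail_bytes: int = 4096) -> bytes:
    """Keep the first head_bytes and the last tail_bytes of data, dropping the middle.

    Hypothetical helper; the 4 KiB defaults are an assumption, not a documented value.
    """
    if head_bytes < 0 or tail_bytes < 0:
        raise ValueError("head_bytes and tail_bytes must be non-negative")
    # Short inputs are returned whole so the head and tail never overlap.
    if len(data) <= head_bytes + tail_bytes:
        return data
    # Slice the tail from an explicit offset; data[-0:] would return everything.
    return data[:head_bytes] + data[len(data) - tail_bytes:]


sample = truncate_middle(b"0123456789", head_bytes=3, tail_bytes=2)
print(sample)  # b'01289'
```

Because the tail stays at the end of the sample, matches defined relative to the end of the file should still line up. A signature anchored at an absolute offset beyond the retained head would presumably no longer match, so the sizes may need tuning.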