[ 
https://issues.apache.org/jira/browse/TIKA-3992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17706405#comment-17706405
 ] 

Tim Allison commented on TIKA-3992:
-----------------------------------

Ah, that's helpful. Thank you! By "truncated", I was referring to the way CC
(Common Crawl) truncates fetches at 1MB. So, we really don't have access to
the ends of the files unless we refetch from the original URLs, which I am not
proposing to do on this ticket.

We'll see what we can do with what we have...

> Add common missing mimes based on Common Crawl data
> ---------------------------------------------------
>
>                 Key: TIKA-3992
>                 URL: https://issues.apache.org/jira/browse/TIKA-3992
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> In the latest Common Crawl crawl, there are ~600k 'octet-stream' files as 
> detected by Tika.  It would be useful to extract those (even if truncated) 
> and run 'file' and 'siegfried' against those file types that are unknown to 
> Tika.  We can prioritize the most common file formats as identified by file 
> and siegfried for addition to our mime-types.xml.
> Separately, we might also want to do the same thing for 
> `application/zip`...there are likely zip-based file types that we could do a 
> better job on.
> Thanks to [~snagel] for a dump of stats on the most recent crawl.
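
The prioritization step in the description could be sketched roughly as below. This is only an illustration, not something taken from the ticket: it assumes the octet-stream payloads have already been extracted to local files and that the `file` utility is on the PATH. The names `detect_with_file`, `tally_mimes`, and `report` are hypothetical, and a siegfried pass is not shown.

```python
import subprocess
import sys
from collections import Counter
from pathlib import Path
from typing import Callable, Iterable


def detect_with_file(path: Path) -> str:
    """Return the MIME type that the `file` utility reports for one path."""
    result = subprocess.run(
        ["file", "--brief", "--mime-type", str(path)],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def tally_mimes(
    paths: Iterable[Path],
    detect: Callable[[Path], str] = detect_with_file,
) -> Counter:
    """Count how often each detected MIME type occurs across the files."""
    return Counter(detect(p) for p in paths)


def report(counts: Counter, top: int = 20) -> list[tuple[str, int]]:
    """Return the `top` most frequent MIME types, most common first."""
    return counts.most_common(top)


if __name__ == "__main__":
    # Assumption: the first argument is a directory of extracted payloads.
    root = Path(sys.argv[1])
    files = [p for p in root.rglob("*") if p.is_file()]
    for mime, n in report(tally_mimes(files)):
        print(f"{n}\t{mime}")
```

Because the files are truncated at 1MB, any detection that depends on the end of a file may be unreliable, so the counts are best read as a rough ranking.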



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
