[ 
https://issues.apache.org/jira/browse/TIKA-3992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17739089#comment-17739089
 ] 

Gregory Lepore commented on TIKA-3992:
--------------------------------------

Got a chance to download the May/June CommonCrawl dataset. There were 132,431 
application/octet-stream files in the download.

 

Running a recent Tika against those octet-stream files positively identifies 
22,025 of them, mostly in categories added in this ticket. (Side note: it looks 
like a lot of the formats listed as Open on this ticket have actually been 
added, e.g. FAT, Jigdo, MMM. Is this correct?)

 

Running all of my signatures and the current PRONOM signatures against the data 
would add another 25,962 positively identified files to the set, for a total of 
47,987, or roughly 36% of the 132,431 octet-stream files.
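
For anyone checking the arithmetic, a minimal sketch using only the counts 
quoted above (the variable names are illustrative, not from any tool):

```python
# Counts quoted in this comment.
total_octet_stream = 132_431             # octet-stream files in the download
identified_by_tika = 22_025              # identified by a recent Tika
identified_by_extra_signatures = 25_962  # added by my signatures plus PRONOM

# Combined number of positively identified files.
identified_total = identified_by_tika + identified_by_extra_signatures

# Share of the octet-stream files that would then be identified.
share = identified_total / total_octet_stream

print(identified_total)   # 47987
print(f"{share:.1%}")     # 36.2%
```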

 

The top ten formats among those additional files are:

  4942 Modified Maximum Method Digisonde Portable Sounder File 
  1828 Valve Source BSP Format 
  1239 MPEG 1/2 Audio Layer 3 
  1100 SquashFS Image File 
  1006 bigWig Track Format 
   841 NumPy array 
   698 BigBed Format 
   626 MoPaQ (MPQ) archive 
   532 NASA SPICE file (binary format) 
   503 Unreal Engine Package



It's a bit odd that the Digisonde format appears on both the Tika-identified 
list and the Tika-unidentified list (in other words, some MMM files appear as 
octet-stream in the CC dataset while others are correctly identified). 
Possibly the PRONOM signature and the Tika signature differ.
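
A rough sketch of how such overlaps between the two lists could be found. It 
assumes each list is available as a mapping from file hash to format name; the 
function name and the sample data are hypothetical and only for illustration:

```python
def formats_on_both_lists(identified, unidentified):
    """Return the format names that occur in both mappings.

    Each mapping is assumed to go from file hash to format name.
    """
    return sorted(set(identified.values()) & set(unidentified.values()))


# Hypothetical sample data: the hashes and assignments are made up.
tika_identified = {"hash-a": "Digisonde MMM", "hash-b": "FAT"}
tika_unidentified = {"hash-c": "Digisonde MMM", "hash-d": "NumPy array"}

overlap = formats_on_both_lists(tika_identified, tika_unidentified)
print(overlap)
```

Any format reported here would be a candidate for comparing the two signatures 
byte by byte.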

Tika also timed out on two files, which is surprising because the files were 
only 1MB due to the truncation in the CommonCrawl data. The `file` command 
identified both as "new-fs dump file (little endian)". This might be an area 
to investigate. The files are 
3d055010c16209349b6394171a81f55f99f1dc323f60037be67b544b63e3505c and 
501111266ba96e8ead4c513d28b6fc81654f6b3d52b4763b0d8a39a9d0aafe53.

 

I will continue to work on the CC dataset as there are dozens of additional 
file format signatures that can be added from this data. I will also add the 
additionally recognized files to this ticket to further reduce the octet-stream 
numbers.

 

I will be out of pocket until July 20th or so; work will continue then!

 

> Add common missing mimes based on Common Crawl data
> ---------------------------------------------------
>
>                 Key: TIKA-3992
>                 URL: https://issues.apache.org/jira/browse/TIKA-3992
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: mimes.zip
>
>
> In the latest Common Crawl crawl, there are ~600k 'octet-stream' files as 
> detected by Tika.  It would be useful to extract those (even if truncated) 
> and run 'file' and 'siegfried' against those file types that are unknown to 
> Tika.  We can prioritize the most common file formats as identified by file 
> and siegfried for addition to our mime-types.xml.
> Separately, we might also want to do the same thing for 
> `application/zip`...there are likely zip-based file types that we could do a 
> better job on.
> Thanks to [~snagel] for a dump of stats on the most recent crawl.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
