[ https://issues.apache.org/jira/browse/TIKA-3992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17739089#comment-17739089 ]
Gregory Lepore commented on TIKA-3992:
--------------------------------------

I got a chance to download the May/June CommonCrawl dataset. There were 132,431 application/octet-stream files in the download. Running a recent Tika against those octet-stream files results in 22,025 positively identified files, mostly falling into categories added in this ticket. (Side note: it looks like a lot of the formats listed as Open on this ticket have actually been added, e.g. FAT, Jigdo, MMM. Is this correct?)

Running all of my signatures and the current PRONOM signatures against the data would add another 25,962 positively identified files to the set, for a total of 47,987, or roughly 36% of the total. The top ten in the additional list are:

4942 Modified Maximum Method Digisonde Portable Sounder File
1828 Valve Source BSP Format
1239 MPEG 1/2 Audio Layer 3
1100 SquashFS Image File
1006 bigWig Track Format
 841 NumPy array
 698 BigBed Format
 626 MoPaQ (MPQ) archive
 532 NASA SPICE file (binary format)
 503 Unreal Engine Package

It's a bit weird that the Digisonde file appears on both the Tika-identified list and the Tika-not-identified list (in other words, some MMM files appear as octet-stream in the CC dataset and others are correctly identified). Possibly there's a difference between the PRONOM signature and the Tika signature.

There were also two files that Tika timed out on, which is odd because the files were only 1MB due to the truncation in the CommonCrawl data. Both were identified as "new-fs dump file (little endian)" by the `file` command. This might be an area to investigate. These files are 3d055010c16209349b6394171a81f55f99f1dc323f60037be67b544b63e3505c and 501111266ba96e8ead4c513d28b6fc81654f6b3d52b4763b0d8a39a9d0aafe53.

I will continue to work on the CC dataset, as there are dozens of additional file format signatures that can be added from this data. I will also add the additionally recognized files to this ticket to further reduce the octet-stream numbers.
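
For anyone who wants to cross-check the totals quoted above, here is a minimal sketch of the arithmetic. The function name `coverage` and its parameter names are made up for illustration and are not part of Tika or any other tool; the three numbers are the ones given in this comment.

```python
def coverage(identified_by_tika, added_by_signatures, total_files):
    """Return (combined identified count, share of all octet-stream files).

    Hypothetical helper for illustration only; not part of Tika.
    """
    combined = identified_by_tika + added_by_signatures
    share = combined / total_files
    return combined, share


if __name__ == "__main__":
    # Counts taken from the comment: Tika hits, extra signature hits,
    # and the number of octet-stream files in the download.
    combined, share = coverage(22025, 25962, 132431)
    print(f"{combined} identified, {share:.1%} of the octet-stream files")
```

The same helper can be reused with the Tika-only count to see how much of the gain comes from the additional signatures.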
I will be out of pocket until July 20th or so; work will continue then!

> Add common missing mimes based on Common Crawl data
> ---------------------------------------------------
>
>                 Key: TIKA-3992
>                 URL: https://issues.apache.org/jira/browse/TIKA-3992
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: mimes.zip
>
>
> In the latest Common Crawl crawl, there are ~600k 'octet-stream' files as
> detected by Tika. It would be useful to extract those (even if truncated)
> and run 'file' and 'siegfried' against those file types that are unknown to
> Tika. We can prioritize the most common file formats as identified by file
> and siegfried for addition to our mime-types.xml.
> Separately, we might also want to do the same thing for
> `application/zip`...there are likely zip-based file types that we could do a
> better job on.
> Thanks to [~snagel] for a dump of stats on the most recent crawl.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)