[ https://issues.apache.org/jira/browse/TIKA-3992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17725873#comment-17725873 ]
Gregory Lepore commented on TIKA-3992:
--------------------------------------

Looking at the full-table.csv file, there are some easy wins to reduce the number of unidentified formats (in addition to the audio xm/x-mod work). The following formats are cleanly identified by file and/or siegfried, but are currently application/octet-stream:

Extension  Mime                                              Format                                                             Count
laz        application/octet-stream                          ASPRS Lidar Data Exchange Format                                   4006
bsp        application/octet-stream                          The Source Engine BSP File Format/NASA SPICE file (binary format)  2800
zim        application/octet-stream                          Zeno IMproved                                                      2711
jdf        application/octet-stream                          JEOL NMR Spectroscopy                                              2597
blend      application/octet-stream (application/x-blender)  Blender 3D                                                         947

That's just from looking over the most common extensions in the collection and verifying several files. What's the best way to work through these formats and get them added to Tika? I'm happy to extract the format identification data from PRONOM and/or my research to add to Tika.

For example, for the ASPRS Lidar Data Exchange Format, `roy inspect` (a companion tool to siegfried) reports:

roy inspect fmt/370
ASPRS LIDAR DATA EXCHANGE FORMAT (FMT/370)
globs: *.las, *.laz
sigs: (B:0 seq "LASF" | P:20 seq "\x01\x02" | P:78 r "\x00" - 99)

So LASF at offset 0, plus 0102 at offset 24, and I'm not sure about the last bit.

The Format Documentation is at: [https://www.asprs.org/wp-content/uploads/2010/12/asprs_las_format_v10.pdf] but says nothing about mime type.

Thoughts?

> Add common missing mimes based on Common Crawl data
> ---------------------------------------------------
>
>                 Key: TIKA-3992
>                 URL: https://issues.apache.org/jira/browse/TIKA-3992
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: mimes.zip
>
>
> In the latest Common Crawl crawl, there are ~600k 'octet-stream' files as detected by Tika. It would be useful to extract those (even if truncated) and run 'file' and 'siegfried' against those file types that are unknown to Tika. We can prioritize the most common file formats as identified by file and siegfried for addition to our mime-types.xml.
> Separately, we might also want to do the same thing for `application/zip`...there are likely zip-based file types that we could do a better job on.
> Thanks to [~snagel] for a dump of stats on the most recent crawl.
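
As a rough illustration of the `roy inspect` signature quoted in the comment, the first two segments can be checked by hand with a few lines of Python. This is only a sketch under stated assumptions: it reads `P:20` as "20 bytes after the end of the previous segment", which matches the comment's reading of offset 24, and it deliberately skips the third segment (`P:78 r "\x00" - 99`), since the comment leaves its meaning open. The names `looks_like_las` and `sniff_file` are hypothetical and are not part of Tika, siegfried, or PRONOM.

```python
# Sketch only: checks the first two segments of the fmt/370 signature
# as reported by `roy inspect`. Not Tika's detection mechanism.

LAS_MAGIC = b"LASF"             # B:0 seq "LASF"
LAS_SECOND = b"\x01\x02"        # P:20 seq "\x01\x02"

# Assumption: P:20 is relative to the end of the previous segment,
# so the second sequence starts at 4 + 20 = 24.
LAS_SECOND_OFFSET = len(LAS_MAGIC) + 20

# Minimum number of bytes needed to evaluate both segments.
MIN_HEADER_LEN = LAS_SECOND_OFFSET + len(LAS_SECOND)


def looks_like_las(header: bytes) -> bool:
    """Return True if `header` matches the first two signature segments."""
    if len(header) < MIN_HEADER_LEN:
        return False
    if not header.startswith(LAS_MAGIC):
        return False
    end = LAS_SECOND_OFFSET + len(LAS_SECOND)
    return header[LAS_SECOND_OFFSET:end] == LAS_SECOND


def sniff_file(path: str) -> bool:
    """Read only as many bytes as the check needs from `path`."""
    with open(path, "rb") as handle:
        return looks_like_las(handle.read(MIN_HEADER_LEN))
```

Because only the first 26 bytes are read, a check like this would also work on the truncated files mentioned in the issue description. Whether the third segment is needed to avoid false positives is left open here, as it is in the comment.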