[jira] [Commented] (TIKA-3992) Add common missing mimes based on Common Crawl data

Gregory Lepore (Jira) Wed, 24 May 2023 09:46:06 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-3992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17725873#comment-17725873
 ]


Gregory Lepore commented on TIKA-3992:
--------------------------------------

Looking at the full-table.csv file there are some easy wins to reduce the 
number of unidentified formats (in addition to the audio xm/x-mod work).

The following formats are cleanly identified by `file` and/or siegfried, but 
Tika currently detects them as application/octet-stream:

 
Extension  Mime                                              Format                                                             Count
laz        application/octet-stream                          ASPRS Lidar Data Exchange Format                                   4006
bsp        application/octet-stream                          The Source Engine BSP File Format/NASA SPICE file (binary format)  2800
zim        application/octet-stream                          Zeno IMproved                                                      2711
jdf        application/octet-stream                          JEOL NMR Spectroscopy                                              2597
blend      application/octet-stream (application/x-blender)  Blender 3D                                                         947
 
That's just from looking over the most common extensions in the collection and 
verifying several files for each.
 
What's the best way to work through these formats and get them added to Tika? 
I'm happy to extract the format identification data from PRONOM and/or my 
research to add to Tika.
 
For example, for the ASPRS Lidar Data Exchange Format, `roy inspect` (a 
companion tool to siegfried) reports:
 
roy inspect fmt/370 
ASPRS LIDAR DATA EXCHANGE FORMAT (FMT/370) 
globs: *.las, *.laz 
sigs: (B:0 seq "LASF" | P:20 seq "\x01\x02" | P:78 r "\x00" - 99)


So that is "LASF" at offset 0, plus the bytes 01 02 at offset 24 (the 4-byte 
"LASF" plus the 20-byte gap), and I'm not sure about the last part.
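
To make the first two segments of that signature concrete, here is a minimal Python sketch of the check. It assumes that P:20 means "20 bytes after the end of the previous match", which is consistent with the offset 24 noted above. The third segment is deliberately left out because its meaning is unclear. The names `looks_like_las` and `VERSION_OFFSET` are hypothetical and are not part of Tika, siegfried, or roy.

```python
LAS_MAGIC = b"LASF"

# Assumption: 4 bytes of magic plus the 20-byte gap (P:20) gives offset 24.
VERSION_OFFSET = len(LAS_MAGIC) + 20


def looks_like_las(header: bytes, version: bytes = b"\x01\x02") -> bool:
    """Check only the first two segments of the fmt/370 signature.

    `header` is the leading bytes of a file; a truncated header that is
    too short to contain the version bytes is reported as no match.
    """
    if not header.startswith(LAS_MAGIC):
        return False
    end = VERSION_OFFSET + len(version)
    if len(header) < end:
        return False
    return header[VERSION_OFFSET:end] == version


if __name__ == "__main__":
    # Synthetic header built for illustration, not taken from a real file.
    sample = LAS_MAGIC + bytes(20) + b"\x01\x02" + bytes(64)
    print(looks_like_las(sample))
```

A check like this is only useful for spot-checking the extracted files; the actual detection rule would still need to be written as a magic entry for Tika's mime-types.xml.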
 
The Format Documentation is at: 
[https://www.asprs.org/wp-content/uploads/2010/12/asprs_las_format_v10.pdf]
but says nothing about mime type.
 
Thoughts?
 

> Add common missing mimes based on Common Crawl data
> ---------------------------------------------------
>
>                 Key: TIKA-3992
>                 URL: https://issues.apache.org/jira/browse/TIKA-3992
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: mimes.zip
>
>
> In the latest Common Crawl crawl, there are ~600k 'octet-stream' files as 
> detected by Tika.  It would be useful to extract those (even if truncated) 
> and run 'file' and 'siegfried' against those file types that are unknown to 
> Tika.  We can prioritize the most common file formats as identified by file 
> and siegfried for addition to our mime-types.xml.
> Separately, we might also want to do the same thing for 
> `application/zip`...there are likely zip-based file types that we could do a 
> better job on.
> Thanks to [~snagel] for a dump of stats on the most recent crawl.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
