Hey Tom,
GREAT description of how to get a MIME type added to Tika and cataloged in FM.
I'll try and add this to the wiki if you or someone else doesn't beat me to it
:)
That being said, this is a fine approach to this use case. Sheryl's other email,
stating that newer versions of Tika understand the .h5 extension out of the box,
is correct. I think we could make this automatically supported in OODT by:
1. upgrading to Tika 1.0
2. filing a JIRA issue associated with #1 and making sure Tika upgrades are
coordinated across the components in OODT to get on the same version.
Until those happen, your solution is fine!
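In case it helps anyone else following along, here is a rough sketch of what the regex glob in your mimetypes.xml amounts to. It is plain java.util.regex, not Tika's own matching code, and the class and method names (H5GlobCheck, guessMimeType) are made up for illustration. I'm also assuming the pattern is applied with "find" semantics, so it may match at the end of a full path or URL.

```java
import java.util.regex.Pattern;

class H5GlobCheck {
    // Same regex as the glob pattern in policy/mimetypes.xml.
    private static final Pattern HDF5_GLOB = Pattern.compile("\\d{10}\\.h5$");

    // Hypothetical helper: returns the custom type when the name ends in
    // ten digits followed by ".h5", otherwise the generic fallback type.
    public static String guessMimeType(String fileName) {
        return HDF5_GLOB.matcher(fileName).find()
                ? "product/hdf5"
                : "application/octet-stream";
    }

    public static void main(String[] args) {
        System.out.println(guessMimeType("file:/var/kat/data/1329472755.h5"));
        System.out.println(guessMimeType("notes.txt"));
    }
}
```

Note that the pattern requires exactly ten digits immediately before ".h5", so an .h5 file with any other naming scheme would still fall through to application/octet-stream.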
Cheers,
Chris
On Feb 21, 2012, at 5:20 AM, Thomas Bennett wrote:
> Hi,
>
> I see that the file manager extracts the mime type from the product
> references that are passed to it via the XML-RPC ingestProduct call.
>
> I'm ingesting hdf5 files (ext .h5) into my archive.
>
> I've captured the methodCall, and here is the actual parameter that is passed
> to the File Manager on a successful call.
>
> <member>
> <name>references</name>
> ...
> <member>
> <name>mimeType</name>
> <value>application/octet-stream</value>
> </member>
> <member>
> <name>origReference</name>
> <value>file:/var/kat/data/1329472755.h5</value>
> </member>
> ...
> </member>
>
> As you can see the mimeType is detected as application/octet-stream.
>
> This mimeType is auto-detected by the CAS-Crawler (I'm using the
> AutoDetectProductCrawler crawlerId).
>
> However, I have configured the crawler's policy/mimetypes.xml:
>
> <mime-info>
> <mime-type type="product/hdf5">
> <glob pattern="\d{10}\.h5$" isregex="true"/>
> </mime-type>
> </mime-info>
>
> and policy/mime-extractor-map.xml:
>
> <cas:mimetypemap xmlns:cas="http://oodt.jpl.nassa.gov/1.0/cas"
>   magic="true or false"
>   mimeRepo="/var/kat/katconfig/static/oodt/cas-crawler/policy/mimetypes.xml">
> <mime type="product/hdf5">
> <extractor
> class="org.apache.oodt.cas.metadata.extractors.ExternMetExtractor">
> <config
> file="/var/kat/katconfig/static/oodt/cas-extractors/katfile/katfile.config"/>
> <preCondComparators>
> <preCondComparator
> id="CheckThatDataFileSizeIsGreaterThanZero"/>
> </preCondComparators>
> </extractor>
> </mime>
> </cas:mimetypemap>
>
> The AutoDetectProductCrawler now uses this to detect the file and extract the
> metadata. However, when it comes to MimeType detection, this is done in the
> following line of code in org.apache.oodt.cas.filemgr.structs.Reference.java:
>
> try {
>     this.mimeType = mimeTypeRepository.getMimeType(new URL(origRef));
> } catch (MalformedURLException e) {
>     e.printStackTrace();
> }
>
> So the mime type is actually detected by the Tika library. Woot! However,
> Tika does not seem to know that .h5 files are HDF5 files.
>
> Forcing the MimeType to be "application/x-hdf" in the metadata results in
> the detected mimetype being appended to it rather than replaced:
>
> MimeType
> application/x-hdf
> application/octet-stream
> application
> octet-stream
>
> So my question: Is this okay? Do I live with the application/octet-stream?
> Any recommendations on how to fix this?
>
> Cheers,
> Tom
>
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++