Hi Sheryl and Chris,

Thanks for the feedback. Much appreciated :)
I'll go ahead and make that wiki page.

Should I open a JIRA issue relating to upgrading Tika? For the 0.5 release?

Cheers,
Tom

On 21 February 2012 23:48, Mattmann, Chris A (388J) <[email protected]> wrote:

> Hey Tom,
>
> GREAT description of how to get a MIME type added to Tika and cataloged in
> FM. I'll try and add this to the wiki if you or someone else doesn't beat me
> to it :)
>
> That being said, this is a fine approach to this use case. Sheryl's other
> email stating that newer versions of Tika understand the .h5 extension out of
> the box is correct. I think we could make this automatically supported in
> OODT by:
>
> 1. upgrading to Tika 1.0
> 2. filing a JIRA issue associated with #1 and making sure Tika upgrades are
>    coordinated across the components in OODT to get on the same version.
>
> Until those happen, your solution is fine!
>
> Cheers,
> Chris
>
> On Feb 21, 2012, at 5:20 AM, Thomas Bennett wrote:
>
> > Hi,
> >
> > I see that the file manager extracts the mime type from the product
> > references that are passed to it via the XML-RPC ingestProduct call.
> >
> > I'm ingesting hdf5 files (ext .h5) into my archive.
> >
> > I've captured the methodCall and here is the actual parameter that is
> > passed to the File Manager on a successful ingest.
> >
> > <member>
> >   <name>references</name>
> >   ...
> >   <member>
> >     <name>mimeType</name>
> >     <value>application/octet-stream</value>
> >   </member>
> >   <member>
> >     <name>origReference</name>
> >     <value>file:/var/kat/data/1329472755.h5</value>
> >   </member>
> >   ...
> > </member>
> >
> > As you can see the mimeType is detected as application/octet-stream.
> >
> > This mimeType is auto-detected by the CAS-Crawler (I'm using the
> > AutoDetectProductCrawler crawlerId).
> >
> > However, I configure the Crawler policy/mimetypes.xml:
> >
> > <mime-info>
> >   <mime-type type="product/hdf5">
> >     <glob pattern="\d{10}\.h5$" isregex="true"/>
> >   </mime-type>
> > </mime-info>
> >
> > and policy/mime-extractor-map.xml:
> >
> > <cas:mimetypemap xmlns:cas="http://oodt.jpl.nasa.gov/1.0/cas"
> >     magic="true or false"
> >     mimeRepo="/var/kat/katconfig/static/oodt/cas-crawler/policy/mimetypes.xml">
> >   <mime type="product/hdf5">
> >     <extractor class="org.apache.oodt.cas.metadata.extractors.ExternMetExtractor">
> >       <config file="/var/kat/katconfig/static/oodt/cas-extractors/katfile/katfile.config"/>
> >       <preCondComparators>
> >         <preCondComparator id="CheckThatDataFileSizeIsGreaterThanZero"/>
> >       </preCondComparators>
> >     </extractor>
> >   </mime>
> > </cas:mimetypemap>
> >
> > The AutoDetectProductCrawler now uses this to detect the file and extract
> > the metadata. However, when it comes to MimeType detection, this is done in
> > the following lines of code in
> > org.apache.oodt.cas.filemgr.structs.Reference.java:
> >
> > try {
> >     this.mimeType = mimeTypeRepository
> >             .getMimeType(new URL(origRef));
> > } catch (MalformedURLException e) {
> >     e.printStackTrace();
> > }
> >
> > So the mime-type is actually detected by the Tika library. Woot! So Tika
> > does not seem to know about .h5 files and that they are hdf5 files.
> >
> > Forcing a MimeType to be "application/x-hdf" in the MetaData results in the
> > mimetype being appended.
> >
> > MimeType
> >   application/x-hdf
> >   application/octet-stream
> >   application
> >   octet-stream
> >
> > So my question: Is this okay? Do I live with the application/octet-stream?
> > Any recommendations on how to fix this?
> >
> > Cheers,
> > Tom
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
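
A rough sketch of the lookup discussed in this thread, in case it helps: a regex glob such as \d{10}\.h5$ mapped to product/hdf5, with application/octet-stream as the fallback when nothing matches. This is plain java.util.regex, not the Tika or OODT code. The class name MimeGlobSketch, the detect method, and the choices to match against the last path segment and to use find() are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of OODT or Tika. */
final class MimeGlobSketch {
    /** Type reported in the thread when no rule matches the file. */
    static final String FALLBACK = "application/octet-stream";

    /** Regex globs in declaration order, mirroring the mimetypes.xml entry. */
    private static final Map<Pattern, String> REGEX_GLOBS = new LinkedHashMap<>();
    static {
        REGEX_GLOBS.put(Pattern.compile("\\d{10}\\.h5$"), "product/hdf5");
    }

    /**
     * Returns the first type whose pattern is found in the file name.
     * Assumption: only the last path segment of the reference is tested.
     */
    static String detect(String reference) {
        String name = reference.substring(reference.lastIndexOf('/') + 1);
        for (Map.Entry<Pattern, String> entry : REGEX_GLOBS.entrySet()) {
            if (entry.getKey().matcher(name).find()) {
                return entry.getValue();
            }
        }
        return FALLBACK;
    }

    public static void main(String[] args) {
        System.out.println(detect("file:/var/kat/data/1329472755.h5"));
        System.out.println(detect("file:/var/kat/data/notes.txt"));
    }
}
```

If the File Manager's repository has no such rule loaded, the same reference falls through to the fallback, which would be consistent with the application/octet-stream value seen in the captured methodCall.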
