Hey Tom,

On Feb 22, 2012, at 2:51 AM, Thomas Bennett wrote:
> Hi Sheryl and Chris,
>
> Thanks for the feedback. Much appreciated :)

Anytime!

> I'll go ahead and make that wiki page.

+1.

> Should I open a JIRA issue relating to upgrading Tika? For the 0.5 release?

+1, let's do that...

Cheers,
Chris

> On 21 February 2012 23:48, Mattmann, Chris A (388J)
> <[email protected]> wrote:
> Hey Tom,
>
> GREAT description of how to get a MIME type added to Tika and cataloged in FM.
> I'll try and add this to the wiki if you or someone else doesn't beat me to
> it :)
>
> That being said, this is a fine approach to this use case. Sheryl's other
> email stating that newer versions of Tika understand the .h5 extension out
> of the box is correct. I think we could make this automatically supported in
> OODT by:
>
> 1. upgrading to Tika 1.0
> 2. filing a JIRA issue associated with #1 and making sure Tika upgrades are
>    coordinated across the components in OODT to get on the same version.
>
> Until those happen, your solution is fine!
>
> Cheers,
> Chris
>
> On Feb 21, 2012, at 5:20 AM, Thomas Bennett wrote:
>
> > Hi,
> >
> > I see that the file manager extracts the mime type from the product
> > references that are passed to it via the xml-rpc ingestProduct call.
> >
> > I'm ingesting hdf5 files (ext .h5) into my archive.
> >
> > I've captured the methodCall and here is the actual parameter that is
> > passed to the File Manager on a successful ingest.
> >
> > <member>
> >   <name>references</name>
> >   ...
> >   <member>
> >     <name>mimeType</name>
> >     <value>application/octet-stream</value>
> >   </member>
> >   <member>
> >     <name>origReference</name>
> >     <value>file:/var/kat/data/1329472755.h5</value>
> >   </member>
> >   ...
> > </member>
> >
> > As you can see the mimeType is detected as application/octet-stream.
> >
> > This mimeType is auto-detected by the CAS-Crawler (I'm using the
> > AutoDetectProductCrawler crawlerId).
> >
> > However, I configure the Crawler policy/mimetypes.xml:
> >
> > <mime-info>
> >   <mime-type type="product/hdf5">
> >     <glob pattern="\d{10}\.h5$" isregex="true"/>
> >   </mime-type>
> > </mime-info>
> >
> > and policy/mime-extractor-map.xml:
> >
> > <cas:mimetypemap xmlns:cas="http://oodt.jpl.nassa.gov/1.0/cas"
> >     magic="true or false"
> >     mimeRepo="/var/kat/katconfig/static/oodt/cas-crawler/policy/mimetypes.xml">
> >   <mime type="product/hdf5">
> >     <extractor
> >         class="org.apache.oodt.cas.metadata.extractors.ExternMetExtractor">
> >       <config
> >           file="/var/kat/katconfig/static/oodt/cas-extractors/katfile/katfile.config"/>
> >       <preCondComparators>
> >         <preCondComparator id="CheckThatDataFileSizeIsGreaterThanZero"/>
> >       </preCondComparators>
> >     </extractor>
> >   </mime>
> > </cas:mimetypemap>
> >
> > The AutoDetectProductCrawler now uses this to detect the file and extract
> > the metadata. However, when it comes to MimeType detection, this is done in
> > the following line of code in
> > org.apache.oodt.cas.filemgr.structs.Reference.java:
> >
> > try {
> >     this.mimeType = mimeTypeRepository.getMimeType(new URL(origRef));
> > } catch (MalformedURLException e) {
> >     e.printStackTrace();
> > }
> >
> > So the mime-type is actually detected by the Tika library. Woot! So Tika
> > does not seem to know about .h5 files and that they are hdf5 files.
> >
> > Forcing a MimeType to be "application/x-hdf" in the MetaData results in the
> > mimetype being appended.
> >
> > MimeType
> >   application/x-hdf
> >   application/octet-stream
> >   application
> >   octet-stream
> >
> > So my question: Is this okay? Do I live with the application/octet-stream?
> > Any recommendations on how to fix this?
> >
> > Cheers,
> > Tom
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
