Hi Sheryl and Chris,

Thanks for the feedback. Much appreciated :)

I'll go ahead and make that wiki page.

Should I open a JIRA issue for upgrading Tika, targeted at the 0.5 release?

Cheers,
Tom

On 21 February 2012 23:48, Mattmann, Chris A (388J) <
[email protected]> wrote:

> Hey Tom,
>
> GREAT description of how to get a MIME type added to Tika and cataloged in
> FM.
> I'll try and add this to the wiki if you or someone else doesn't beat me
> to it :)
>
> That being said, this is a fine approach to this use case. Sheryl's other
> email, stating that newer versions of Tika understand the .h5 extension
> out of the box, is correct. I think we could make this automatically
> supported in OODT by:
>
> 1. upgrading to Tika 1.0
> 2. filing a JIRA issue associated with #1 and making sure Tika upgrades are
> coordinated across the components in OODT to get on the same version.
>
> Until those happen, your solution is fine!
>
> Cheers,
> Chris
>
> On Feb 21, 2012, at 5:20 AM, Thomas Bennett wrote:
>
> > Hi,
> >
> > I see that the File Manager extracts the MIME type from the product
> > references that are passed to it via the XML-RPC ingestProduct call.
> >
> > I'm ingesting hdf5 files (ext .h5) into my archive.
> >
> > I've captured the methodCall, and here is the actual parameter that is
> > passed to the File Manager on a successful ingest.
> >
> > <member>
> >     <name>references</name>
> >        ...
> >                         <member>
> >                             <name>mimeType</name>
> >                             <value>application/octet-stream</value>
> >                         </member>
> >                         <member>
> >                             <name>origReference</name>
> >                             <value>file:/var/kat/data/1329472755.h5</value>
> >                         </member>
> >        ...
> > </member>
> >
> > As you can see the mimeType is detected as application/octet-stream.
> >
> > This mimeType is auto-detected by the CAS-Crawler (I'm using the
> > AutoDetectProductCrawler crawlerId).
> >
> > However, I configured the crawler's policy/mimetypes.xml:
> >
> > <mime-info>
> >       <mime-type type="product/hdf5">
> >               <glob pattern="\d{10}\.h5$" isregex="true"/>
> >       </mime-type>
> > </mime-info>
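
A quick way to sanity-check the regex in the <glob> element above is to run it against sample names with plain java.util.regex. This is only a sketch: the class and method names are hypothetical, and whether the crawler's MIME repository applies the pattern to the bare file name or to the full path, and with whole-string or substring semantics, is an assumption to confirm against the Tika version in use.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of OODT or Tika: it only exercises the
// regex from mimetypes.xml with the standard java.util.regex classes.
final class H5GlobCheck {
    // Pattern copied from the <glob> element in mimetypes.xml.
    static final Pattern H5_GLOB = Pattern.compile("\\d{10}\\.h5$");

    // Whole-string semantics: the entire input must be ten digits plus ".h5".
    static boolean matchesWholeName(String name) {
        return H5_GLOB.matcher(name).matches();
    }

    // Substring semantics: the input only has to end in ten digits plus ".h5".
    static boolean foundInPath(String path) {
        return H5_GLOB.matcher(path).find();
    }

    public static void main(String[] args) {
        System.out.println(matchesWholeName("1329472755.h5"));
        System.out.println(matchesWholeName("/var/kat/data/1329472755.h5"));
        System.out.println(foundInPath("/var/kat/data/1329472755.h5"));
    }
}
```

The two methods differ on purpose: a full path only passes under substring semantics, so the pattern may need adjusting depending on which string the detector actually sees.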
> >
> > and policy/mime-extractor-map.xml:
> >
> > <cas:mimetypemap xmlns:cas="http://oodt.jpl.nasa.gov/1.0/cas"
> >       magic="true or false"
> >       mimeRepo="/var/kat/katconfig/static/oodt/cas-crawler/policy/mimetypes.xml">
> >       <mime type="product/hdf5">
> >               <extractor class="org.apache.oodt.cas.metadata.extractors.ExternMetExtractor">
> >                       <config file="/var/kat/katconfig/static/oodt/cas-extractors/katfile/katfile.config"/>
> >                       <preCondComparators>
> >                               <preCondComparator id="CheckThatDataFileSizeIsGreaterThanZero"/>
> >                       </preCondComparators>
> >               </extractor>
> >       </mime>
> > </cas:mimetypemap>
> >
> > The AutoDetectProductCrawler now uses this to detect the file and
> > extract the metadata. However, MimeType detection for the product
> > reference itself is done by the following code in
> > org.apache.oodt.cas.filemgr.structs.Reference.java:
> >
> >         try {
> >             this.mimeType = mimeTypeRepository
> >                     .getMimeType(new URL(origRef));
> >         } catch (MalformedURLException e) {
> >             e.printStackTrace();
> >         }
> >
> > So the MIME type is actually detected by the Tika library. Woot! However,
> > Tika does not seem to know that .h5 files are HDF5 files.
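
To illustrate why an unrecognised extension ends up as application/octet-stream, here is a minimal sketch of an extension lookup with a generic fallback. It is a hypothetical stand-in and not Tika's API: the class, its methods, and the map-based lookup are assumptions made purely for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for a MIME repository; this is NOT Tika's API.
final class ExtensionMimeLookup {
    // Generic type returned when nothing more specific is known.
    static final String DEFAULT_TYPE = "application/octet-stream";

    private final Map<String, String> byExtension = new HashMap<>();

    // Associates an extension (without the dot) with a MIME type.
    void register(String extension, String mimeType) {
        byExtension.put(extension.toLowerCase(Locale.ROOT), mimeType);
    }

    // Returns the registered type for the file's extension, or the default.
    String detect(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return DEFAULT_TYPE; // no usable extension
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return byExtension.getOrDefault(ext, DEFAULT_TYPE);
    }
}
```

With nothing registered for "h5", detect("1329472755.h5") falls through to the default, which mirrors the behaviour described above. After register("h5", "application/x-hdf") the same call returns the registered type.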
> >
> > Forcing the MimeType to "application/x-hdf" in the metadata results in
> > the detected type being appended instead of replaced:
> >
> > MimeType
> > application/x-hdf
> > application/octet-stream
> > application
> > octet-stream
> >
> > So my question: is this okay? Do I live with the
> > application/octet-stream? Any recommendations on how to fix this?
> >
> > Cheers,
> > Tom
> >
> >
> >
> >
> >
>
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: [email protected]
> WWW:   http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
