Hi Val,

Please see OODT-805 and
https://github.com/apache/oodt/commit/cf1220d4ac66ccefc8e510c62fb6b38cf529ffb2
for what I believe is the fix.

Can you make the MimeTypeUtils changes locally or try out trunk?
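
In case it helps while comparing results, here is a minimal, JDK-only sketch of
glob-based lookup over a mime-types.xml-style file. It is not Tika's or OODT's
implementation. The class and method names (GlobMimeLookup, loadGlobs, detect)
are made up for illustration, it assumes a <mime-info> root element, it only
handles simple "*.ext" globs, and it ignores magic-byte detection. It just shows
the mapping you would expect once your custom types are actually loaded.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, for illustration only (not part of Tika or OODT).
public class GlobMimeLookup {
    // Fallback returned when no glob matches the file name.
    static final String DEFAULT_TYPE = "application/octet-stream";

    // Reads <mime-type type="..."><glob pattern="*.ext"/></mime-type> entries
    // into a map from lower-cased suffix (".ext") to type name.
    static Map<String, String> loadGlobs(InputStream xml) {
        Map<String, String> suffixToType = new LinkedHashMap<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(xml);
            NodeList types = doc.getElementsByTagName("mime-type");
            for (int i = 0; i < types.getLength(); i++) {
                Element type = (Element) types.item(i);
                NodeList globs = type.getElementsByTagName("glob");
                for (int j = 0; j < globs.getLength(); j++) {
                    String pattern = ((Element) globs.item(j)).getAttribute("pattern");
                    if (pattern.startsWith("*.")) { // other glob forms are skipped
                        suffixToType.put(pattern.substring(1).toLowerCase(Locale.ROOT),
                                type.getAttribute("type"));
                    }
                }
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse mime-types XML", e);
        }
        return suffixToType;
    }

    // Returns the type of the longest matching suffix, or DEFAULT_TYPE.
    static String detect(Map<String, String> suffixToType, String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        String bestSuffix = "";
        String bestType = DEFAULT_TYPE;
        for (Map.Entry<String, String> e : suffixToType.entrySet()) {
            if (lower.endsWith(e.getKey()) && e.getKey().length() > bestSuffix.length()) {
                bestSuffix = e.getKey();
                bestType = e.getValue();
            }
        }
        return bestType;
    }

    public static void main(String[] args) {
        // The three custom types from this thread, wrapped in an assumed root element.
        String xml = "<mime-info>"
                + "<mime-type type=\"product/fei-out\"><glob pattern=\"*.out\"/></mime-type>"
                + "<mime-type type=\"product/fei-ecsv\"><glob pattern=\"*.ecsv\"/></mime-type>"
                + "<mime-type type=\"product/fei-sfdu\"><glob pattern=\"*.sfdu\"/></mime-type>"
                + "</mime-info>";
        Map<String, String> globs = loadGlobs(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        for (String name : new String[] {"a.out", "b.ecsv", "c.sfdu", "d.bin"}) {
            System.out.println(name + " -> " + detect(globs, name));
        }
    }
}
```

If the crawler still reports application/octet-stream or text/plain for files
that a lookup like this resolves to your product/* types, that would point at
the custom file not being loaded rather than at the globs themselves.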

Let me know!
Tyler

On Thu, Jan 22, 2015 at 5:40 PM, Tyler Palsulich <tpalsul...@gmail.com>
wrote:

> Hi Val,
>
> Yes, I think you've hit the nail on the head -- if Tika isn't passed your
> updated mimetypes configuration file (with your custom types), then those
> files will not be properly identified. I'll look into this issue more
> tonight and hopefully find a fix. :)
>
> > by default tika only knows about xml files, text files,
> > application/octet-stream files.
> I'm not sure what you mean by this? Tika knows about much more than that,
> but is there an OODT config that overrides that?
>
> > I'm a newbie with Java and I can't guarantee I would be able to build a
> > JUnit test program very easily. But I will continue to investigate and see
> > what I can do.
> No worries! :) If you have time and want to try your hand at it, the best
> way to learn is by looking at the existing tests, like in
> https://github.com/apache/oodt/blob/trunk/metadata/src/test/org/apache/oodt/cas/metadata/util/TestMimeTypeUtils.java
> .
>
> Have a good night,
> Tyler
>
> On Thu, Jan 22, 2015 at 2:22 PM, Mallder, Valerie <
> valerie.mall...@jhuapl.edu> wrote:
>
>> Hi Tyler,
>>
>> Can you tell me more about the tika-mimetypes.xml file? Is this a new
>> 'required' file?  I'm not 100% sure about this yet, but it seems to me
>> that, since MimeTypeUtils.java instantiates Tika with the default
>> constructor, and never explicitly tells Tika which mime-types file to use
>> (even though the correct mime-types.xml file is passed to the MimeTypeUtils
>> constructor from MimeExtractorRepo) there is no place where the contents of
>> my mime-types.xml file are being read and stored in Tika's
>> MimeTypeRegistry, and by default tika only knows about xml files, text
>> files, application/octet-stream files.
>>
>> I will keep looking at this tomorrow and verify which file is passed to
>> Tika's MimeTypesFactory class, but I have to head home now.
>>
>> Val
>>
>>
>>
>>
>> Valerie A. Mallder
>> New Horizons Deputy Mission System Engineer
>> Johns Hopkins University/Applied Physics Laboratory
>>
>>
>> -----Original Message-----
>> From: Mallder, Valerie
>> Sent: Thursday, January 22, 2015 11:42 AM
>> To: dev
>> Subject: RE: Tyler - I may need your help
>>
>> Hi Tyler,
>>
>> I have defined a few custom mime types in my filemgr/etc/mime-types.xml
>> file. The contents of my file look exactly like the contents of
>> http://svn.apache.org/viewvc/oodt/tags/0.8/filemgr/src/main/resources/mime-types.xml
>> with the addition of project-specific mime-types.  The tika-mimetypes.xml
>> file you pointed me to has ~2000 additional lines in it as compared to the
>> http://svn.apache.org/viewvc/oodt/tags/0.8/filemgr/src/main/resources/mime-types.xml
>> file and the
>> http://svn.apache.org/viewvc/oodt/tags/0.8/mvn/archetypes/radix/src/main/resources/archetype-resources/filemgr/src/main/resources/etc/mime-types.xml
>> file. So, it is definitely different than the one I've been using. But, I
>> copied it over and added my mime types to it, and it didn't help.  The mime
>> types it is returning are 'reasonable' mime-types to return, they are just
>> not the mime-types that I defined them as.  For instance, I have *.sfdu
>> files and *.out files that contain binary data, and tika says they are
>> "application/octet-stream" files.  I also have *.ecsv files that contain
>> text, and tika says they are "text/plain" files.
>>
>> But here are the mime-types I defined for these files for my project, and
>> these are the mime-types that I have defined extractors for.  None of these
>> filename extensions "*.out, *.ecsv, and *.sfdu" are defined elsewhere in
>> the mime-types.xml file.
>>
>> <mime-type type="product/fei-out">
>>     <glob pattern="*.out"/>
>> </mime-type>
>>
>> <mime-type type="product/fei-ecsv">
>>     <glob pattern="*.ecsv"/>
>> </mime-type>
>>
>> <mime-type type="product/fei-sfdu">
>>      <glob pattern="*.sfdu"/>
>> </mime-type>
>>
>> I'm a newbie with Java and I can't guarantee I would be able to build a
>> JUnit test program very easily. But I will continue to investigate and see
>> what I can do.
>>
>> Thanks!
>>
>> Val
>>
>>
>>
>>
>> Valerie A. Mallder
>> New Horizons Deputy Mission System Engineer
>> Johns Hopkins University/Applied Physics Laboratory
>>
>>
>> > -----Original Message-----
>> > From: Tyler Palsulich [mailto:tpalsul...@gmail.com]
>> > Sent: Wednesday, January 21, 2015 5:13 PM
>> > To: dev
>> > Subject: Re: Tyler - I may need your help
>> >
>> > Hi Val,
>> >
>> > Hmm... Is there a particular (wrong) mime-type that keeps getting
>> > detected (like text/plain, or something)? I'm curious if the type is
>> > just returning a default. Or, is it a seemingly random file type? What
>> > are the contents of your mime-types.xml file?
>> > If it's different than
>> > https://raw.githubusercontent.com/apache/tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml,
>> > can you try copying it over?
>> >
>> > I'm not sure I'll be able to replicate your error on my computer
>> > without a bit of difficulty. Do you think there is any way you could
>> > create a JUnit test case with the problem?
>> >
>> > Tyler
>> >
>> >
>> > On Wed, Jan 21, 2015 at 1:26 PM, Mallder, Valerie <
>> > valerie.mall...@jhuapl.edu>
>> > wrote:
>> >
>> > > Hi Tyler,
>> > >
>> > > I have been looking into an issue that cropped up in my OODT
>> > > system when I upgraded to OODT 0.8. The issue is, my
>> > > AutoDetectProductCrawler, which is launched from a PGETaskInstance
>> > > is unable to determine the mime-type for my product files.  I am
>> > > using the same filemgr/etc/mime-types.xml file that I was using with
>> > > OODT 0.7, and I am using the same
>> > > oodt/extensions/policy/mime-extractor-map.xml file that I was using
>> > > with OODT 0.7, but now, in MimeExtractorRepo::getExtractorSpecsForFile,
>> > > the call to
>> > > this.mimeRepo.getMimeType(file) is returning the wrong mime-types
>> > > for all of my files, and so the AutoDetectProductCrawler is telling
>> > > me I have no extractor specs for my files.
>> > >
>> > > I noticed that you did some work on MimeTypeUtils for OODT-630 in
>> > > OODT 0.8. At first glance, it doesn't look like any of this work
>> > > would be directly responsible. Can you think of anything that might
>> > > be causing this to happen? I don't know anything about tika. Do I
>> > > need to make any changes to my policy files to remain compatible?
>> > > Just looking for clues on how to resolve this.  I have verified by
>> > > adding log messages throughout the code that, prior to launching the
>> > > AutoDetectProductCrawler, all of the policy files are read correctly.
>> > > The MimeExtractorConfigReader is reading the correct
>> > > mime-extractor-map.xml file, and it is calling setMimeRepoFile with
>> > > the correct mime-types.xml file, and it is setting the correct
>> > > extractor config file, etc. But, once AutoDetectProductCrawler
>> > > starts crawling it tries to call getExtractorSpecsForFile but determines
>> > > the wrong mime type and then can't find the extractor spec.
>> > >
>> > > Thanks,
>> > > Val
>> > >
>> > >
>> > >
>> > > Valerie A. Mallder
>> > >
>> > > New Horizons Deputy Mission System Engineer
>> > > The Johns Hopkins University/Applied Physics Laboratory
>> > > 11100 Johns Hopkins Rd (MS 23-282), Laurel, MD 20723
>> > > 240-228-7846 (Office) 410-504-2233 (Blackberry)
>> > >
>> > >
>>
>
>
