Hi Val,

Please see OODT-805 and
https://github.com/apache/oodt/commit/cf1220d4ac66ccefc8e510c62fb6b38cf529ffb2
for what I believe is the fix. Can you make the MimeTypeUtils changes locally
or try out trunk? Let me know!

Tyler

On Thu, Jan 22, 2015 at 5:40 PM, Tyler Palsulich <tpalsul...@gmail.com> wrote:

> Hi Val,
>
> Yes, I think you've hit the nail on the head -- if Tika isn't passed your
> updated mimetypes configuration file (with your custom types), then those
> files will not be properly identified. I'll look into this issue more
> tonight and hopefully find a fix. :)
>
> > by default tika only knows about xml files, text files,
> > application/octet-stream files.
> I'm not sure what you mean by this? Tika knows about much more than that,
> but is there an OODT config that overrides that?
>
> > I'm a newbie with Java and I can't guarantee I would be able to build a
> > JUnit test program very easily. But I will continue to investigate and see
> > what I can do.
> No worries! :) If you have time and want to try your hand at it, the best
> way to learn is by looking at the existing tests, like in
> https://github.com/apache/oodt/blob/trunk/metadata/src/test/org/apache/oodt/cas/metadata/util/TestMimeTypeUtils.java
>
> Have a good night,
> Tyler
>
> On Thu, Jan 22, 2015 at 2:22 PM, Mallder, Valerie <
> valerie.mall...@jhuapl.edu> wrote:
>
>> Hi Tyler,
>>
>> Can you tell me more about the tika-mimetypes.xml file? Is this a new
>> 'required' file? I'm not 100% sure about this yet, but it seems to me
>> that, since MimeTypeUtils.java instantiates Tika with the default
>> constructor, and never explicitly tells Tika which mime-types file to use
>> (even though the correct mime-types.xml file is passed to the MimeTypeUtils
>> constructor from MimeExtractorRepo), there is no place where the contents of
>> my mime-types.xml file are being read and stored in Tika's
>> MimeTypeRegistry, and by default tika only knows about xml files, text
>> files, application/octet-stream files.
>>
>> I will keep looking at this tomorrow and verify which file is passed to
>> Tika's MimeTypesFactory class, but I have to head home now.
>>
>> Val
>>
>> Valerie A. Mallder
>> New Horizons Deputy Mission System Engineer
>> Johns Hopkins University/Applied Physics Laboratory
>>
>> -----Original Message-----
>> From: Mallder, Valerie
>> Sent: Thursday, January 22, 2015 11:42 AM
>> To: dev
>> Subject: RE: Tyler - I may need your help
>>
>> Hi Tyler,
>>
>> I have defined a few custom mime types in my filemgr/etc/mime-types.xml
>> file. The contents of my file look exactly like the contents of
>> http://svn.apache.org/viewvc/oodt/tags/0.8/filemgr/src/main/resources/mime-types.xml
>> with the addition of project-specific mime-types. The tika-mimetypes.xml
>> file you pointed me to has ~2000 additional lines in it as compared to the
>> http://svn.apache.org/viewvc/oodt/tags/0.8/filemgr/src/main/resources/mime-types.xml
>> file and the
>> http://svn.apache.org/viewvc/oodt/tags/0.8/mvn/archetypes/radix/src/main/resources/archetype-resources/filemgr/src/main/resources/etc/mime-types.xml
>> file. So, it is definitely different than the one I've been using. But, I
>> copied it over and added my mime types to it, and it didn't help. The mime
>> types it is returning are 'reasonable' mime-types to return, they are just
>> not the mime-types that I defined them as. For instance, I have *.sfdu
>> files and *.out files that contain binary data, and tika says they are
>> "application/octet-stream" files. I also have *.ecsv files that contain
>> text, and tika says they are "text/plain" files.
>>
>> But here are the mime-types I defined for these files for my project, and
>> these are the mime-types that I have defined extractors for. None of these
>> filename extensions "*.out, *.ecsv, and *.sfdu" are defined elsewhere in
>> the mime-types.xml file.
>>
>> <mime-type type="product/fei-out">
>>   <glob pattern="*.out"/>
>> </mime-type>
>>
>> <mime-type type="product/fei-ecsv">
>>   <glob pattern="*.ecsv"/>
>> </mime-type>
>>
>> <mime-type type="product/fei-sfdu">
>>   <glob pattern="*.sfdu"/>
>> </mime-type>
>>
>> I'm a newbie with Java and I can't guarantee I would be able to build a
>> JUnit test program very easily. But I will continue to investigate and see
>> what I can do.
>>
>> Thanks!
>>
>> Val
>>
>> Valerie A. Mallder
>> New Horizons Deputy Mission System Engineer Johns Hopkins
>> University/Applied Physics Laboratory
>>
>> > -----Original Message-----
>> > From: Tyler Palsulich [mailto:tpalsul...@gmail.com]
>> > Sent: Wednesday, January 21, 2015 5:13 PM
>> > To: dev
>> > Subject: Re: Tyler - I may need your help
>> >
>> > Hi Val,
>> >
>> > Hmm... Is there a particular (wrong) mime-type that keeps getting
>> > detected (like text/plain, or something)? I'm curious if the type is
>> > just returning a default. Or, is it a seemingly random file type? What
>> > are the contents of your mime-types.xml file?
>> > If it's different than
>> > https://raw.githubusercontent.com/apache/tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml,
>> > can you try copying it over?
>> >
>> > I'm not sure I'll be able to replicate your error on my computer
>> > without a bit of difficulty. Do you think there is any way you could
>> > create a JUnit test case with the problem?
>> >
>> > Tyler
>> >
>> > On Wed, Jan 21, 2015 at 1:26 PM, Mallder, Valerie <
>> > valerie.mall...@jhuapl.edu>
>> > wrote:
>> >
>> > > Hi Tyler,
>> > >
>> > > I have been looking into an issue that cropped up in my OODT
>> > > system when I upgraded to OODT 0.8. The issue is, my
>> > > AutoDetectProductCrawler, which is launched from a PGETaskInstance,
>> > > is unable to determine the mime-type for my product files.
>> > > I am using the same filemgr/etc/mime-types.xml file that I was using
>> > > with OODT 0.7, and I am using the same
>> > > oodt/extensions/policy/mime-extractor-map.xml file that I was using
>> > > with OODT 0.7, but now, in MimeTypeRepo::getExtractorSpecsForFile,
>> > > the call to this.mimeRepo.getMimeType(file) is returning the wrong
>> > > mime-types for all of my files, and so the AutoDetectProductCrawler
>> > > is telling me I have no extractor specs for my files.
>> > >
>> > > I noticed that you did some work on MimeTypeUtils for OODT-630 in
>> > > OODT 0.8. At first glance, it doesn't look like any of this work
>> > > would be directly responsible. Can you think of anything that might
>> > > be causing this to happen? I don't know anything about tika. Do I
>> > > need to make any changes to my policy files to remain compatible?
>> > > Just looking for clues on how to resolve this. I have verified by
>> > > adding log messages throughout the code that, prior to launching the
>> > > AutoDetectProductCrawler, all of the policy files are read correctly.
>> > > The MimeExtractorConfigReader is reading the correct
>> > > mime-extractor-map.xml file, and it is calling setMimeRepoFile with
>> > > the correct mime-types.xml file, and it is setting the correct
>> > > extractor config file, etc. But, once AutoDetectProductCrawler
>> > > starts crawling, it tries to getExtractorSpecsForFile but determines
>> > > the wrong mime type and then can't find the extractor spec.
>> > >
>> > > Thanks,
>> > > Val
>> > >
>> > > Valerie A. Mallder
>> > >
>> > > New Horizons Deputy Mission System Engineer The Johns Hopkins
>> > > University/Applied Physics Laboratory
>> > > 11100 Johns Hopkins Rd (MS 23-282), Laurel, MD 20723
>> > > 240-228-7846 (Office) 410-504-2233 (Blackberry)
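
For reference, here is a minimal sketch of the glob lookup Val expected for the three custom types quoted in the thread. It uses only the JDK and is not how Tika or MimeTypeUtils implement detection. The class name GlobMimeSketch, its methods, and the <mime-info> wrapper element are assumptions made for illustration. The fallback to application/octet-stream only mirrors what the thread reports for binary files when the custom globs are never loaded; it does not model Tika's content-based detection (for example, the text/plain result for *.ecsv).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical illustration class; not part of OODT or Tika.
class GlobMimeSketch {
    // Type returned when no glob matches (assumption for this sketch).
    static final String FALLBACK = "application/octet-stream";

    // The three custom entries from the thread, wrapped in an assumed root element.
    static final String CUSTOM_TYPES =
        "<mime-info>"
        + "<mime-type type=\"product/fei-out\"><glob pattern=\"*.out\"/></mime-type>"
        + "<mime-type type=\"product/fei-ecsv\"><glob pattern=\"*.ecsv\"/></mime-type>"
        + "<mime-type type=\"product/fei-sfdu\"><glob pattern=\"*.sfdu\"/></mime-type>"
        + "</mime-info>";

    // Parse <mime-type type="..."><glob pattern="..."/></mime-type> entries
    // into an ordered map of glob pattern -> mime type.
    static Map<String, String> loadGlobs(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> globs = new LinkedHashMap<>();
            NodeList types = doc.getElementsByTagName("mime-type");
            for (int i = 0; i < types.getLength(); i++) {
                Element type = (Element) types.item(i);
                NodeList patterns = type.getElementsByTagName("glob");
                for (int j = 0; j < patterns.getLength(); j++) {
                    Element glob = (Element) patterns.item(j);
                    globs.put(glob.getAttribute("pattern"), type.getAttribute("type"));
                }
            }
            return globs;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse mime-type definitions", e);
        }
    }

    // Return the type of the first glob that matches the file name, else the fallback.
    static String detect(Map<String, String> globs, String fileName) {
        Path name = Paths.get(fileName).getFileName();
        for (Map.Entry<String, String> entry : globs.entrySet()) {
            if (FileSystems.getDefault().getPathMatcher("glob:" + entry.getKey()).matches(name)) {
                return entry.getValue();
            }
        }
        return FALLBACK;
    }

    public static void main(String[] args) {
        Map<String, String> custom = loadGlobs(CUSTOM_TYPES);
        for (String file : new String[] {"telemetry.sfdu", "results.out", "table.ecsv"}) {
            System.out.println(file + " -> " + detect(custom, file));
        }
    }
}
```

Calling detect with an empty map stands in for the situation described above, where the custom definitions never reach the detector: every file falls through to the default type.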