Thomas et al., the wiki page looks great! Thomas, thank you for giving some concrete examples with an explanation of the problem you were trying to solve. I think the community will get a lot of mileage out of the page.
-Cameron

On Mon, Jun 6, 2011 at 5:10 AM, Thomas Bennett <[email protected]> wrote:
> Hi,
>
> Yes, thanks for the help Brian! Sorry for the late reply.
>
> I finally got around to getting my AutoDetectProductCrawler working.
> In response, Chris, I hope you don't mind that I've given some feedback
> about my experiences with the crawler on the wiki page that you created
> below. I hope that's okay. Please feel free to modify/add/revert as you
> wish.
>
> Cheers,
> Tom
>
> On 4 June 2011 07:40, Mattmann, Chris A (388J) <[email protected]> wrote:
>> Brian, I created a wiki page with your guidance below:
>>
>> https://cwiki.apache.org/confluence/display/OODT/OODT+Crawler+Help
>>
>> Others can feel free to jump on and contribute.
>>
>> Cheers,
>> Chris
>>
>> On Jun 1, 2011, at 2:20 PM, holenoter wrote:
>>
>>> hey thomas,
>>>
>>> you are using StdProductCrawler, which assumes a *.met file already
>>> exists for each file (its only precondition is the existence of the
>>> *.met file). if you want a *.met file generated, you will have to use
>>> one of the other 2 crawlers. running:
>>>
>>>   ./crawler_launcher -psc
>>>
>>> will give you a list of supported crawlers. you can then run:
>>>
>>>   ./crawler_launcher -h -cid <crawler_id>
>>>
>>> where <crawler_id> is one of the ids from the previous command.
>>> unfortunately i don't think the other crawlers are documented all that
>>> extensively. MetExtractorProductCrawler will use a single extractor
>>> for all files.
>>> AutoDetectProductCrawler requires a mapping file to be filled out and
>>> mime-types defined.
>>>
>>> * MetExtractorProductCrawler example configuration can be found in the
>>>   source:
>>>   - allows you to specify how the crawler will run your extractor
>>>   https://svn.apache.org/repos/asf/oodt/trunk/metadata/src/main/resources/examples/extern-config.xml
>>>
>>> * AutoDetectProductCrawler example configuration can be found in the
>>>   source:
>>>   - uses the same metadata extractor specification file (you will have
>>>     one of these for each mime-type)
>>>   - allows you to define your mime-types -- that is, give a mime-type
>>>     for a given filename regular expression
>>>   https://svn.apache.org/repos/asf/oodt/trunk/crawler/src/main/resources/examples/mimetypes.xml
>>>
>>>   - your file might look something like:
>>>
>>>     <mime-info>
>>>       <mime-type type="product/hdf5">
>>>         <glob pattern="*.h5"/>
>>>       </mime-type>
>>>     </mime-info>
>>>
>>>   - maps your mime-types to extractors
>>>   https://svn.apache.org/repos/asf/oodt/trunk/crawler/src/main/resources/examples/mime-extractor-map.xml
>>>
>>> Hope this helps . . .
>>> -brian
>>>
>>> On Jun 01, 2011, at 12:54 PM, Thomas Bennett <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I've successfully got the CmdLineIngester working with an
>>>> ExternMetExtractor (written in Python).
>>>>
>>>> However, when I try to launch the crawler I get a warning telling me
>>>> the preconditions for ingest have not been met. No .met file has been
>>>> created.
>>>>
>>>> Two questions:
>>>> 1) I'm just wondering if there is any configuration that I'm missing.
>>>> 2) Where should I start hunting in the code or logs to find out why
>>>>    my met extractor was not run?
>>>>
>>>> Kind regards,
>>>> Thomas
>>>>
>>>> For your reference, here is the command and output.
>>>> bin$ ./crawler_launcher --crawlerId StdProductCrawler \
>>>>   --productPath /usr/local/meerkat/data/staging/products/hdf5 \
>>>>   --filemgrUrl http://localhost:9000 \
>>>>   --failureDir /tmp \
>>>>   --actionIds DeleteDataFile MoveDataFileToFailureDir Unique \
>>>>   --metFileExtension met \
>>>>   --clientTransferer org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory \
>>>>   --metExtractor org.apache.oodt.cas.metadata.extractors.ExternMetExtractor \
>>>>   --metExtractorConfig /usr/local/meerkat/extractors/katextractor/katextractor.config
>>>>
>>>> http://localhost:9000
>>>> StdProductCrawler
>>>> Jun 1, 2011 9:48:07 PM org.apache.oodt.cas.crawl.ProductCrawler crawl
>>>> INFO: Crawling /usr/local/meerkat/data/staging/products/hdf5
>>>> Jun 1, 2011 9:48:07 PM org.apache.oodt.cas.crawl.ProductCrawler handleFile
>>>> INFO: Handling file /usr/local/meerkat/data/staging/products/hdf5/1263940095.h5
>>>> Jun 1, 2011 9:48:07 PM org.apache.oodt.cas.crawl.ProductCrawler handleFile
>>>> WARNING: Failed to pass preconditions for ingest of product:
>>>> [/usr/local/meerkat/data/staging/products/hdf5/1263940095.h5]
>>
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Senior Computer Scientist
>> NASA Jet Propulsion Laboratory, Pasadena, CA 91109 USA
>> Office: 171-266B, Mailstop: 171-246
>> Email: [email protected]
>> WWW: http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Assistant Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

> --
> Thomas Bennett
>
> SKA South Africa
>
> Office : +2721 506 7341
> Mobile : +2779 523 7105
> Email  : [email protected]

--
Sent from a Tin Can attached to a String
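[Editor's note] The warning in the log above is StdProductCrawler's met-file precondition failing: with `--metFileExtension met` it expects a sibling `<filename>.met` next to each product, and nothing generates that file for you. One workaround is to pre-generate the .met files before crawling. Below is a minimal sketch in Python (the language Thomas used for his extractor), assuming the CAS keyval XML layout for .met files; the key names `ProductType` and `Filename` are purely illustrative, so check both the layout and the required keys against your OODT version:

```python
# Sketch (assumption): pre-generate a sibling .met file so that
# StdProductCrawler's single precondition (an existing <file>.met) passes.
# The CAS keyval XML layout used here should be verified against your
# OODT release; the metadata keys below are illustrative, not required.
import os
from xml.sax.saxutils import escape

def write_met_file(product_path, metadata):
    """Write <product_path>.met in a CAS-style keyval XML layout."""
    met_path = product_path + ".met"
    lines = ['<cas:metadata xmlns:cas="http://oodt.jpl.nasa.gov/1.0/cas">']
    for key, val in metadata.items():
        lines.append("  <keyval>")
        lines.append("    <key>%s</key>" % escape(key))
        lines.append("    <val>%s</val>" % escape(str(val)))
        lines.append("  </keyval>")
    lines.append("</cas:metadata>")
    with open(met_path, "w") as f:
        f.write("\n".join(lines) + "\n")
    return met_path

def missing_met_files(staging_dir, met_ext="met"):
    """Return staged files lacking the <name>.<met_ext> companion that
    StdProductCrawler looks for."""
    missing = []
    for name in sorted(os.listdir(staging_dir)):
        path = os.path.join(staging_dir, name)
        if name.endswith(".h5") and not os.path.isfile(path + "." + met_ext):
            missing.append(path)
    return missing
```

Running `missing_met_files` over the staging directory before launching the crawler shows exactly which products would fail the precondition; `write_met_file` can then fill the gaps. The alternative, as Brian explains above, is to switch to MetExtractorProductCrawler or AutoDetectProductCrawler so the crawler runs the extractor itself.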
