Hi Sheryl,
First off, I tried to run crawler_launcher with the "-autoPC" option.
I then got the following warning messages:
Aug 10, 2012 11:12:26 AM org.apache.oodt.cas.crawl.ProductCrawler handleFile
WARNING: Failed to pass preconditions for ingest of product:
[/home/yhkang/oodt-0.5/cas-pushpull/staging/TESL2CO2/TES-Aura_L2-CO2-Nadir_r0000002147_F06_09.he5]
Aug 10, 2012 11:12:26 AM org.apache.oodt.cas.crawl.ProductCrawler handleFile
INFO: Handling file
/home/yhkang/oodt-0.5/cas-pushpull/staging/TESL2CO2/TES-Aura_L2-CO2-Nadir_r0000002147_F06_09.he5.info.tmp
Aug 10, 2012 11:12:26 AM org.apache.oodt.cas.crawl.ProductCrawler handleFile
WARNING: Failed to pass preconditions for ingest of product:
[/home/yhkang/oodt-0.5/cas-pushpull/staging/TESL2CO2/TES-Aura_L2-CO2-Nadir_r0000002147_F06_09.he5.info.tmp]
I think the warnings are related to the ingest preconditions.
It seems that the way I specified the "-pids" option for the
preconditions in my run script was wrong:
#!/bin/sh
export STAGE_AREA=/home/yhkang/oodt-0.5/cas-pushpull/staging/TESL2CO2
./crawler_launcher \
-op -stdPC \
-mfx tmp \
--productPath $STAGE_AREA \
--filemgrUrl http://localhost:8000 \
--failureDir /tmp \
--actionIds DeleteDataFile MoveDataFileToFailureDir Unique \
--metFileExtension tmp \
-pids CheckThatDataFileSizeIsGreaterThanZero \
--clientTransferer org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory
Could you let me know how to fix this warning?
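From the precondition's name, I assume CheckThatDataFileSizeIsGreaterThanZero rejects zero-byte products. As a sanity check on my side (this is my own sketch of an equivalent shell test, not crawler code, and the file names are made up), I would look for files in the staging area that would fail it like this:

```shell
# Build a throwaway staging area with one empty and one non-empty file,
# then list the zero-byte files that would trip the precondition.
STAGE_AREA=$(mktemp -d)
touch "$STAGE_AREA/TES-empty.he5"
echo "data" > "$STAGE_AREA/TES-ok.he5"
# -size 0 matches only empty files; only TES-empty.he5 should print
find "$STAGE_AREA" -type f -size 0 -print
```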
Next, I applied the metadata-crawler options to the run script:
#!/bin/sh
export STAGE_AREA=/home/yhkang/oodt-0.5/cas-pushpull/staging/TESL2CO2
./crawler_launcher \
-op -metPC \
-pp $STAGE_AREA \
-fm http://localhost:8000 \
-mxc ../policy/crawler-config.xml \
-mx org.apache.oodt.cas.metadata.extractors.ExternMetExtractor \
-mxr ../policy/mime-extractor-map.xml \
--failureDir /tmp \
--actionIds DeleteDataFile MoveDataFileToFailureDir Unique \
--metFileExtension tmp \
--clientTransferer org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory
This time I got the following error message:
ERROR: Failed to launch crawler : Error creating bean with name
'MetExtractorProductCrawler' defined in file
[/home/yhkang/oodt-0.5/cas-crawler-0.5-SNAPSHOT/bin/../policy/crawler-beans.xml]:
Error setting property values; nested exception is
org.springframework.beans.PropertyBatchUpdateException; nested
PropertyAccessExceptions (1) are:
PropertyAccessException 1:
org.springframework.beans.MethodInvocationException: Property
'metExtractor' threw exception; nested exception is
org.apache.oodt.cas.metadata.exceptions.MetExtractionException: Failed
to parse config file : Failed to parser
'/home/yhkang/oodt-0.5/cas-crawler-0.5-SNAPSHOT/policy/crawler-config.xml'
: null
I just used the crawler-config.xml file (shown below) from the
policy directory:
<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:p="http://www.springframework.org/schema/p"
xsi:schemaLocation="http://www.springframework.org/schema/beans
http://www.springframework.org/schema/beans/spring-beans-2.5.xsd">
<bean
class="org.apache.oodt.cas.crawl.util.CasPropertyOverrideConfigurer"
/>
<import resource="crawler-beans.xml" />
<import resource="action-beans.xml" />
<import resource="precondition-beans.xml" />
<import resource="naming-beans.xml" />
</beans>
So I need to understand how to write the XML files (crawler-beans.xml,
action-beans.xml, etc.) that are imported into crawler-config.xml.
Could you share your experience with me?
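For reference, my current understanding is that these imported files are just plain Spring bean definitions. For example, I would expect an action-beans.xml entry to look roughly like this (the class name is my guess from the 0.5 default policy files, so please correct me if it is wrong):

```xml
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
       http://www.springframework.org/schema/beans/spring-beans-2.5.xsd">
  <!-- the bean id is what gets passed to --actionIds -->
  <bean id="DeleteDataFile"
        class="org.apache.oodt.cas.crawl.action.DeleteFile"/>
</beans>
```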
Thanks,
Yunhee
2012/8/10 Sheryl John <[email protected]>:
> Hi Yunhee,
>
> What are the error messages you get while running the crawler?
>
> I've faced similar issues with the crawler when I tried it out the first time too.
> I went through the crawler user guide to understand the architecture, but
> I only understood how it worked after running the crawler several times
> to ingest files.
> I agree we need to update the guide and if you want to know about the
> MetExtractorProductCrawler and AutoDetectProductCrawler, the wiki page that
> I mentioned before will give you an idea how to get it working (It mentions
> the config files that you need to write for the above two crawlers).
>
>
>
> On Thu, Aug 9, 2012 at 6:27 AM, YunHee Kang <[email protected]> wrote:
>
>> Hi Chris,
>>
>> I got a bunch of error messages when running the crawler_launcher script.
>> First off, I think I need to understand how a crawler works.
>> Can I get some materials to help me write configuration files for
>> crawler_launcher ?
>>
>> Honestly, I am not familiar with the Crawler.
>> But I will try to file a JIRA issue to update the Crawler user guide.
>>
>> Thanks,
>> Yunhee
>>
>>
>>
>> 2012/8/9 Mattmann, Chris A (388J) <[email protected]>:
>> > Hi YunHee,
>> >
>> > Sorry, we need to update the docs, that is for sure. Can you help
>> > us remember by filing a JIRA issue to update the Crawler user
>> > guide and to fix the URL there?
>> >
>> > As for crawlerId, yes it's obsolete, you can find the modern
>> > 0.4 and 0.5-trunk options by running ./crawler_launcher -h
>> >
>> > Cheers,
>> > Chris
>> >
>> > On Aug 7, 2012, at 7:03 AM, YunHee Kang wrote:
>> >
>> >> Hi Chris and Sheryl,
>> >>
>> >> I understood my mistake after removing the trailing "/" from the URL.
>> >> But the Apache OODT homepage
>> >> (http://oodt.apache.org/components/maven/crawler/user/) shows the
>> >> wrong URL as an option of crawler_launcher:
>> >> --filemgrUrl http://localhost:9000/ \
>> >> So it confused me.
>> >>
>> >> I tried to run the command below, following the Apache OODT home page:
>> >> $ ./crawler_launcher --crawlerId MetExtractorProductCrawler
>> >> ERROR: Invalid option: 'crawlerId'
>> >>
>> >> But the error described above occurred.
>> >> Is the option 'crawlerId' obsolete?
>> >>
>> >> Thanks,
>> >> Yunhee
>> >>
>> >>
>> >> 2012/8/7 Mattmann, Chris A (388J) <[email protected]>:
>> >>> Perfect, Sheryl, my thoughts exactly.
>> >>>
>> >>> Cheers,
>> >>> Chris
>> >>>
>> >>> On Aug 6, 2012, at 10:01 AM, Sheryl John wrote:
>> >>>
>> >>>> Hi Yunhee,
>> >>>>
>> >>>> Check out this OODT wiki for crawler :
>> >>>> https://cwiki.apache.org/confluence/display/OODT/OODT+Crawler+Help
>> >>>>
>> >>>> Did you try giving 'http://localhost:8000' without the "/" at the end?
>> >>>> Also, specify
>> >>>> 'org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory'
>> >>>> for the 'clientTransferer' option.
>> >>>>
>> >>>>
>> >>>> On Mon, Aug 6, 2012 at 9:46 AM, YunHee Kang <[email protected]> wrote:
>> >>>>
>> >>>>> Hi Chris,
>> >>>>>
>> >>>>> I got an error message when I tried to run crawler_launcher using a
>> >>>>> shell script. The error message may be caused by a wrong URL for the
>> >>>>> filemgr:
>> >>>>> $ ./crawler_launcher.sh
>> >>>>> ERROR: Validation Failures: - Value 'http://localhost:8000/' is not
>> >>>>> allowed for option
>> >>>>> [longOption='filemgrUrl',shortOption='fm',description='File Manager
>> >>>>> URL'] - Allowed values = [http://.*:\d*]
>> >>>>>
>> >>>>> The following is the shell script that I wrote:
>> >>>>> $ cat crawler_launcher.sh
>> >>>>> #!/bin/sh
>> >>>>> export STAGE_AREA=/home/yhkang/oodt-0.5/cas-pushpull/staging/TESL2CO2
>> >>>>> ./crawler_launcher \
>> >>>>> -op --launchStdCrawler \
>> >>>>> --productPath $STAGE_AREA\
>> >>>>> --filemgrUrl http://localhost:8000/\
>> >>>>> --failureDir /tmp \
>> >>>>> --actionIds DeleteDataFile MoveDataFileToFailureDir Unique \
>> >>>>> --metFileExtension tmp \
>> >>>>> --clientTransferer org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferer
>> >>>>>
>> >>>>> I am wondering if the problem is in the filemgr URL or elsewhere.
>> >>>>>
>> >>>>> Thanks,
>> >>>>> Yunhee
>> >>>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> --
>> >>>> -Sheryl
>> >>>
>> >>>
>> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> >>> Chris Mattmann, Ph.D.
>> >>> Senior Computer Scientist
>> >>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> >>> Office: 171-266B, Mailstop: 171-246
>> >>> Email: [email protected]
>> >>> WWW: http://sunset.usc.edu/~mattmann/
>> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> >>> Adjunct Assistant Professor, Computer Science Department
>> >>> University of Southern California, Los Angeles, CA 90089 USA
>> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> >>>
>> >
>> >
>> >
>>
>
>
>
> --
> -Sheryl