$nutch/build/plugins is mentioned in my classpath, but it is still showing the error. Is there any other approach to implement the crawl without using Crawl?

-----Original Message-----
From: Doğacan Güney [mailto:doga...@gmail.com]
Sent: Monday, July 27, 2009 7:32 PM
To: nutch-dev@lucene.apache.org
Subject: Re: Running the Crawl without using bin/nutch inside a Scala program
On Mon, Jul 27, 2009 at 16:47, Sailaja Dhiviti <sailaja_dhiv...@persistent.co.in> wrote:
> Hi,
>
> I am trying to run the crawl inside a Scala program without using the
> bin/nutch command. I am setting all the environment variables that nutch.sh
> sets when the crawl runs through bin/nutch, and then calling
> Crawl.main(params). I am getting the following error:
>
> Exception in thread "main" java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
>         at org.apache.nutch.crawl.Injector.inject(Injector.java:160)
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:113)
>
> Here is the code I am trying to write:
>
> import scala.io.Source
>
> for (line <- Source.fromFile("/root/classpaths.sh").getLines) {
>   if (line != null) {
>     // run each line of the script in its own shell
>     val cmd = Array("bash", "-c", line)
>     val checkingCrawl: Process = Runtime.getRuntime.exec(cmd)
>     checkingCrawl.waitFor() // wait for the command to finish
>   }
> }
>
> val params = Array("urls", "-dir", "insidejava", "-depth", "1")
> org.apache.nutch.crawl.Crawl.main(params)
>
> Contents of classpaths.sh:
>
> JAVA=$JAVA_HOME/bin/java
> JAVA_HEAP_MAX=-Xmx1000m
>
> # check envvars which might override default args
> if [ "$NUTCH_HEAPSIZE" != "" ]; then
>   #echo "run with heapsize $NUTCH_HEAPSIZE"
>   JAVA_HEAP_MAX="-Xmx""$NUTCH_HEAPSIZE""m"
>   #echo $JAVA_HEAP_MAX
> fi
>
> # CLASSPATH initially contains $NUTCH_CONF_DIR, or defaults to $NUTCH_HOME/conf
> CLASSPATH=${NUTCH_CONF_DIR:=$NUTCH_HOME/conf}
> CLASSPATH=${CLASSPATH}:$JAVA_HOME/lib/tools.jar
>
> # so that filenames w/ spaces are handled correctly in loops below
> IFS=
>
> # for developers, add plugins, job & test code to CLASSPATH
> if [ -d "$NUTCH_HOME/build/plugins" ]; then
>   CLASSPATH=${CLASSPATH}:$NUTCH_HOME/build
> fi
> if [ -d "$NUTCH_HOME/build/test/classes" ]; then
>   CLASSPATH=${CLASSPATH}:$NUTCH_HOME/build/test/classes
> fi
>
> if [ $IS_CORE == 0 ]
> then
>   for f in $NUTCH_HOME/build/nutch-*.job; do
>     CLASSPATH=${CLASSPATH}:$f;
>   done
>
>   # for releases, add Nutch job to CLASSPATH
>   for f in $NUTCH_HOME/nutch-*.job; do
>     CLASSPATH=${CLASSPATH}:$f;
>   done
> else
>   CLASSPATH=${CLASSPATH}:$NUTCH_HOME/build/classes
> fi
>
> # add plugins to classpath
> if [ -d "$NUTCH_HOME/plugins" ]; then
>   CLASSPATH=${NUTCH_HOME}:${CLASSPATH}
> fi
>
> # add libs to CLASSPATH
> for f in $NUTCH_HOME/lib/*.jar; do
>   CLASSPATH=${CLASSPATH}:$f;
> done
>
> for f in $NUTCH_HOME/lib/jetty-ext/*.jar; do
>   CLASSPATH=${CLASSPATH}:$f;
> done
>
> # setup 'java.library.path' for native-hadoop code if necessary
> JAVA_LIBRARY_PATH=''
> if [ -d "${NUTCH_HOME}/build/native" -o -d "${NUTCH_HOME}/lib/native" ]; then
>   JAVA_PLATFORM=`CLASSPATH=${CLASSPATH} ${JAVA} org.apache.hadoop.util.PlatformName | sed -e 's/ /_/g'`
>
>   if [ -d "$NUTCH_HOME/build/native" ]; then
>     JAVA_LIBRARY_PATH=${HADOOP_HOME}/build/native/${JAVA_PLATFORM}/lib
>   fi
>
>   if [ -d "${NUTCH_HOME}/lib/native" ]; then
>     if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
>       JAVA_LIBRARY_PATH=${JAVA_LIBRARY_PATH}:${NUTCH_HOME}/lib/native/${JAVA_PLATFORM}
>     else
>       JAVA_LIBRARY_PATH=${NUTCH_HOME}/lib/native/${JAVA_PLATFORM}
>     fi
>   fi
> fi
>
> # restore ordinary behaviour
> unset IFS
>
> # default log directory & file
> if [ "$NUTCH_LOG_DIR" = "" ]; then
>   NUTCH_LOG_DIR="$NUTCH_HOME/logs"
> fi
> if [ "$NUTCH_LOGFILE" = "" ]; then
>   NUTCH_LOGFILE='hadoop.log'
> fi
> NUTCH_OPTS="$NUTCH_OPTS -Dhadoop.log.dir=$NUTCH_LOG_DIR"
> NUTCH_OPTS="$NUTCH_OPTS -Dhadoop.log.file=$NUTCH_LOGFILE"
>
> if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
>   NUTCH_OPTS="$NUTCH_OPTS -Djava.library.path=$JAVA_LIBRARY_PATH"
> fi
>
> Contents of hadoop.log:
>
> 2009-07-27 18:48:55,345 INFO crawl.Crawl - crawl started in: insidejava
> 2009-07-27 18:48:55,347 INFO crawl.Crawl - rootUrlDir = urls
> 2009-07-27 18:48:55,347 INFO crawl.Crawl - threads = 10
> 2009-07-27 18:48:55,347 INFO crawl.Crawl - depth = 1
> 2009-07-27 18:48:55,779 INFO crawl.Injector - Injector: starting
> 2009-07-27 18:48:55,780 INFO crawl.Injector - Injector: crawlDb: insidejava/crawldb
> 2009-07-27 18:48:55,781 INFO crawl.Injector - Injector: urlDir: urls
> 2009-07-27 18:48:55,781 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
> 2009-07-27 18:48:55,974 WARN mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> 2009-07-27 18:49:19,685 WARN plugin.PluginRepository - Plugins: not a file: url. Can't load plugins from: jar:file:/nutch-1.0/crawler/nutch-1.0.job!/plugins
> 2009-07-27 18:49:19,686 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
> 2009-07-27 18:49:19,686 INFO plugin.PluginRepository - Registered Plugins:
> 2009-07-27 18:49:19,686 INFO plugin.PluginRepository - NONE
> 2009-07-27 18:49:19,686 INFO plugin.PluginRepository - Registered Extension-Points:
> 2009-07-27 18:49:19,686 INFO plugin.PluginRepository - NONE
> 2009-07-27 18:49:19,689 WARN mapred.LocalJobRunner - job_local_0001
> java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.
>         at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:122)
>         at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:57)
>         at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
>         at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
>         at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
>         at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
>         at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:338)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
>
> How can I solve this issue? Any ideas would be appreciated.
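[Editor's note: one thing worth checking in the Scala snippet above is that each `Runtime.getRuntime.exec(Array("bash", "-c", line))` call spawns a separate shell, so any variable a script line sets, including CLASSPATH, dies with that child shell and never reaches the JVM that later calls `Crawl.main`. A minimal, self-contained sketch of this effect, with a made-up variable name and assuming bash is available:]

```scala
object ChildEnvDemo {
  // Export a variable in a child bash process, then check whether the
  // parent JVM can see it. It cannot: environment changes made by a
  // child process never propagate back to its parent.
  def childExportVisibleInParent(): Boolean = {
    val p = Runtime.getRuntime.exec(
      Array("bash", "-c", "export DEMO_CLASSPATH=/tmp/plugins"))
    p.waitFor()
    System.getenv("DEMO_CLASSPATH") != null
  }

  def main(args: Array[String]): Unit =
    println(s"visible in parent: ${childExportVisibleInParent()}")
}
```

[If this is the cause, the classpath would have to be supplied when the JVM itself is launched (for example via `java -cp`), rather than by exec'ing the script from inside the already-running program.]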
I think $nutch/build/plugins is not in your classpath, but I am not sure.

> Thanks in advance..
>
> ----Sailaja
>
> DISCLAIMER
> ==========
> This e-mail may contain privileged and confidential information which is the
> property of Persistent Systems Ltd. It is intended only for the use of the
> individual or entity to which it is addressed. If you are not the intended
> recipient, you are not authorized to read, retain, copy, print, distribute or
> use this message. If you have received this communication in error, please
> notify the sender and delete all copies of this message. Persistent Systems
> Ltd. does not accept any liability for virus infected mails.

--
Doğacan Güney
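[Editor's note: one way to test Doğacan's guess from inside the program is to inspect the `java.class.path` system property before calling `Crawl.main`. A small sketch; the helper takes the classpath string as an argument so it can be checked against any value:]

```scala
import java.io.File

object ClasspathCheck {
  // True if any classpath entry ends with the given suffix,
  // e.g. "build/plugins".
  def contains(classpath: String, suffix: String): Boolean =
    classpath.split(File.pathSeparator).exists(_.endsWith(suffix))

  def main(args: Array[String]): Unit = {
    val cp = System.getProperty("java.class.path")
    println(s"build/plugins on classpath: ${contains(cp, "build/plugins")}")
  }
}
```

[Note also that the PluginRepository warning in the log ("Can't load plugins from: jar:file:...!/plugins") suggests Nutch is resolving its plugin.folders setting relative to the .job file; pointing plugin.folders at an absolute plugin directory in nutch-site.xml is another avenue worth trying.]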