On Mon, Jul 27, 2009 at 16:47, Sailaja Dhiviti <sailaja_dhiv...@persistent.co.in> wrote:

> Hi,
>
> I am trying to run the crawl from inside a Scala program without using the
> bin/nutch command. I am setting all the environment variables that nutch.sh
> sets when the crawl is run through bin/nutch, and then calling
> Crawl.main(params). I get the following error:
>
>     Exception in thread "main" java.io.IOException: Job failed!
>             at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
>             at org.apache.nutch.crawl.Injector.inject(Injector.java:160)
>             at org.apache.nutch.crawl.Crawl.main(Crawl.java:113)
>
> Here is the code I am trying to write:
>
>     import scala.io.Source
>
>     for (line <- Source.fromFile("/root/classpaths.sh").getLines) {
>       if (line != null) {
>         val cmd = Array("bash", "-c", line)
>         val checkingCrawl: Process = Runtime.getRuntime().exec(cmd)
>         checkingCrawl.waitFor()  // wait for each child shell to exit
>       }
>     }
>
>     val params = Array("urls", "-dir", "insidejava", "-depth", "1")
>     org.apache.nutch.crawl.Crawl.main(params)
>
> Contents of classpaths.sh:
>
>     JAVA=$JAVA_HOME/bin/java
>     JAVA_HEAP_MAX=-Xmx1000m
>
>     # check envvars which might override default args
>     if [ "$NUTCH_HEAPSIZE" != "" ]; then
>       #echo "run with heapsize $NUTCH_HEAPSIZE"
>       JAVA_HEAP_MAX="-Xmx""$NUTCH_HEAPSIZE""m"
>       #echo $JAVA_HEAP_MAX
>     fi
>
>     # CLASSPATH initially contains $NUTCH_CONF_DIR, or defaults to $NUTCH_HOME/conf
>     CLASSPATH=${NUTCH_CONF_DIR:=$NUTCH_HOME/conf}
>     CLASSPATH=${CLASSPATH}:$JAVA_HOME/lib/tools.jar
>
>     # so that filenames w/ spaces are handled correctly in loops below
>     IFS=
>
>     # for developers, add plugins, job & test code to CLASSPATH
>     if [ -d "$NUTCH_HOME/build/plugins" ]; then
>       CLASSPATH=${CLASSPATH}:$NUTCH_HOME/build
>     fi
>     if [ -d "$NUTCH_HOME/build/test/classes" ]; then
>       CLASSPATH=${CLASSPATH}:$NUTCH_HOME/build/test/classes
>     fi
>
>     if [ $IS_CORE == 0 ]
>     then
>       for f in $NUTCH_HOME/build/nutch-*.job; do
>         CLASSPATH=${CLASSPATH}:$f;
>       done
>
>       # for releases, add Nutch job to CLASSPATH
>       for f in $NUTCH_HOME/nutch-*.job; do
>         CLASSPATH=${CLASSPATH}:$f;
>       done
>     else
>       CLASSPATH=${CLASSPATH}:$NUTCH_HOME/build/classes
>     fi
>
>     # add plugins to classpath
>     if [ -d "$NUTCH_HOME/plugins" ]; then
>       CLASSPATH=${NUTCH_HOME}:${CLASSPATH}
>     fi
>
>     # add libs to CLASSPATH
>     for f in $NUTCH_HOME/lib/*.jar; do
>       CLASSPATH=${CLASSPATH}:$f;
>     done
>
>     for f in $NUTCH_HOME/lib/jetty-ext/*.jar; do
>       CLASSPATH=${CLASSPATH}:$f;
>     done
>
>     # setup 'java.library.path' for native-hadoop code if necessary
>     JAVA_LIBRARY_PATH=''
>     if [ -d "${NUTCH_HOME}/build/native" -o -d "${NUTCH_HOME}/lib/native" ]; then
>       JAVA_PLATFORM=`CLASSPATH=${CLASSPATH} ${JAVA} org.apache.hadoop.util.PlatformName | sed -e 's/ /_/g'`
>
>       if [ -d "$NUTCH_HOME/build/native" ]; then
>         JAVA_LIBRARY_PATH=${HADOOP_HOME}/build/native/${JAVA_PLATFORM}/lib
>       fi
>
>       if [ -d "${NUTCH_HOME}/lib/native" ]; then
>         if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
>           JAVA_LIBRARY_PATH=${JAVA_LIBRARY_PATH}:${NUTCH_HOME}/lib/native/${JAVA_PLATFORM}
>         else
>           JAVA_LIBRARY_PATH=${NUTCH_HOME}/lib/native/${JAVA_PLATFORM}
>         fi
>       fi
>     fi
>
>     # restore ordinary behaviour
>     unset IFS
>
>     # default log directory & file
>     if [ "$NUTCH_LOG_DIR" = "" ]; then
>       NUTCH_LOG_DIR="$NUTCH_HOME/logs"
>     fi
>     if [ "$NUTCH_LOGFILE" = "" ]; then
>       NUTCH_LOGFILE='hadoop.log'
>     fi
>     NUTCH_OPTS="$NUTCH_OPTS -Dhadoop.log.dir=$NUTCH_LOG_DIR"
>     NUTCH_OPTS="$NUTCH_OPTS -Dhadoop.log.file=$NUTCH_LOGFILE"
>
>     if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
>       NUTCH_OPTS="$NUTCH_OPTS -Djava.library.path=$JAVA_LIBRARY_PATH"
>     fi
>
> Contents of hadoop.log:
>
>     2009-07-27 18:48:55,345 INFO crawl.Crawl - crawl started in: insidejava
>     2009-07-27 18:48:55,347 INFO crawl.Crawl - rootUrlDir = urls
>     2009-07-27 18:48:55,347 INFO crawl.Crawl - threads = 10
>     2009-07-27 18:48:55,347 INFO crawl.Crawl - depth = 1
>     2009-07-27 18:48:55,779 INFO crawl.Injector - Injector: starting
>     2009-07-27 18:48:55,780 INFO crawl.Injector - Injector: crawlDb: insidejava/crawldb
>     2009-07-27 18:48:55,781 INFO crawl.Injector - Injector: urlDir: urls
>     2009-07-27 18:48:55,781 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
>     2009-07-27 18:48:55,974 WARN mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
>     2009-07-27 18:49:19,685 WARN plugin.PluginRepository - Plugins: not a file: url. Can't load plugins from: jar:file:/nutch-1.0/crawler/nutch-1.0.job!/plugins
>     2009-07-27 18:49:19,686 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
>     2009-07-27 18:49:19,686 INFO plugin.PluginRepository - Registered Plugins:
>     2009-07-27 18:49:19,686 INFO plugin.PluginRepository -         NONE
>     2009-07-27 18:49:19,686 INFO plugin.PluginRepository - Registered Extension-Points:
>     2009-07-27 18:49:19,686 INFO plugin.PluginRepository -         NONE
>     2009-07-27 18:49:19,689 WARN mapred.LocalJobRunner - job_local_0001
>     java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.
>             at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:122)
>             at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:57)
>             at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
>             at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
>             at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
>             at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
>             at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
>             at org.apache.hadoop.mapred.MapTask.run(MapTask.java:338)
>             at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
>
> How can I solve this issue? Any ideas would be appreciated.
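One detail worth noting about the code above: running each line of classpaths.sh through Runtime.exec cannot change the environment of the JVM itself. Every exec call spawns a separate child shell, and anything exported there disappears when that shell exits, so the CLASSPATH the script builds never reaches Crawl.main. A minimal sketch of this behavior (the EnvDemo object and DEMO_CP variable are made-up names for illustration):

```scala
// Demonstrates that variables exported in a child shell do not
// propagate back to the JVM that spawned it.
object EnvDemo {
  def main(args: Array[String]): Unit = {
    // Exactly like exec-ing one line of classpaths.sh:
    val p = Runtime.getRuntime().exec(
      Array("bash", "-c", "export DEMO_CP=/tmp/foo"))
    p.waitFor()
    // The exported variable lived only in the now-finished child shell:
    println(System.getenv("DEMO_CP"))  // prints "null"
  }
}
```

To actually influence the classpath of the crawling JVM, the entries have to be supplied before the JVM starts, e.g. with `java -cp ...` on the command line that launches the Scala program.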
I think $nutch/build/plugins is not in your classpath, but I am not sure.

> Thanks in advance.
>
> ----Sailaja
>
> DISCLAIMER
> ==========
> This e-mail may contain privileged and confidential information which is the
> property of Persistent Systems Ltd. It is intended only for the use of the
> individual or entity to which it is addressed. If you are not the intended
> recipient, you are not authorized to read, retain, copy, print, distribute or
> use this message. If you have received this communication in error, please
> notify the sender and delete all copies of this message. Persistent Systems
> Ltd. does not accept any liability for virus infected mails.

-- 
Doğacan Güney
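If build/plugins really is missing from the classpath, one possible workaround (only a sketch, not verified against this setup) is to point Nutch at the plugin directory with an absolute path via the `plugin.folders` property in nutch-site.xml. By default this property is the relative path "plugins", which Nutch resolves against the classpath, which is exactly what fails here. The path below is a placeholder for the real installation root:

```xml
<!-- nutch-site.xml: /path/to/nutch is a placeholder; adjust before use. -->
<property>
  <name>plugin.folders</name>
  <value>/path/to/nutch/build/plugins</value>
  <description>Absolute path so the plugins are found even when the
  crawl is started from a program rather than through bin/nutch.</description>
</property>
```

With an absolute value, the PluginRepository no longer needs the plugins directory to be on the classpath at all.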