On Mon, Jul 27, 2009 at 16:47, Sailaja Dhiviti <sailaja_dhiv...@persistent.co.in> wrote:

> Hi,
>
> I am trying to run the crawl from inside a Scala program without using the
> bin/nutch command. I am setting all the environment variables that nutch.sh
> sets when the crawl is run through bin/nutch, and then calling
> Crawl.main(params). I get the following error:
>
>     Exception in thread "main" java.io.IOException: Job failed!
>             at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1232)
>             at org.apache.nutch.crawl.Injector.inject(Injector.java:160)
>             at org.apache.nutch.crawl.Crawl.main(Crawl.java:113)
>
> Here is the code I am trying to write:
>
>     import scala.io.Source
>
>     for (line <- Source.fromFile("/root/classpaths.sh").getLines) {
>       if (line != null) {
>         val cmd = Array("bash", "-c", line)
>         val checkingCrawl: Process = Runtime.getRuntime().exec(cmd)
>         checkingCrawl.waitFor()  // wait for each child shell to exit
>       }
>     }
>
>     val params = Array("urls", "-dir", "insidejava", "-depth", "1")
>     org.apache.nutch.crawl.Crawl.main(params)
>
> Contents of classpaths.sh:
>
>     JAVA=$JAVA_HOME/bin/java
>     JAVA_HEAP_MAX=-Xmx1000m
>
>     # check envvars which might override default args
>     if [ "$NUTCH_HEAPSIZE" != "" ]; then
>       #echo "run with heapsize $NUTCH_HEAPSIZE"
>       JAVA_HEAP_MAX="-Xmx""$NUTCH_HEAPSIZE""m"
>       #echo $JAVA_HEAP_MAX
>     fi
>
>     # CLASSPATH initially contains $NUTCH_CONF_DIR, or defaults to $NUTCH_HOME/conf
>     CLASSPATH=${NUTCH_CONF_DIR:=$NUTCH_HOME/conf}
>     CLASSPATH=${CLASSPATH}:$JAVA_HOME/lib/tools.jar
>
>     # so that filenames w/ spaces are handled correctly in loops below
>     IFS=
>
>     # for developers, add plugins, job & test code to CLASSPATH
>     if [ -d "$NUTCH_HOME/build/plugins" ]; then
>       CLASSPATH=${CLASSPATH}:$NUTCH_HOME/build
>     fi
>     if [ -d "$NUTCH_HOME/build/test/classes" ]; then
>       CLASSPATH=${CLASSPATH}:$NUTCH_HOME/build/test/classes
>     fi
>
>     if [ $IS_CORE == 0 ]
>     then
>       for f in $NUTCH_HOME/build/nutch-*.job; do
>         CLASSPATH=${CLASSPATH}:$f;
>       done
>
>       # for releases, add Nutch job to CLASSPATH
>       for f in $NUTCH_HOME/nutch-*.job; do
>         CLASSPATH=${CLASSPATH}:$f;
>       done
>     else
>       CLASSPATH=${CLASSPATH}:$NUTCH_HOME/build/classes
>     fi
>
>     # add plugins to classpath
>     if [ -d "$NUTCH_HOME/plugins" ]; then
>       CLASSPATH=${NUTCH_HOME}:${CLASSPATH}
>     fi
>
>     # add libs to CLASSPATH
>     for f in $NUTCH_HOME/lib/*.jar; do
>       CLASSPATH=${CLASSPATH}:$f;
>     done
>
>     for f in $NUTCH_HOME/lib/jetty-ext/*.jar; do
>       CLASSPATH=${CLASSPATH}:$f;
>     done
>
>     # setup 'java.library.path' for native-hadoop code if necessary
>     JAVA_LIBRARY_PATH=''
>     if [ -d "${NUTCH_HOME}/build/native" -o -d "${NUTCH_HOME}/lib/native" ]; then
>       JAVA_PLATFORM=`CLASSPATH=${CLASSPATH} ${JAVA} org.apache.hadoop.util.PlatformName | sed -e 's/ /_/g'`
>
>       if [ -d "$NUTCH_HOME/build/native" ]; then
>         JAVA_LIBRARY_PATH=${HADOOP_HOME}/build/native/${JAVA_PLATFORM}/lib
>       fi
>
>       if [ -d "${NUTCH_HOME}/lib/native" ]; then
>         if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
>           JAVA_LIBRARY_PATH=${JAVA_LIBRARY_PATH}:${NUTCH_HOME}/lib/native/${JAVA_PLATFORM}
>         else
>           JAVA_LIBRARY_PATH=${NUTCH_HOME}/lib/native/${JAVA_PLATFORM}
>         fi
>       fi
>     fi
>
>     # restore ordinary behaviour
>     unset IFS
>
>     # default log directory & file
>     if [ "$NUTCH_LOG_DIR" = "" ]; then
>       NUTCH_LOG_DIR="$NUTCH_HOME/logs"
>     fi
>     if [ "$NUTCH_LOGFILE" = "" ]; then
>       NUTCH_LOGFILE='hadoop.log'
>     fi
>     NUTCH_OPTS="$NUTCH_OPTS -Dhadoop.log.dir=$NUTCH_LOG_DIR"
>     NUTCH_OPTS="$NUTCH_OPTS -Dhadoop.log.file=$NUTCH_LOGFILE"
>
>     if [ "x$JAVA_LIBRARY_PATH" != "x" ]; then
>       NUTCH_OPTS="$NUTCH_OPTS -Djava.library.path=$JAVA_LIBRARY_PATH"
>     fi
>
> Contents of hadoop.log:
>
>     2009-07-27 18:48:55,345 INFO crawl.Crawl - crawl started in: insidejava
>     2009-07-27 18:48:55,347 INFO crawl.Crawl - rootUrlDir = urls
>     2009-07-27 18:48:55,347 INFO crawl.Crawl - threads = 10
>     2009-07-27 18:48:55,347 INFO crawl.Crawl - depth = 1
>     2009-07-27 18:48:55,779 INFO crawl.Injector - Injector: starting
>     2009-07-27 18:48:55,780 INFO crawl.Injector - Injector: crawlDb: insidejava/crawldb
>     2009-07-27 18:48:55,781 INFO crawl.Injector - Injector: urlDir: urls
>     2009-07-27 18:48:55,781 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
>     2009-07-27 18:48:55,974 WARN mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
>     2009-07-27 18:49:19,685 WARN plugin.PluginRepository - Plugins: not a file: url. Can't load plugins from: jar:file:/nutch-1.0/crawler/nutch-1.0.job!/plugins
>     2009-07-27 18:49:19,686 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
>     2009-07-27 18:49:19,686 INFO plugin.PluginRepository - Registered Plugins:
>     2009-07-27 18:49:19,686 INFO plugin.PluginRepository -         NONE
>     2009-07-27 18:49:19,686 INFO plugin.PluginRepository - Registered Extension-Points:
>     2009-07-27 18:49:19,686 INFO plugin.PluginRepository -         NONE
>     2009-07-27 18:49:19,689 WARN mapred.LocalJobRunner - job_local_0001
>     java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.
>             at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:122)
>             at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:57)
>             at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
>             at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
>             at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
>             at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
>             at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
>             at org.apache.hadoop.mapred.MapTask.run(MapTask.java:338)
>             at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
>
> How can I solve this issue? Any ideas would be appreciated.
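One detail worth noting about the code above: running each line of classpaths.sh through Runtime.exec cannot change the environment of the JVM itself. Every exec call spawns a separate child shell, and anything exported there disappears when that shell exits, so the CLASSPATH the script builds never reaches Crawl.main. A minimal sketch of this behavior (the EnvDemo object and DEMO_CP variable are made-up names for illustration):

```scala
// Demonstrates that variables exported in a child shell do not
// propagate back to the JVM that spawned it.
object EnvDemo {
  def main(args: Array[String]): Unit = {
    // Exactly like exec-ing one line of classpaths.sh:
    val p = Runtime.getRuntime().exec(
      Array("bash", "-c", "export DEMO_CP=/tmp/foo"))
    p.waitFor()
    // The exported variable lived only in the now-finished child shell:
    println(System.getenv("DEMO_CP"))  // prints "null"
  }
}
```

To actually influence the classpath of the crawling JVM, the entries have to be supplied before the JVM starts, e.g. with `java -cp ...` on the command line that launches the Scala program.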
I think $nutch/build/plugins is not in your classpath, but I am not sure.

> Thanks in advance.
>
> ----Sailaja
>
> DISCLAIMER
> ==========
> This e-mail may contain privileged and confidential information which is the
> property of Persistent Systems Ltd. It is intended only for the use of the
> individual or entity to which it is addressed. If you are not the intended
> recipient, you are not authorized to read, retain, copy, print, distribute or
> use this message. If you have received this communication in error, please
> notify the sender and delete all copies of this message. Persistent Systems
> Ltd. does not accept any liability for virus infected mails.

-- 
Doğacan Güney
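If build/plugins really is missing from the classpath, one possible workaround (only a sketch, not verified against this setup) is to point Nutch at the plugin directory with an absolute path via the `plugin.folders` property in nutch-site.xml. By default this property is the relative path "plugins", which Nutch resolves against the classpath, which is exactly what fails here. The path below is a placeholder for the real installation root:

```xml
<!-- nutch-site.xml: /path/to/nutch is a placeholder; adjust before use. -->
<property>
  <name>plugin.folders</name>
  <value>/path/to/nutch/build/plugins</value>
  <description>Absolute path so the plugins are found even when the
  crawl is started from a program rather than through bin/nutch.</description>
</property>
```

With an absolute value, the PluginRepository no longer needs the plugins directory to be on the classpath at all.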