The attached patches for Generator.java and Injector.java let the
temporary directory be set through the mapred.temp.dir property
(default "."). With the property set, Nutch passes the full path of
these temporary directories to the job, which seems to fix the "No input
directories" issue when using a local filesystem with multiple task
trackers.
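
To illustrate why the bare relative name is a problem, here is a minimal
standalone sketch (not part of the patch). The tempDir helper and the
/shared/nutch/tmp path are made up for the example; only the expression
inside the helper mirrors the patched code. A bare "generate-temp-N" has
no parent directory, so every JVM resolves it against its own working
directory, while a name built from a configured base carries its location
with it.

```java
import java.io.File;

class TempDirSketch {
    // Hypothetical helper, not a Nutch API: builds the temp directory the
    // same way the patched code does, from a configured base directory.
    static File tempDir(String baseDir, String prefix, int suffix) {
        return new File(baseDir + "/" + prefix + suffix);
    }

    public static void main(String[] args) {
        // Unpatched behaviour: a relative name with no parent directory.
        File relative = new File("generate-temp-908680235");

        // Patched behaviour, assuming mapred.temp.dir were set to a
        // directory that exists under the same path on every host
        // (the path below is an assumption for illustration only).
        File anchored = tempDir("/shared/nutch/tmp", "generate-temp-", 908680235);

        System.out.println("relative: " + relative.getPath()
            + " (absolute? " + relative.isAbsolute() + ")");
        System.out.println("anchored: " + anchored.getPath()
            + " (parent: " + anchored.getParent() + ")");
    }
}
```

Note that with the default of "." the path is still relative, so
mapred.temp.dir presumably needs to be set to a pathname that works on
all hosts for the patch to help.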

On Mon, 2005-11-07 at 09:57 -0500, Rod Taylor wrote:
> On Fri, 2005-11-04 at 20:41 -0800, Doug Cutting wrote:
> > Rod Taylor wrote:
> > > Here you go. local filesystem and a single job tracker on another
> > > machine. When the tasktracker and jobtracker are on the same box there
> > > isn't a problem. When they are on different machines it runs into
> > > issues.
> > > 
> > > This is using mapred.local.dir on the local machine (not shared between
> > > sbider4 and sbider5):
> > 
> > >         parsing /home/sitesell/localt/taskTracker/task_m_o59djj/job.xml
> > >         [Fatal Error] :-1:-1: Premature end of file.
> > 
> > What is mapred.system.dir?  That must be shared.  Also, filenames you 
> > pass to commands must be pathnames that work on all hosts.
> 
> I managed to get past all of the initial injection problems by running a
> local crawl (no jobtracker) which created the crawldb/current/part-00000
> files. So I was able to do a real inject, with jobtracker, for all of
> the urls system wide without any complaints about files or directories
> not existing.
> 
> Now, when trying to run a generate with a jobtracker it seems to have a
> hard time finding the temporary working areas from one job to the next.
> I cannot figure out where it is creating generate-temp-908680235. With
> NDFS it would be /user/$USER/
> 
> <-- nutch generate -->
> 051107 091256 topN: 10000
> 051107 091256 Generator: starting
> 051107 091256 Generator: segment: /opt/sitesell/sbider_data/test2/segments/20051107091256
> 051107 091256 Generator: Selecting most-linked urls due for fetch.
> 051107 091256 parsing file:/opt/nutch-0.8_7/conf/nutch-default.xml
> 051107 091256 parsing file:/opt/nutch-0.8_7/conf/mapred-default.xml
> 051107 091256 parsing file:/opt/nutch-0.8_7/conf/nutch-site.xml
> 051107 091256 parsing file:/opt/nutch-0.8_7/conf/nutch-default.xml
> 051107 091256 parsing file:/opt/nutch-0.8_7/conf/nutch-site.xml
> 051107 091256 Client connection to 192.168.100.14:5464: starting
> 051107 091256 Running job: job_xhvq9b
> 051107 091258  map 0%
> 051107 091300  map 5%
> 051107 091303  map 16%
> 051107 091305  map 21%
> 051107 091306  map 26%
> 051107 091308  map 32%
> 051107 091309  map 37%
> 051107 091312  map 47%
> 051107 091315  map 58%
> 051107 091318  map 68%
> 051107 091320  map 74%
> 051107 091321  map 79%
> 051107 091324  map 89%
> 051107 091327  map 100%
> 051107 091330  reduce 5%
> 051107 091332  reduce 11%
> 051107 091333  reduce 16%
> 051107 091335  reduce 21%
> 051107 091337  reduce 26%
> 051107 091339  reduce 37%
> 051107 091342  reduce 47%
> 051107 091344  reduce 53%
> 051107 091345  reduce 58%
> 051107 091347  reduce 63%
> 051107 091348  reduce 68%
> 051107 091351  reduce 79%
> 051107 091354  reduce 89%
> 051107 091357  reduce 100%
> 051107 091359 Job complete: job_xhvq9b
> 051107 091359 Generator: Partitioning selected urls by host, for politeness.
> 051107 091359 parsing file:/opt/nutch-0.8_7/conf/nutch-default.xml
> 051107 091359 parsing file:/opt/nutch-0.8_7/conf/mapred-default.xml
> 051107 091359 parsing file:/opt/nutch-0.8_7/conf/nutch-site.xml
> Exception in thread "main" java.io.IOException: No input directories specified in: NutchConf: nutch-default.xml , mapred-default.xml , /home/sitesell/local/jobTracker/job_h22fvi.xml , nutch-site.xml
>         at org.apache.nutch.ipc.Client.call(Client.java:294)
>         at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
>         at $Proxy0.submitJob(Unknown Source)
>         at org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:259)
>         at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288)
>         at org.apache.nutch.crawl.Generator.generate(Generator.java:213)
>         at org.apache.nutch.crawl.Generator.main(Generator.java:258)
> 
> [EMAIL PROTECTED] sbider_data]$ cat /home/sitesell/local/jobTracker/job_h22fvi.xml | grep input
> <property><name>mapred.input.format.class</name><value>org.apache.nutch.mapred.SequenceFileInputFormat</value></property>
> <property><name>mapred.input.dir</name><value>generate-temp-908680235</value></property>
> <property><name>mapred.input.value.class</name><value>org.apache.nutch.io.UTF8</value></property>
> <property><name>mapred.input.key.class</name><value>org.apache.nutch.crawl.CrawlDatum</value></property>
> 
> -- 
> Rod Taylor <[EMAIL PROTECTED]>
> 
> 
-- 
Rod Taylor <[EMAIL PROTECTED]>
*** ./src/java/org/apache/nutch/crawl/Generator.java.orig	2005-10-31 23:35:20.000000000 -0500
--- ./src/java/org/apache/nutch/crawl/Generator.java	2005-11-07 17:06:46.000000000 -0500
***************
*** 155,161 ****
      throws IOException {
  
      File tempDir =
!       new File("generate-temp-"+
                 Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));
  
      File segment = new File(segments, getDate());
--- 155,162 ----
      throws IOException {
  
      File tempDir =
!       new File(NutchConf.get().get("mapred.temp.dir", ".") +
!                "/generate-temp-"+
                 Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));
  
      File segment = new File(segments, getDate());
*** ./src/java/org/apache/nutch/crawl/Injector.java.orig	2005-09-24 19:29:03.000000000 -0400
--- ./src/java/org/apache/nutch/crawl/Injector.java	2005-11-07 17:34:37.000000000 -0500
***************
*** 84,90 ****
      LOG.info("Injector: urlDir: " + urlDir);
  
      File tempDir =
!       new File("inject-temp-"+
                 Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));
  
      // map text input file to a <url,CrawlDatum> file
--- 84,91 ----
      LOG.info("Injector: urlDir: " + urlDir);
  
      File tempDir =
!       new File(NutchConf.get().get("mapred.temp.dir", ".") +
!                "/inject-temp-"+
                 Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));
  
      // map text input file to a <url,CrawlDatum> file
