The attached patches for Generator.java and Injector.java let you choose where the temporary working directories are created, via the mapred.temp.dir property (default "."). If that property is set to an absolute path, Nutch passes the full path of these temporary directories from one job to the next, which seems to fix the "No input directories" issue when using a local filesystem with multiple task trackers.
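To illustrate what the patched expression builds, here is a minimal standalone sketch. It is not Nutch code: java.util.Properties stands in for NutchConf, the class and method names are made up for the example, and /shared/nutch/tmp is a hypothetical path that would need to exist on every host.

```java
import java.io.File;
import java.util.Properties;
import java.util.Random;

public class TempDirSketch {
    // Mirrors the patched expression: take the base directory from the
    // configuration (default "."), then append "/<prefix><random int>".
    // Properties is only a stand-in for NutchConf in this sketch.
    static File tempDir(Properties conf, String prefix, Random random) {
        String base = conf.getProperty("mapred.temp.dir", ".");
        return new File(base + "/" + prefix
                + Integer.toString(random.nextInt(Integer.MAX_VALUE)));
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // Hypothetical shared location; substitute a path valid on all hosts.
        conf.setProperty("mapred.temp.dir", "/shared/nutch/tmp");
        File configured = tempDir(conf, "generate-temp-", new Random());
        System.out.println(configured.getPath()
                + " absolute=" + configured.isAbsolute());

        // Without the property the "." default applies and the path stays
        // relative, so the behaviour is the same as before the patch.
        File fallback = tempDir(new Properties(), "inject-temp-", new Random());
        System.out.println(fallback.getPath()
                + " absolute=" + fallback.isAbsolute());
    }
}
```

Note that the patch only helps when mapred.temp.dir is actually set to an absolute path; with the default the temporary directory name is still relative.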
On Mon, 2005-11-07 at 09:57 -0500, Rod Taylor wrote:
> On Fri, 2005-11-04 at 20:41 -0800, Doug Cutting wrote:
> > Rod Taylor wrote:
> > > Here you go. local filesystem and a single job tracker on another
> > > machine. When the tasktracker and jobtracker are on the same box there
> > > isn't a problem. When they are on different machines it runs into
> > > issues.
> > >
> > > This is using mapred.local.dir on the local machine (not sharedd between
> > > sbider4 and sbider5):
> >
> > > parsing /home/sitesell/localt/taskTracker/task_m_o59djj/job.xml
> > > [Fatal Error] :-1:-1: Premature end of file.
> >
> > What is mapred.system.dir? That must be shared. Also, filenames you
> > pass to commands must be pathnames that work on all hosts.
>
> I managed to get past all of the initial injection problems by running a
> local crawl (no jobtracker) which created the crawldb/current/part-00000
> files. So I was able to do a real inject, with jobtracker, for all of
> the urls system wide without any complaints about files or directories
> not existing.
>
> Now, when trying to run a generate with a jobtracker it seems to have a
> hard time finding the temporary working areas from one job to the next.
> I cannot figure out where it is creating generate-temp-908680235. With
> NDFS it would be /user/$USER/
>
> <-- nutch generate -->
> 051107 091256 topN: 10000
> 051107 091256 Generator: starting
> 051107 091256 Generator:
> segment: /opt/sitesell/sbider_data/test2/segments/20051107091256
> 051107 091256 Generator: Selecting most-linked urls due for fetch.
> 051107 091256 parsing file:/opt/nutch-0.8_7/conf/nutch-default.xml
> 051107 091256 parsing file:/opt/nutch-0.8_7/conf/mapred-default.xml
> 051107 091256 parsing file:/opt/nutch-0.8_7/conf/nutch-site.xml
> 051107 091256 parsing file:/opt/nutch-0.8_7/conf/nutch-default.xml
> 051107 091256 parsing file:/opt/nutch-0.8_7/conf/nutch-site.xml
> 051107 091256 Client connection to 192.168.100.14:5464: starting
> 051107 091256 Running job: job_xhvq9b
> 051107 091258 map 0%
> 051107 091300 map 5%
> 051107 091303 map 16%
> 051107 091305 map 21%
> 051107 091306 map 26%
> 051107 091308 map 32%
> 051107 091309 map 37%
> 051107 091312 map 47%
> 051107 091315 map 58%
> 051107 091318 map 68%
> 051107 091320 map 74%
> 051107 091321 map 79%
> 051107 091324 map 89%
> 051107 091327 map 100%
> 051107 091330 reduce 5%
> 051107 091332 reduce 11%
> 051107 091333 reduce 16%
> 051107 091335 reduce 21%
> 051107 091337 reduce 26%
> 051107 091339 reduce 37%
> 051107 091342 reduce 47%
> 051107 091344 reduce 53%
> 051107 091345 reduce 58%
> 051107 091347 reduce 63%
> 051107 091348 reduce 68%
> 051107 091351 reduce 79%
> 051107 091354 reduce 89%
> 051107 091357 reduce 100%
> 051107 091359 Job complete: job_xhvq9b
> 051107 091359 Generator: Partitioning selected urls by host, for
> politeness.
> 051107 091359 parsing file:/opt/nutch-0.8_7/conf/nutch-default.xml
> 051107 091359 parsing file:/opt/nutch-0.8_7/conf/mapred-default.xml
> 051107 091359 parsing file:/opt/nutch-0.8_7/conf/nutch-site.xml
> Exception in thread "main" java.io.IOException: No input directories
> specified in: NutchConf: nutch-default.xml ,
> mapred-default.xml , /home/sitesell/local/jobTracker/job_h22fvi.xml ,
> nutch-site.xml
> at org.apache.nutch.ipc.Client.call(Client.java:294)
> at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
> at $Proxy0.submitJob(Unknown Source)
> at
> org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:259)
> at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288)
> at org.apache.nutch.crawl.Generator.generate(Generator.java:213)
> at org.apache.nutch.crawl.Generator.main(Generator.java:258)
>
> [EMAIL PROTECTED] sbider_data]$
> cat /home/sitesell/local/jobTracker/job_h22fvi.xml | grep input
> <property><name>mapred.input.format.class</name><value>org.apache.nutch.mapred.SequenceFileInputFormat</value></property>
> <property><name>mapred.input.dir</name><value>generate-temp-908680235</value></property>
> <property><name>mapred.input.value.class</name><value>org.apache.nutch.io.UTF8</value></property>
> <property><name>mapred.input.key.class</name><value>org.apache.nutch.crawl.CrawlDatum</value></property>
>
> --
> Rod Taylor <[EMAIL PROTECTED]>
>
>
--
Rod Taylor <[EMAIL PROTECTED]>
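The mapred.input.dir value in the quoted job.xml is a bare relative name. One plausible reading of the failure, sketched below, is that each process resolves such a name against its own working directory, so the client and the job tracker end up looking in different places. This is an illustration only: the two working directories are hypothetical, and the helper is not part of Nutch.

```java
import java.io.File;

public class RelativePathSketch {
    // Resolve a path the way a process started in workingDir would see it:
    // absolute paths are kept, relative ones are taken relative to workingDir.
    static File resolve(File workingDir, String path) {
        File f = new File(path);
        return f.isAbsolute() ? f : new File(workingDir, path);
    }

    public static void main(String[] args) {
        // Relative value as it appears in job_h22fvi.xml.
        String relative = "generate-temp-908680235";

        // Hypothetical working directories for the client and the job tracker.
        File clientCwd = new File("/opt/sitesell/sbider_data");
        File trackerCwd = new File("/home/sitesell");

        // The same relative name maps to two different locations.
        System.out.println(resolve(clientCwd, relative));
        System.out.println(resolve(trackerCwd, relative));

        // An absolute name is unaffected by the working directory
        // (the base directory here is again hypothetical).
        String absolute = "/shared/nutch/tmp/generate-temp-908680235";
        System.out.println(resolve(clientCwd, absolute)
                .equals(resolve(trackerCwd, absolute)));
    }
}
```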
*** ./src/java/org/apache/nutch/crawl/Generator.java.orig 2005-10-31 23:35:20.000000000 -0500
--- ./src/java/org/apache/nutch/crawl/Generator.java 2005-11-07 17:06:46.000000000 -0500
***************
*** 155,161 ****
      throws IOException {
  
      File tempDir =
!       new File("generate-temp-"+
                 Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));
  
      File segment = new File(segments, getDate());
--- 155,162 ----
      throws IOException {
  
      File tempDir =
!       new File(NutchConf.get().get("mapred.temp.dir", ".") +
!                "/generate-temp-"+
                 Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));
  
      File segment = new File(segments, getDate());
*** ./src/java/org/apache/nutch/crawl/Injector.java.orig 2005-09-24 19:29:03.000000000 -0400
--- ./src/java/org/apache/nutch/crawl/Injector.java 2005-11-07 17:34:37.000000000 -0500
***************
*** 84,90 ****
      LOG.info("Injector: urlDir: " + urlDir);
  
      File tempDir =
!       new File("inject-temp-"+
                 Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));
  
      // map text input file to a <url,CrawlDatum> file
--- 84,91 ----
      LOG.info("Injector: urlDir: " + urlDir);
  
      File tempDir =
!       new File(NutchConf.get().get("mapred.temp.dir", ".") +
!                "/inject-temp-"+
                 Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));
  
      // map text input file to a <url,CrawlDatum> file