Hi,

On Mon, Jul 20, 2009 at 19:55, Fred Kuipers <[email protected]> wrote:
> Hello all,
>
> I'm attempting to index a large internal website with 6.7 million URLs, and
> I'm running into a map failure after fetching (for 5+ days):
>
> 2009-07-20 07:09:23,316 INFO  fetcher.Fetcher - -activeThreads=0
> 2009-07-20 07:09:23,806 WARN  mapred.LocalJobRunner - job_local_0005
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any
> valid local directory for
> taskTracker/jobcache/job_local_0005/attempt_local_0005_m_000000_0/output/file.out
>         at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:335)
>         at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
>         at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1209)
>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:867)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
>
> hadoop-site.xml:
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
>   <!--
>   We need LOTS of memory... And we need to disable the GC overhead limit,
>   per this page:
>   http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html#par_gc.oom
>   -->
>   <property>
>     <name>mapred.child.java.opts</name>
>     <value>-Xmx4096m -XX:-UseGCOverheadLimit</value>
>   </property>
> </configuration>
>
> nutch-site.xml (excluding http.agent directives for brevity):
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
>   <!-- http.agent properties excluded -->
>
>   <property>
>     <name>http.timeout</name>
>     <value>20000</value>
>     <description>The default network timeout, in milliseconds.</description>
>   </property>
>
>   <property>
>     <name>fetcher.threads.fetch</name>
>     <value>20</value>
>     <description>The number of FetcherThreads the fetcher should use.
>     This also determines the maximum number of requests that are
>     made at once (each FetcherThread handles one connection).</description>
>   </property>
>
>   <property>
>     <name>fetcher.threads.per.host</name>
>     <value>20</value>
>     <description>This number is the maximum number of threads that
>     should be allowed to access a host at one time.</description>
>   </property>
>
>   <property>
>     <name>fetcher.server.delay</name>
>     <value>0.1</value>
>     <description>The number of seconds the fetcher will delay between
>     successive requests to the same server.</description>
>   </property>
> </configuration>
>
> Relevant environment variables:
> NUTCH_JAVA_HOME=/usr/lib/jvm/jre-1.7.0-icedtea.x86_64
> NUTCH_HEAPSIZE=3072
> JAVA_HOME=/usr/lib/jvm/jre-1.7.0-icedtea.x86_64
>
> I ran Nutch with the following command/cwd:
> [/home/fred/nutch-1.0]$ bin/nutch crawl urls_wiki_mirror -dir crawl_wiki_mirror -threads 3 -depth 1
>
> The seed file in urls_wiki_mirror contains 6739469 URLs... Those are the
> only URLs I wish to crawl -- hence depth 1. The configuration I have set up
> lets me crawl this local server with 3 fetchers at a time, at a rate that
> doesn't overwhelm the server.
>
> I'm using the defaults for temp directories, so /tmp/hadoop-fred/ is the
> temp file location.
> The error message notes the following partial path:
> taskTracker/jobcache/job_local_0005/attempt_local_0005_m_000000_0/output/file.out
>
> I figure that equates to this full path:
> /tmp/hadoop-fred/mapred/local/taskTracker/jobcache/job_local_0005/attempt_local_0005_m_000000_0/output/
>
> The contents of this directory are spill[0-906].out... Nothing else. No
> file.out. There is 68 GB of data in this folder (i.e. it looks to have
> downloaded everything I need)... There is 9+ GB of free space on the
> filesystem -- is it possible this is insufficient?
>

It is possible that you ran out of space; it is also possible that you ran
into a Hadoop bug. From the logs, it doesn't seem like a Nutch bug.
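If it does turn out to be a space problem, you can point Hadoop's temp/local
directories at a partition with more room via hadoop-site.xml. The path below
is only an example -- use whatever filesystem actually has the space:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/hadoop-tmp</value>
</property>

mapred.local.dir, which is where the spill files and the merged file.out get
written, defaults to ${hadoop.tmp.dir}/mapred/local, so moving hadoop.tmp.dir
off /tmp should be enough. Note that the merge step needs roughly the size of
all the spills again while file.out is being written, which would explain why
68 GB of spills plus only 9 GB free was not sufficient.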
> So, what happened? Is there a way I can recover without re-crawling?
>

You can try this tool: http://issues.apache.org/jira/browse/NUTCH-451
There is no guarantee that it will work, though.

> I am running on a Fedora Core 8 virtual machine with two cores and 4 GB of
> memory.
>
> Let me know if any more information is needed...
>

Can you try crawling in smaller units? That is, crawl the first 1M docs, then
the second 1M docs, and so on.
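For example, something along these lines (the file and directory names below
are just placeholders):

# split the seed list into chunks of 1M URLs each
split -l 1000000 urls_wiki_mirror/seeds.txt seeds_chunk_

# crawl each chunk into its own crawl directory
for f in seeds_chunk_*; do
  mkdir urls_$f
  mv $f urls_$f/
  bin/nutch crawl urls_$f -dir crawl_$f -threads 3 -depth 1
done

You end up with one crawl directory per chunk, but each fetch/merge round
stays a manageable size.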
> Thanks,
> /FjK

--
Doğacan Güney