Hi,

On Mon, Jul 20, 2009 at 19:55, Fred Kuipers<[email protected]> wrote:
> Hello all,
>
> I'm attempting to index a large internal website with 6.7M URLs, and I'm
> running into a map failure after the fetch phase (which ran for 5+ days):
>
> 2009-07-20 07:09:23,316 INFO  fetcher.Fetcher - -activeThreads=0
> 2009-07-20 07:09:23,806 WARN  mapred.LocalJobRunner - job_local_0005
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any
> valid local directory for
> taskTracker/jobcache/job_local_0005/attempt_local_0005_m_000000_0/output/file.out
>       at
> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:335)
>       at
> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
>       at
> org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
>       at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1209)
>       at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:867)
>       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
>       at
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
>
> hadoop-site.xml:
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
> <!--
> We need LOTS of memory... And we need to disable the gc overhead limit, per
> this page:
> http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html#par_gc.oom
> -->
> <property>
>  <name>mapred.child.java.opts</name>
>  <value>-Xmx4096m -XX:-UseGCOverheadLimit</value>
> </property>
>
> </configuration>
>
> nutch-site.xml (excluding http.agent directives for brevity):
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
>
> <!-- http.agent properties excluded -->
>
> <property>
>  <name>http.timeout</name>
>  <value>20000</value>
>  <description>The default network timeout, in milliseconds.</description>
> </property>
>
> <property>
>  <name>fetcher.threads.fetch</name>
>  <value>20</value>
>  <description>The number of FetcherThreads the fetcher should use.
>   This also determines the maximum number of requests that are
>   made at once (each FetcherThread handles one connection).</description>
> </property>
>
> <property>
>  <name>fetcher.threads.per.host</name>
>  <value>20</value>
>  <description>This number is the maximum number of threads that
>   should be allowed to access a host at one time.</description>
> </property>
>
> <property>
>  <name>fetcher.server.delay</name>
>  <value>0.1</value>
>  <description>The number of seconds the fetcher will delay between
>  successive requests to the same server.</description>
> </property>
>
> </configuration>
>
> Relevant environment variables:
> NUTCH_JAVA_HOME=/usr/lib/jvm/jre-1.7.0-icedtea.x86_64
> NUTCH_HEAPSIZE=3072
> JAVA_HOME=/usr/lib/jvm/jre-1.7.0-icedtea.x86_64
>
> I ran nutch with the following command/cwd:
> [/home/fred/nutch-1.0]$ bin/nutch crawl urls_wiki_mirror -dir
> crawl_wiki_mirror -threads 3 -depth 1
>
> The seed file in urls_wiki_mirror contains 6,739,469 URLs... Those are the
> only URLs I wish to crawl -- hence depth 1. The configuration I have set up
> lets me crawl this local server with 3 fetcher threads at once, at a rate
> that doesn't overwhelm the server.
>
> I'm using defaults for temp directories. Thus, /tmp/hadoop-fred/ is the temp
> file location. The error message notes the following partial path:
> taskTracker/jobcache/job_local_0005/attempt_local_0005_m_000000_0/output/file.out
>
> I figure that equates to this full path:
> /tmp/hadoop-fred/mapred/local/taskTracker/jobcache/job_local_0005/attempt_local_0005_m_000000_0/output/
>
> The contents of this directory are spill[0-906].out... Nothing else. No
> file.out. There is 68 GB of data in this folder (i.e. it looks to have
> downloaded everything I need)... There is 9+ GB of free space on the
> filesystem -- is it possible this is insufficient?
>

It is possible that you ran out of space; it is also possible that you ran into
a Hadoop bug. From the logs, it doesn't look like a Nutch bug. The failure is in
mergeParts, where the map task merges all of its spill files into a single
file.out, and that merge needs roughly as much additional free local disk as the
spills themselves occupy -- so 9 GB free against 68 GB of spill data is very
likely not enough.
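
If space is the problem, one workaround is to point Hadoop's local scratch
space at a partition with more room than /tmp. A minimal sketch for
hadoop-site.xml, assuming /data/hadoop-tmp sits on a larger filesystem (the
path is just an example, not something from your setup):

<property>
 <name>hadoop.tmp.dir</name>
 <value>/data/hadoop-tmp</value>
 <description>Base directory for Hadoop's local files. mapred.local.dir
 defaults to ${hadoop.tmp.dir}/mapred/local, so this also moves the spill
 and merge output off /tmp.</description>
</property>

This won't rescue the job that already failed, but it should keep the final
merge from running out of room on a later run.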

> So, what happened? Is there a way I can recover without re-crawling?
>

You can try this tool:

http://issues.apache.org/jira/browse/NUTCH-451

There is no guarantee that it will work, though.

> I am running on a Fedora Core 8 virtual machine with two cores, 4 GB memory.
>
> Let me know if any more information is needed...
>

Can you try crawling in smaller units? I.e., crawl the first 1M docs, then
the second 1M docs, and so on.
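
One way to do that is to skip the one-shot crawl command and run the steps
yourself, using generate -topN to cap each round. A rough sketch, reusing the
crawl_wiki_mirror/urls_wiki_mirror names from your command (the -topN value is
just an example):

bin/nutch inject crawl_wiki_mirror/crawldb urls_wiki_mirror
# repeat the next three commands until everything has been fetched:
bin/nutch generate crawl_wiki_mirror/crawldb crawl_wiki_mirror/segments -topN 1000000
s=`ls -d crawl_wiki_mirror/segments/* | tail -1`
bin/nutch fetch $s -threads 3
bin/nutch updatedb crawl_wiki_mirror/crawldb $s

Each round fetches at most 1M URLs, so the intermediate map output stays far
smaller than a single 6.7M-URL fetch. Since you only want the seed URLs, you
can also set db.update.additions.allowed to false in nutch-site.xml so that
updatedb doesn't add newly discovered links to the crawldb.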

> Thanks,
> /FjK
>



-- 
Doğacan Güney
