Hello all,
I'm attempting to index a large internal website with 6.7 million URLs, and I'm
running into a map task failure after the fetch phase (which ran for 5+ days):
2009-07-20 07:09:23,316 INFO fetcher.Fetcher - -activeThreads=0
2009-07-20 07:09:23,806 WARN mapred.LocalJobRunner - job_local_0005
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_local_0005/attempt_local_0005_m_000000_0/output/file.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:335)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
        at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1209)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:867)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
hadoop-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<!--
We need LOTS of memory... And we need to disable the gc overhead limit,
per this page:
http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html#par_gc.oom
-->
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx4096m -XX:-UseGCOverheadLimit</value>
</property>
</configuration>
nutch-site.xml (excluding http.agent directives for brevity):
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<!-- http.agent properties excluded -->
<configuration>
<property>
<name>http.timeout</name>
<value>20000</value>
<description>The default network timeout, in milliseconds.</description>
</property>
<property>
<name>fetcher.threads.fetch</name>
<value>20</value>
<description>The number of FetcherThreads the fetcher should use.
This also determines the maximum number of requests that are
made at once (each FetcherThread handles one connection).</description>
</property>
<property>
<name>fetcher.threads.per.host</name>
<value>20</value>
<description>This number is the maximum number of threads that
should be allowed to access a host at one time.</description>
</property>
<property>
<name>fetcher.server.delay</name>
<value>0.1</value>
<description>The number of seconds the fetcher will delay between
successive requests to the same server.</description>
</property>
</configuration>
Relevant environment variables:
NUTCH_JAVA_HOME=/usr/lib/jvm/jre-1.7.0-icedtea.x86_64
NUTCH_HEAPSIZE=3072
JAVA_HOME=/usr/lib/jvm/jre-1.7.0-icedtea.x86_64
I ran Nutch with the following command/cwd:
[/home/fred/nutch-1.0]$ bin/nutch crawl urls_wiki_mirror -dir crawl_wiki_mirror -threads 3 -depth 1
The seed file in urls_wiki_mirror contains 6,739,469 URLs. Those are the
only URLs I wish to crawl -- hence depth 1. The configuration I have set
up lets me crawl this local server with 3 fetchers at once, at a rate
that doesn't overwhelm the server.
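For reference (and because I ask below about recovering without a full
re-crawl), my understanding from the step-by-step tutorial is that the
crawl command roughly wraps the individual commands below. The paths are
what I'd expect given -dir crawl_wiki_mirror, and <segment> stands for
the timestamped directory under segments/ -- please correct me if I have
the sequence or arguments wrong:

# inject the seed URLs into the crawldb
bin/nutch inject crawl_wiki_mirror/crawldb urls_wiki_mirror
# generate a fetch list (one segment per generate/fetch round)
bin/nutch generate crawl_wiki_mirror/crawldb crawl_wiki_mirror/segments
# fetch the generated segment
bin/nutch fetch crawl_wiki_mirror/segments/<segment> -threads 3
# fold the fetch results back into the crawldb
bin/nutch updatedb crawl_wiki_mirror/crawldb crawl_wiki_mirror/segments/<segment>
# build the link database and the Lucene index
bin/nutch invertlinks crawl_wiki_mirror/linkdb -dir crawl_wiki_mirror/segments
bin/nutch index crawl_wiki_mirror/indexes crawl_wiki_mirror/crawldb crawl_wiki_mirror/linkdb crawl_wiki_mirror/segments/<segment>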
I'm using defaults for temp directories. Thus, /tmp/hadoop-fred/ is the
temp file location. The error message notes the following partial path:
taskTracker/jobcache/job_local_0005/attempt_local_0005_m_000000_0/output/file.out
I figure that equates to this full path:
/tmp/hadoop-fred/mapred/local/taskTracker/jobcache/job_local_0005/attempt_local_0005_m_000000_0/output/
The contents of this directory are spill[0-906].out files... nothing else.
No file.out. There is 68 GB of data in this folder (i.e. it looks to have
downloaded everything I need)... There is 9+ GB of free space on the
filesystem -- is it possible this is insufficient?
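If 9 GB of headroom really is too little for merging those 900+ spill
files into file.out, I'm thinking of pointing Hadoop's temp/local
directories at a larger volume before retrying, with something like this
in hadoop-site.xml (/bigdisk is just a placeholder path on my side; as
far as I can tell mapred.local.dir defaults to
${hadoop.tmp.dir}/mapred/local, so overriding hadoop.tmp.dir alone may
already be enough):

<property>
  <name>hadoop.tmp.dir</name>
  <value>/bigdisk/hadoop-${user.name}</value>
  <description>Base for temporary directories (placeholder path on a larger disk).</description>
</property>
<property>
  <name>mapred.local.dir</name>
  <value>/bigdisk/hadoop-${user.name}/mapred/local</value>
  <description>Local directory where intermediate map output (the spill files and the merged file.out) is written.</description>
</property>

Does that sound like the right knob, or am I looking in the wrong place?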
So, what happened? Is there a way I can recover without re-crawling?
I am running on a Fedora Core 8 virtual machine with two cores and 4 GB of memory.
Let me know if any more information is needed...
Thanks,
/FjK