I believe I got to the bottom of this one.
It was a simple disk-space issue in /tmp, where Hadoop writes its data
by default. It's a little hard to catch because once bin/crawl exits, Hadoop
cleans up its data, so by the time you look at disk usage, /tmp looks like it
has tons of free space.
The giveaway is in logs/hadoop.log and the error there is:
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid
local directory for output/file.out
Afterwards, the segment directory has a crawl_generate directory, but no others.
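For anyone hitting the same wall, the error can be confirmed with a quick grep of logs/hadoop.log. The sketch below fabricates the giveaway line in a scratch file so the grep can be demonstrated self-contained; on a real install you would point grep at logs/hadoop.log itself.

```shell
# Fabricate the giveaway line in a scratch file (stand-in for logs/hadoop.log):
printf '%s\n' \
  'org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/file.out' \
  > /tmp/sample-hadoop.log

# The actual check -- single quotes so the shell does not expand the $
# (mid-pattern, grep treats the $ as a literal character):
grep -c 'DiskChecker$DiskErrorException' /tmp/sample-hadoop.log
```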
To fix, increase the /tmp partition, or (repeating info from a thread I saw
earlier this year) configure Hadoop to write its temp data to another
directory with:
<property>
  <name>hadoop.tmp.dir</name>
  <value>${path/to/hadoop/temp}</value>
</property>
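Whichever directory you point hadoop.tmp.dir at, it's worth confirming that the partition actually has headroom before re-running bin/crawl. A minimal sketch (the directory name here is a stand-in for your real temp location, not a recommendation):

```shell
# Stand-in for the real hadoop.tmp.dir location:
HADOOP_TMP=/tmp/hadoop-tmp-demo
mkdir -p "$HADOOP_TMP"

# Show free space on the partition that would absorb Hadoop's spill files:
df -h "$HADOOP_TMP"
```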
________________________________________
From: Os Tyler
Sent: Tuesday, August 06, 2013 10:30 AM
To: [email protected]
Subject: RE: Fetch "Read time out" and crawl_parse "Input path does not exist"
Thank you, Sebastian.
If a segment is incomplete due to filling up the hard drive, should you delete
that segment?
The other segments I deleted had already been indexed. I have read in a few
threads that once a segment is indexed it can be deleted, is that correct?
Exact error message for the "Read time out":
fetch of http://redacted.com/Talk:JRM_Equipment failed with:
java.net.SocketTimeoutException: Read timed out
-finishing thread FetcherThread, activeThreads=8
And ... there's no 'content' directory in the segment directory after bin/crawl
exits with error.
________________________________________
From: Sebastian Nagel [[email protected]]
Sent: Tuesday, August 06, 2013 10:00 AM
To: [email protected]
Subject: Re: Fetch "Read time out" and crawl_parse "Input path does not exist"
Hi,
> - To clear disk space I removed all segments
And the content is already indexed by Solr?
If not: why didn't you also remove the crawl db and link db?
If segments are removed, you have to fetch all pages again,
whether you start from the seeds or re-fetch the URLs
from the existing crawl db.
> - Ever since, re-running bin/crawl fails at the fetch point with multiple
> "Read time out" errors
Can you send a concrete example (the exact message)?
> (The only directory in segments/xxxxxx is crawl_generate)
There should also be a content/ directory, which holds the raw page content.
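A finished Nutch 1.x segment carries six subdirectories: crawl_generate, crawl_fetch, content, parse_text, parse_data, and crawl_parse. A small sketch that flags what is missing, run here against a fabricated segment that died right after generate, like the one in this thread:

```shell
# Fabricate a segment that only got as far as the generate step:
SEG=$(mktemp -d)
mkdir "$SEG/crawl_generate"

# Report which of the six expected subdirectories are absent:
missing=0
for d in crawl_generate crawl_fetch content parse_text parse_data crawl_parse; do
  [ -d "$SEG/$d" ] || { echo "missing: $d"; missing=$((missing + 1)); }
done
echo "missing: $missing of 6"
```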
Sebastian
On 08/06/2013 03:21 PM, Os Tyler wrote:
> Thanks in advance for any help you can provide.
>
> Not sure exactly what's relevant here, but I have not been able to complete a
> full bin/crawl since I had a "No space left on device" error.
>
> Using nutch-1.6.
> - bin/crawl had been running as expected for 20+ iterations
> - On one run, the disk ran out of space and threw the "No space left on
> device" error.
> - The db.fetch.interval.default is set at 80,000 (less than 24 hours)
> - To clear disk space I removed all segments
> - Ever since, re-running bin/crawl fails at the fetch point with multiple
> "Read time out" errors
> bin/crawl exits when it attempts 'crawl parse' because the crawl_parse,
> crawl_data, etc. directories do not exist. (The only directory in
> segments/xxxxxx is crawl_generate)
>
> What might be the solution?
>