On 5/25/07, Bolle, Jeffrey F. <[EMAIL PROTECTED]> wrote:
Good to hear that I'm not insane. Unfortunately, I've run into another
odd problem.
After re-checking my configuration files and making some tweaks according
to the wiki pages, the crawl now stops before going anywhere.
After the injection I get:
Generator: 0 records selected for fetching, exiting...
Stopping at depth=0 - no more URLs to fetch.
I thought maybe there was an issue with my crawl URL filter, so I
commented everything out except the line that blocks unrecognized
extensions and the line that skips mailto: URLs. The only other line is +.
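Roughly, the lines left uncommented in my crawl-urlfilter.txt now look
like this (typing from memory, so the exact suffix list may not match my
actual file):

  # skip URLs with unrecognized suffixes
  -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|exe|zip|gz|mov|MOV|ppt|xls)$
  # skip mailto: URLs
  -^mailto:
  # accept everything else
  +.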
My seed list is short (only 4 URLs on a small internal network), and none
of them should be excluded by anything now.
Any hints on where to look and see why this Generator is dying? What
the heck am I doing so wrong? My previous crawls (with a single
machine) generally worked, but could take up to a week to complete.
The new cluster was supposed to fix that and make this easier...
It looks like your problem is related to
https://issues.apache.org/jira/browse/NUTCH-246 .
Jeff
-----Original Message-----
From: Doğacan Güney [mailto:[EMAIL PROTECTED]
Sent: Friday, May 25, 2007 10:13 AM
To: [email protected]
Subject: Re: Clustered crawl
Hi,
On 5/25/07, Bolle, Jeffrey F. <[EMAIL PROTECTED]> wrote:
> Is there a good explanation someone can point me to as to why, when I
> set up a Hadoop cluster, my entire site isn't crawled? It doesn't make
> sense that I should have to tweak the number of Hadoop map and reduce
> tasks in order to ensure that everything gets indexed.
And you shouldn't have to. The number of map and reduce tasks may affect
crawling speed, but it doesn't affect the number of crawled URLs.
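For reference, I'm assuming you mean the mapred.map.tasks and
mapred.reduce.tasks properties in hadoop-site.xml. Something like the
snippet below only changes how the work is split across the cluster, not
which URLs end up in the crawldb (the values are just examples):

  <property>
    <name>mapred.map.tasks</name>
    <value>8</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>4</value>
  </property>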
>
> I followed the tutorial here:
> http://wiki.apache.org/nutch/NutchHadoopTutorial and found that
> only a small portion of my site was indexed. Besides explicitly
> listing every URL on the site, what should I do to ensure that my
> Hadoop cluster (of only 4 machines) manages to create a full index?
Does it work on a single machine? If it does, then this is very weird.
Here are a couple of things to try:
* After injecting URLs, do a readdb -stats to count the number of
injected URLs (example commands are below).
* After generating, do a readseg -list <segment> to count the number
of generated URLs.
* If the number of URLs in your segment is correct, then during
fetching check the number of successfully fetched URLs in the web UI
(perhaps the cluster machines can't fetch those URLs?).
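For example, assuming your crawl lives under crawl/ (adjust the paths to
your own layout; the segment name below is just a placeholder):

  # count the URLs and their statuses in the crawldb
  bin/nutch readdb crawl/crawldb -stats
  # list the generated segment and its entry count
  bin/nutch readseg -list crawl/segments/20070525101500

During the fetch job, the Hadoop JobTracker web UI (usually on port 50030
of the master) shows the per-task counters.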
>
> Thanks for the help.
>
> Jeff
>
--
Doğacan Güney
--
Doğacan Güney