On 5/25/07, Bolle, Jeffrey F. <[EMAIL PROTECTED]> wrote:
> Good to hear that I'm not insane. Unfortunately, I've run into another
> odd problem.
>
> In re-checking my configuration files and making some tweaks according
> to the wiki pages, the crawl is now stopping before going anywhere.
> After the injection I get:
> Generator: 0 records selected for fetching, exiting...
> Stopping at depth=0 - no more URLs to fetch.
>
> I thought maybe there was an issue with my crawl url filter, so I
> commented everything out except the line that blocks unrecognized
> extensions and the line that stops mailto: links. The only other line
> is +.
>
> My seed list is short (only 4 urls on a small internal network), but
> none of them should be excluded by anything now.
>
> Any hints on where to look to see why the Generator is dying? What the
> heck am I doing so wrong? My previous crawls (with a single machine)
> generally worked, but could take up to a week to complete. The new
> cluster was supposed to fix that and make this easier...
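(For reference, a crawl-urlfilter.txt trimmed down the way described
above would look roughly like the sketch below. The extension list is
illustrative rather than the stock Nutch one, and rules apply top to
bottom with the first match winning, so the two exclusions have to come
before the catch-all +. line.)

    # block extensions the parsers can't handle (illustrative list)
    -\.(gif|GIF|jpg|JPG|png|PNG|ico|css|zip|gz|exe|mov)$

    # skip mailto: links
    -^mailto:

    # accept everything else
    +.

(With a catch-all +. at the end and no host restriction, a filter like
this is unlikely to reject the four seeds, which points away from the
URL filter and toward the generate step itself.)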
It looks like your problem is related to
https://issues.apache.org/jira/browse/NUTCH-246 .

> Jeff
>
>
> -----Original Message-----
> From: Doğacan Güney [mailto:[EMAIL PROTECTED]
> Sent: Friday, May 25, 2007 10:13 AM
> To: [EMAIL PROTECTED]
> Subject: Re: Clustered crawl
>
> Hi,
>
> On 5/25/07, Bolle, Jeffrey F. <[EMAIL PROTECTED]> wrote:
> > Is there a good explanation someone can point me to as to why, when
> > I set up a hadoop cluster, my entire site isn't crawled? It doesn't
> > make sense that I should have to tweak the number of hadoop map and
> > reduce tasks in order to ensure that everything gets indexed.
>
> And you shouldn't. The number of map and reduce tasks may affect
> crawling speed, but it doesn't affect the number of crawled urls.
>
> > I followed the tutorial here:
> > http://wiki.apache.org/nutch/NutchHadoopTutorial and have found that
> > only a small portion of my site was indexed. Besides explicitly
> > stating every URL on the site, what should I do to ensure that my
> > hadoop cluster (of only 4 machines) manages to create a full index?
>
> Does it work on a single machine? If it does, then this is very weird.
>
> Here are a couple of things to try:
> * After injecting urls, do a readdb -stats to count the number of
>   injected urls.
> * After generating, do a readseg -list <segment> to count the number
>   of generated urls.
> * If the number of urls in your segment is correct, then during
>   fetching check the number of successfully fetched urls in the web
>   UI (perhaps the cluster machines can't fetch those urls?).
>
> > Thanks for the help.
> >
> > Jeff
>
> --
> Doğacan Güney

--
Doğacan Güney
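(Spelled out as Nutch commands, the checks suggested above might look
like the following; the crawl/crawldb and segment paths are
placeholders for whatever layout the crawl actually uses.)

    # count injected urls (see TOTAL urls / db_unfetched in the output)
    bin/nutch readdb crawl/crawldb -stats

    # count the entries in the segment produced by the generate step
    bin/nutch readseg -list crawl/segments/20070525101300

(If readdb -stats already shows zero or only a handful of unfetched
urls, the problem is upstream of the Generator; if the crawldb counts
look right but the segment is empty, the generate step itself, e.g. the
behaviour tracked in NUTCH-246 above, is the place to dig.)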
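(As for the map and reduce task counts discussed above: in this era of
Hadoop they are normally set in conf/hadoop-site.xml. A minimal sketch,
with purely illustrative values for a small 4-node cluster:)

    <!-- conf/hadoop-site.xml: values are illustrative, not a recommendation -->
    <property>
      <name>mapred.map.tasks</name>
      <value>8</value>
    </property>

    <property>
      <name>mapred.reduce.tasks</name>
      <value>4</value>
    </property>

(As the thread says, these settings govern how the work is split across
the cluster, and so mainly affect throughput; they should not change
which urls end up crawled.)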
