On 5/25/07, Bolle, Jeffrey F. <[EMAIL PROTECTED]> wrote:
> Good to hear that I'm not insane. Unfortunately, I've run into another
> odd problem.
>
> In re-checking my configuration files and making some tweaks according
> to the wiki pages, the crawl is now stopping before going anywhere.
> After the injection I get:
> Generator: 0 records selected for fetching, exiting...
> Stopping at depth=0 - no more URLs to fetch.
>
> I thought maybe there was an issue with my crawl url filter, so I
> commented everything out except the line that blocks unrecognized
> extensions and the line that stops mailto: links. The only other line
> is +.
>
> My seed list is short (only 4 urls on a small internal network), but
> none of them should be excluded by anything now.
>
> Any hints on where to look to see why the Generator is dying? What the
> heck am I doing so wrong? My previous crawls (with a single machine)
> generally worked, but could take up to a week to complete. The new
> cluster was supposed to fix that and make this easier...
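(For reference, a crawl-urlfilter.txt trimmed down the way described
above would look roughly like the sketch below. The extension list is
illustrative rather than the stock Nutch one, and rules apply top to
bottom with the first match winning, so the two exclusions have to come
before the catch-all +. line.)

    # block extensions the parsers can't handle (illustrative list)
    -\.(gif|GIF|jpg|JPG|png|PNG|ico|css|zip|gz|exe|mov)$

    # skip mailto: links
    -^mailto:

    # accept everything else
    +.

(With a catch-all +. at the end and no host restriction, a filter like
this is unlikely to reject the four seeds, which points away from the
URL filter and toward the generate step itself.)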
It looks like your problem is related to
https://issues.apache.org/jira/browse/NUTCH-246 .

> Jeff
>
>
> -----Original Message-----
> From: Doğacan Güney [mailto:[EMAIL PROTECTED]
> Sent: Friday, May 25, 2007 10:13 AM
> To: [EMAIL PROTECTED]
> Subject: Re: Clustered crawl
>
> Hi,
>
> On 5/25/07, Bolle, Jeffrey F. <[EMAIL PROTECTED]> wrote:
> > Is there a good explanation someone can point me to as to why, when
> > I set up a hadoop cluster, my entire site isn't crawled? It doesn't
> > make sense that I should have to tweak the number of hadoop map and
> > reduce tasks in order to ensure that everything gets indexed.
>
> And you shouldn't. The number of map and reduce tasks may affect
> crawling speed, but it doesn't affect the number of crawled urls.
>
> > I followed the tutorial here:
> > http://wiki.apache.org/nutch/NutchHadoopTutorial and have found that
> > only a small portion of my site was indexed. Besides explicitly
> > stating every URL on the site, what should I do to ensure that my
> > hadoop cluster (of only 4 machines) manages to create a full index?
>
> Does it work on a single machine? If it does, then this is very weird.
>
> Here are a couple of things to try:
> * After injecting urls, do a readdb -stats to count the number of
>   injected urls.
> * After generating, do a readseg -list <segment> to count the number
>   of generated urls.
> * If the number of urls in your segment is correct, then during
>   fetching check the number of successfully fetched urls in the web
>   UI (perhaps the cluster machines can't fetch those urls?).
>
> > Thanks for the help.
> >
> > Jeff
>
> --
> Doğacan Güney

--
Doğacan Güney
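(Spelled out as Nutch commands, the checks suggested above might look
like the following; the crawl/crawldb and segment paths are
placeholders for whatever layout the crawl actually uses.)

    # count injected urls (see TOTAL urls / db_unfetched in the output)
    bin/nutch readdb crawl/crawldb -stats

    # count the entries in the segment produced by the generate step
    bin/nutch readseg -list crawl/segments/20070525101300

(If readdb -stats already shows zero or only a handful of unfetched
urls, the problem is upstream of the Generator; if the crawldb counts
look right but the segment is empty, the generate step itself, e.g. the
behaviour tracked in NUTCH-246 above, is the place to dig.)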
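(As for the map and reduce task counts discussed above: in this era of
Hadoop they are normally set in conf/hadoop-site.xml. A minimal sketch,
with purely illustrative values for a small 4-node cluster:)

    <!-- conf/hadoop-site.xml: values are illustrative, not a recommendation -->
    <property>
      <name>mapred.map.tasks</name>
      <value>8</value>
    </property>

    <property>
      <name>mapred.reduce.tasks</name>
      <value>4</value>
    </property>

(As the thread says, these settings govern how the work is split across
the cluster, and so mainly affect throughput; they should not change
which urls end up crawled.)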
