Hi,

On 5/25/07, Bolle, Jeffrey F. <[EMAIL PROTECTED]> wrote:
> Is there a good explanation someone can point me to as to why when I
> setup a hadoop cluster my entire site isn't crawled? It doesn't make
> sense that I should have to tweak the number of hadoop map and reduce
> tasks in order to ensure that everything gets indexed.
And you shouldn't. The number of map and reduce tasks may affect crawling
speed, but it doesn't affect the number of crawled urls.

> I followed the tutorial here:
> http://wiki.apache.org/nutch/NutchHadoopTutorial and have found that
> only a small portion of my site was indexed. Besides explicitly
> stating every URL on the site, what should I do to ensure that my
> hadoop cluster (of only 4 machines) manages to create a full index?

Does it work on a single machine? If it does, then this is very weird.
Here are a couple of things to try (example commands below):

* After injecting urls, do a readdb -stats to count the number of
  injected urls.
* After generating, do a readseg -list <segment> to count the number of
  generated urls.
* If the number of urls in your segment is correct, then during fetching
  check the number of successfully fetched urls in the web UI. (Perhaps
  the cluster machines can't fetch those urls?)
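For example (a rough sketch; the "crawl" directory and the segment
timestamp below are only placeholders, so adjust them to your own
layout):

  # count the urls known to the crawldb after injecting
  bin/nutch readdb crawl/crawldb -stats

  # count the urls generated into one segment
  bin/nutch readseg -list crawl/segments/20070525123456

While the fetch job is running, the Hadoop job tracker web UI (port
50030 by default) shows the running fetch tasks, so you can watch
there whether the slaves are actually fetching your pages.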

> Thanks for the help.
>
> Jeff

--
Doğacan Güney