Hi,

On 5/25/07, Bolle, Jeffrey F. <[EMAIL PROTECTED]> wrote:
> Is there a good explanation someone can point me to as to why when I
> setup a hadoop cluster my entire site isn't crawled? It doesn't make
> sense that I should have to tweak the number of hadoop map and reduce
> tasks in order to ensure that everything gets indexed.
And you shouldn't. The number of map and reduce tasks may affect crawling
speed, but it doesn't affect the number of crawled urls.

> I followed the tutorial here:
> http://wiki.apache.org/nutch/NutchHadoopTutorial and have found that
> only a small portion of my site was indexed. Besides explicitly
> stating every URL on the site, what should I do to ensure that my
> hadoop cluster (of only 4 machines) manages to create a full index?

Does it work on a single machine? If it does, then this is very weird.
Here are a couple of things to try (example commands below):

* After injecting urls, do a readdb -stats to count the number of
  injected urls.
* After generating, do a readseg -list <segment> to count the number of
  generated urls.
* If the number of urls in your segment is correct, then during fetching
  check the number of successfully fetched urls in the web UI. (Perhaps
  the cluster machines can't fetch those urls?)
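For example (a rough sketch; the "crawl" directory and the segment
timestamp below are only placeholders, so adjust them to your own
layout):

  # count the urls known to the crawldb after injecting
  bin/nutch readdb crawl/crawldb -stats

  # count the urls generated into one segment
  bin/nutch readseg -list crawl/segments/20070525123456

While the fetch job is running, the Hadoop job tracker web UI (port
50030 by default) shows the running fetch tasks, so you can watch
there whether the slaves are actually fetching your pages.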

> Thanks for the help.
>
> Jeff

--
Doğacan Güney