On 7/25/07, Emmanuel <[EMAIL PROTECTED]> wrote:
> I've got the same problem. It takes ages to generate a list of URLs to fetch
> with a DB of 2M URLs.
>
> Do you mean that it will be faster if we configure the generation by IP?

No, it should be faster if you partition by host :)

Which job is slow, select or partition?

>
> I've read nutch-default.xml. It says that we have to be careful because
> it can generate a lot of DNS requests to the host.
> Does that mean I have to configure a local caching DNS on my server?
>
>
> On 7/25/07, Luca Rondanini <[EMAIL PROTECTED]> wrote:
> >> Hi all,
> >> The generate step of my crawl process is taking more than 2 hours. Is that
> >> normal?
> >
> > Are you partitioning urls by ip or by host?
> >
> >>
> >> this is my stat report:
> >>
> >> CrawlDb statistics start: crawl/crawldb
> >> Statistics for CrawlDb: crawl/crawldb
> >> TOTAL urls: 586860
> >> retry 0: 578159
> >> retry 1: 1983
> >> retry 2: 2017
> >> retry 3: 4701
> >> min score: 0.0
> >> avg score: 0.0
> >> max score: 1.0
> >> status 1 (db_unfetched): 164849
> >> status 2 (db_fetched): 417306
> >> status 3 (db_gone): 4701
> >> status 5 (db_redir_perm): 4
> >> CrawlDb statistics: done
> >>
> >> Luca Rondanini
> >>
> >
> > --
> > Doğacan Güney
>

--
Doğacan Güney
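
To illustrate why partitioning by host avoids the DNS traffic mentioned in the thread, here is a rough Python sketch. It is not Nutch's code: the function names, the CRC32 hash, and the `resolve` callback are all assumptions made for illustration. The only point is that a host-based partition is computed from the URL string alone, while an IP-based partition needs one name lookup per URL unless something caches the answers.

```python
import zlib
from urllib.parse import urlsplit


def host_partition(url, num_partitions):
    """Choose a partition from the host name alone; no network access needed.

    Hypothetical helper for illustration, not part of Nutch.
    """
    host = urlsplit(url).hostname or ""
    return zlib.crc32(host.encode("utf-8")) % num_partitions


def ip_partition(url, num_partitions, resolve):
    """Choose a partition from the resolved IP address.

    `resolve` is an assumed callback (for example socket.gethostbyname)
    and is called once per URL, which is where the DNS load comes from.
    """
    host = urlsplit(url).hostname or ""
    ip = resolve(host)
    return zlib.crc32(ip.encode("utf-8")) % num_partitions
```

With a real resolver such as `socket.gethostbyname` passed as `resolve`, every URL in the CrawlDb would trigger a lookup, which is why a local caching DNS comes up when partitioning by IP.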
