Hi,

On 7/26/07, Luca Rondanini <[EMAIL PROTECTED]> wrote:
> Hi Doğacan,
> You can find the log at http://www.translated.net/hadoop.log.generate
>
> It's just the output of the generate step...
> ...maybe you can help me to summarize the log's key points for future
> readers!!
It's selecting (not partitioning) that takes most of the time, which is
actually expected, since selecting has to process the entire crawldb.
Still, more than an hour spent in Selector's map is weird. What is the
size of your crawldb? Also, have you measured memory consumption
(perhaps selecting hits swap a lot)? A couple of quick ways to check
both are sketched after the thread below.

>
> Thanks,
> Luca
>
> Luca Rondanini
>
> Doğacan Güney wrote:
> > On 7/25/07, Luca Rondanini <[EMAIL PROTECTED]> wrote:
> >
> >> this is my hadoop log (just in case):
> >>
> >> 2007-07-25 13:19:57,040 INFO crawl.Generator - Generator: starting
> >> 2007-07-25 13:19:57,041 INFO crawl.Generator - Generator: segment:
> >> /home/semantix/nutch-0.9/crawl/segments/20070725131957
> >> 2007-07-25 13:19:57,041 INFO crawl.Generator - Generator: filtering: true
> >> 2007-07-25 13:19:57,041 INFO crawl.Generator - Generator: topN: 50000
> >> 2007-07-25 13:19:57,095 INFO crawl.Generator - Generator: jobtracker is
> >> 'local', generating exactly one partition.
> >> 2007-07-25 14:42:03,509 INFO crawl.Generator - Generator: Partitioning
> >> selected urls by host, for politeness.
> >> 2007-07-25 14:42:13,909 INFO crawl.Generator - Generator: done.
> >
> > Wow, partitioning 50000 URLs takes 1.5 hours! Looking at the code I
> > don't see any obvious mistakes, but perhaps I am missing something.
> >
> > Could you lower hadoop's log level to INFO (change hadoop's log level
> > in log4j.properties from WARN to INFO)? IIRC, at INFO level hadoop
> > shows how much of the maps and reduces are completed, so we may see
> > what is taking so long. If you can attach a profiler and show us
> > which method is taking all the time, that would make things *a lot*
> > easier.
> >
> >> Luca Rondanini
> >>
> >> Emmanuel wrote:
> >> > The partition is slow.
> >> >
> >> > Any idea?
> >> >
> >> >> On 7/25/07, Emmanuel <[EMAIL PROTECTED]> wrote:
> >> >>
> >> >>> I've got the same problem. It takes ages to generate a list of
> >> >>> urls to fetch with a DB of 2M urls.
> >> >>>
> >> >>> Do you mean that it will be faster if we configure the generation
> >> >>> by IP?
> >> >>
> >> >> No, it should be faster if you partition by host :)
> >> >>
> >> >> Which job is slow, select or partition?
> >> >>
> >> >>> I've read nutch-default.xml. It says that we have to be careful
> >> >>> because it can generate a lot of DNS requests to the host.
> >> >>> Does it mean that I have to configure a local DNS cache on my
> >> >>> server?
> >> >>>
> >> >>> > On 7/25/07, Luca Rondanini <[EMAIL PROTECTED]> wrote:
> >> >>> >> Hi all,
> >> >>> >> The generate step of my crawl process is taking more than 2
> >> >>> >> hours... is it normal?
> >> >>> >
> >> >>> > Are you partitioning urls by ip or by host?
> >> >>> >
> >> >>> >> this is my stat report:
> >> >>> >>
> >> >>> >> CrawlDb statistics start: crawl/crawldb
> >> >>> >> Statistics for CrawlDb: crawl/crawldb
> >> >>> >> TOTAL urls:               586860
> >> >>> >> retry 0:                  578159
> >> >>> >> retry 1:                  1983
> >> >>> >> retry 2:                  2017
> >> >>> >> retry 3:                  4701
> >> >>> >> min score:                0.0
> >> >>> >> avg score:                0.0
> >> >>> >> max score:                1.0
> >> >>> >> status 1 (db_unfetched):  164849
> >> >>> >> status 2 (db_fetched):    417306
> >> >>> >> status 3 (db_gone):       4701
> >> >>> >> status 5 (db_redir_perm): 4
> >> >>> >> CrawlDb statistics: done
> >> >>> >>
> >> >>> >> Luca Rondanini
> >> >>> >
> >> >>> > --
> >> >>> > Doğacan Güney
> >> >>
> >> >> --
> >> >> Doğacan Güney

--
Doğacan Güney
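For the two questions above (crawldb size and whether selecting hits
swap), a couple of rough checks, assuming the crawldb sits on the local
filesystem:

    du -sh crawl/crawldb   # on-disk size of the crawldb
    vmstat 5               # run this while generate is running; sustained
                           # non-zero si/so columns mean the JVM is swapping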
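The log-level change suggested earlier in the thread amounts to a
one-line edit in Nutch's conf/log4j.properties. A minimal sketch,
assuming the stock layout in which hadoop's logger is set to WARN:

    # conf/log4j.properties
    # before: log4j.logger.org.apache.hadoop=WARN
    log4j.logger.org.apache.hadoop=INFO

At INFO level the local jobtracker logs map/reduce completion
percentages, which should show whether Selector's map or reduce phase
is the slow one.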
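For the profiler suggestion: the log says "jobtracker is 'local'", so
the whole generate job runs in a single JVM, and a sampling profiler can
be attached with plain JVM flags. A sketch using the JDK's built-in
hprof agent; the NUTCH_OPTS variable here is an assumption, so if your
bin/nutch script doesn't pass it through, add the flag to the java
invocation directly:

    # sample CPU while generate runs; results land in java.hprof.txt
    export NUTCH_OPTS="-agentlib:hprof=cpu=samples,interval=10,depth=8"
    bin/nutch generate crawl/crawldb crawl/segments -topN 50000

The CPU SAMPLES section of java.hprof.txt ranks the traces by time
spent, which should point at the method eating the hour in Selector.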
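On the by-IP question: the nutch-default.xml warning mentioned above
concerns resolving every host to an IP during generation, which costs
one DNS lookup per host unless a local caching DNS server is available.
Keeping generation host-based avoids the lookups entirely; a sketch for
conf/nutch-site.xml, with the property name as it appears in
nutch-default.xml of this era (check your copy):

    <!-- conf/nutch-site.xml -->
    <property>
      <name>generate.max.per.host.by.ip</name>
      <value>false</value>
      <!-- false: count and partition by host name (no DNS lookups);
           true: resolve each host to its IP first -->
    </property>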
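And for future readers, a stat report like Luca's comes from the
CrawlDb reader tool:

    bin/nutch readdb crawl/crawldb -stats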
