Hi Doğacan,

You can find the log at http://www.translated.net/hadoop.log.generate
It's just the output of the generate step... maybe you can help me summarize the log's key points for future readers! Thanks,

Luca Rondanini

Doğacan Güney wrote:
> On 7/25/07, Luca Rondanini <[EMAIL PROTECTED]> wrote:
>
>> this is my hadoop log (just in case):
>>
>> 2007-07-25 13:19:57,040 INFO crawl.Generator - Generator: starting
>> 2007-07-25 13:19:57,041 INFO crawl.Generator - Generator: segment:
>> /home/semantix/nutch-0.9/crawl/segments/20070725131957
>> 2007-07-25 13:19:57,041 INFO crawl.Generator - Generator: filtering: true
>> 2007-07-25 13:19:57,041 INFO crawl.Generator - Generator: topN: 50000
>> 2007-07-25 13:19:57,095 INFO crawl.Generator - Generator: jobtracker is
>> 'local', generating exactly one partition.
>> 2007-07-25 14:42:03,509 INFO crawl.Generator - Generator: Partitioning
>> selected urls by host, for politeness.
>> 2007-07-25 14:42:13,909 INFO crawl.Generator - Generator: done.
>
> Wow, partitioning 50000 URLs takes 1.5 hours! Looking at the code I
> don't see any obvious mistakes, but perhaps I am missing something.
>
> Could you lower Hadoop's log level to INFO (change Hadoop's log level
> in log4j.properties from WARN to INFO)? IIRC, at INFO level, Hadoop
> shows how far along the maps and reduces are, so we may see what is
> taking so long. If you can attach a profiler and show us which method
> is taking all the time, that would make things *a lot* easier.
>
>> Luca Rondanini
>>
>> Emmanuel wrote:
>> > The partition is slow.
>> >
>> > Any idea?
>> >
>> >> On 7/25/07, Emmanuel <[EMAIL PROTECTED]> wrote:
>> >>
>> >>> I've got the same problem. It takes ages to generate a list of URLs
>> >>> to fetch with a DB of 2M URLs.
>> >>>
>> >>> Do you mean that it will be faster if we configure the generation
>> >>> by IP?
>> >>
>> >> No, it should be faster if you partition by host :)
>> >>
>> >> Which job is slow, select or partition?
>> >>
>> >>> I've read the nutch-default.xml.
>> >>> It says that we have to be careful because it can generate a lot
>> >>> of DNS requests to the host.
>> >>> Does it mean that I have to configure a local caching DNS resolver
>> >>> on my server?
>> >>>
>> >>> > On 7/25/07, Luca Rondanini <[EMAIL PROTECTED]> wrote:
>> >>> >> Hi all,
>> >>> >> The generate step of my crawl process is taking more than 2
>> >>> >> hours... is it normal?
>> >>> >
>> >>> > Are you partitioning urls by ip or by host?
>> >>> >
>> >>> >> this is my stats report:
>> >>> >>
>> >>> >> CrawlDb statistics start: crawl/crawldb
>> >>> >> Statistics for CrawlDb: crawl/crawldb
>> >>> >> TOTAL urls: 586860
>> >>> >> retry 0:    578159
>> >>> >> retry 1:    1983
>> >>> >> retry 2:    2017
>> >>> >> retry 3:    4701
>> >>> >> min score:  0.0
>> >>> >> avg score:  0.0
>> >>> >> max score:  1.0
>> >>> >> status 1 (db_unfetched): 164849
>> >>> >> status 2 (db_fetched):   417306
>> >>> >> status 3 (db_gone):      4701
>> >>> >> status 5 (db_redir_perm): 4
>> >>> >> CrawlDb statistics: done
>> >>> >>
>> >>> >> Luca Rondanini
>> >>> >
>> >>> > --
>> >>> > Doğacan Güney
>> >>
>> >> --
>> >> Doğacan Güney

_______________________________________________
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general
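For future readers: the "partition by host" idea discussed in this thread can be sketched roughly as below. This is a hypothetical, simplified illustration (the class and method names are mine, not Nutch's actual partitioner), showing why every URL from a given host ends up in the same partition, which is what lets the fetcher stay polite toward each host:

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical sketch of host-based URL partitioning; not Nutch's code.
public class HostPartitionSketch {

    // Map a URL to a partition using only its host name, so all URLs
    // from one host land in the same partition (and one fetch list).
    public static int getPartition(String url, int numPartitions) {
        String host;
        try {
            host = new URL(url).getHost();
        } catch (MalformedURLException e) {
            host = ""; // unparsable URLs fall back to a single bucket
        }
        // Mask the sign bit so the modulo result is never negative.
        return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // Two URLs on the same host map to the same partition.
        System.out.println(getPartition("http://example.com/a", 4)
                == getPartition("http://example.com/b/c.html", 4)); // prints true
    }
}
```

Note that this only hashes the host string; partitioning by IP instead (the alternative raised above) would require a DNS lookup per distinct host, which is exactly why a local caching resolver comes up in the discussion.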