Hi, these are the crawldb stats:
Statistics for CrawlDb: crawl/crawldb
TOTAL urls: 871597
retry 0:    816227
retry 1:    1
retry 3:    55369
min score:  0.0
avg score:  0.0
max score:  22.0
status 1 (db_unfetched): 2
status 2 (db_fetched):   816179
status 3 (db_gone):      55369
status 5 (db_redir_perm): 47
CrawlDb statistics: done

and this is the top output (while selecting):

top - 13:06:48 up 4 days, 23:42, 2 users, load average: 1.93, 1.86, 1.78
Tasks: 72 total, 2 running, 70 sleeping, 0 stopped, 0 zombie
Cpu(s): 99.0%us, 1.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem:   970596k total,   959372k used,    11224k free,    2592k buffers
Swap: 1951856k total,    15692k used,  1936164k free,  748784k cached
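In case it helps anyone reproducing this: the stats block above comes from the crawldb reader tool, and swap activity can be watched in a second terminal while the select job runs. A minimal sketch, assuming a stock Nutch 0.9 layout with the crawl data under crawl/ (paths are just examples):

   # print crawldb statistics (this is what produced the block above)
   bin/nutch readdb crawl/crawldb -stats

   # watch swap-in/swap-out (the si/so columns) every 5 seconds while selecting
   vmstat 5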
Doğacan Güney wrote:
> Hi,
>
> On 7/26/07, Luca Rondanini <[EMAIL PROTECTED]> wrote:
>
>>Hi Doğacan,
>>You can find the log at http://www.translated.net/hadoop.log.generate
>>
>>It's just the output of the generate step...
>>...maybe you can help me to summarize the log's key points for future
>>readers!!
>
> It's selecting (not partitioning) that takes most of the time, which
> is actually expected, since selecting has to process the entire
> crawldb. Still, more than 1 hour spent in the Selector's map is weird.
> What is the size of your crawldb? Also, have you measured memory
> consumption (perhaps selecting hits swap a lot)?
>
>>Thanks,
>>Luca
>>
>>Luca Rondanini
>>
>>Doğacan Güney wrote:
>>
>>>On 7/25/07, Luca Rondanini <[EMAIL PROTECTED]> wrote:
>>>
>>>>this is my hadoop log (just in case):
>>>>
>>>>2007-07-25 13:19:57,040 INFO crawl.Generator - Generator: starting
>>>>2007-07-25 13:19:57,041 INFO crawl.Generator - Generator: segment:
>>>>/home/semantix/nutch-0.9/crawl/segments/20070725131957
>>>>2007-07-25 13:19:57,041 INFO crawl.Generator - Generator: filtering: true
>>>>2007-07-25 13:19:57,041 INFO crawl.Generator - Generator: topN: 50000
>>>>2007-07-25 13:19:57,095 INFO crawl.Generator - Generator: jobtracker is
>>>>'local', generating exactly one partition.
>>>>2007-07-25 14:42:03,509 INFO crawl.Generator - Generator: Partitioning
>>>>selected urls by host, for politeness.
>>>>2007-07-25 14:42:13,909 INFO crawl.Generator - Generator: done.
>>>
>>>Wow, partitioning 50000 URLs takes 1.5 hours! Looking at the code I
>>>don't see any obvious mistakes, but perhaps I am missing something.
>>>
>>>Could you lower Hadoop's log level to INFO (change it in
>>>log4j.properties from WARN to INFO)? IIRC, at INFO level, Hadoop
>>>shows how much of the maps and reduces are completed, so we may see
>>>what is taking so long. If you can attach a profiler and show us which
>>>method is taking all the time, that would make things *a lot* easier.
>>>
>>>>Luca Rondanini
>>>>
>>>>Emmanuel wrote:
>>>>
>>>>>The partition is slow.
>>>>>
>>>>>Any idea?
>>>>>
>>>>>>On 7/25/07, Emmanuel <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>>>I've got the same problem. It takes ages to generate a list of urls
>>>>>>>to fetch with a DB of 2M urls.
>>>>>>>
>>>>>>>Do you mean that it will be faster if we configure the generation
>>>>>>>by IP?
>>>>>>
>>>>>>No, it should be faster if you partition by host :)
>>>>>>
>>>>>>Which job is slow, select or partition?
>>>>>>
>>>>>>>I've read nutch-default.xml. It said that we have to be careful
>>>>>>>because it can generate a lot of DNS requests to the host.
>>>>>>>Does it mean that I have to configure a local caching DNS on my
>>>>>>>server?
>>>>>>>
>>>>>>>>On 7/25/07, Luca Rondanini <[EMAIL PROTECTED]> wrote:
>>>>>>>>
>>>>>>>>>Hi all,
>>>>>>>>>The generate step of my crawl process is taking more than 2
>>>>>>>>>hours... is it normal?
>>>>>>>>
>>>>>>>>Are you partitioning urls by ip or by host?
>>>>>>>>
>>>>>>>>>this is my stat report:
>>>>>>>>>
>>>>>>>>>CrawlDb statistics start: crawl/crawldb
>>>>>>>>>Statistics for CrawlDb: crawl/crawldb
>>>>>>>>>TOTAL urls: 586860
>>>>>>>>>retry 0:    578159
>>>>>>>>>retry 1:    1983
>>>>>>>>>retry 2:    2017
>>>>>>>>>retry 3:    4701
>>>>>>>>>min score:  0.0
>>>>>>>>>avg score:  0.0
>>>>>>>>>max score:  1.0
>>>>>>>>>status 1 (db_unfetched): 164849
>>>>>>>>>status 2 (db_fetched):   417306
>>>>>>>>>status 3 (db_gone):      4701
>>>>>>>>>status 5 (db_redir_perm): 4
>>>>>>>>>CrawlDb statistics: done
>>>>>>>>>
>>>>>>>>>Luca Rondanini
>>>>>>>>
>>>>>>>>--
>>>>>>>>Doğacan Güney
>>>>>>
>>>>>>--
>>>>>>Doğacan Güney
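For anyone who wants to try the two suggestions above (INFO logging and a profiler), this is roughly what they look like. A sketch, assuming the stock conf/log4j.properties shipped with Nutch 0.9 and that bin/nutch passes NUTCH_OPTS from the environment to the JVM; both the logger name and the script vary, so check your copies:

   # conf/log4j.properties: lower Hadoop's log level so map/reduce progress
   # is printed, i.e. change the Hadoop logger from WARN to INFO:
   #   log4j.logger.org.apache.hadoop=WARN    (before)
   #   log4j.logger.org.apache.hadoop=INFO    (after)

   # a cheap profiler for local mode: the JDK's built-in hprof agent;
   # it writes a sampled CPU profile to java.hprof.txt on JVM exit
   NUTCH_OPTS="-agentlib:hprof=cpu=samples,interval=10,depth=8" \
     bin/nutch generate crawl/crawldb crawl/segments -topN 50000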

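On the by-IP/DNS question in the quoted thread: the nutch-default.xml warning Emmanuel mentions presumably refers to counting hosts by their IP address during selection, which costs a DNS lookup per host. A sketch of how to check the setting, assuming the 0.9 property name generate.max.per.host.by.ip (verify against your copy, and override it in conf/nutch-site.xml if you need to change it):

   # show the property and its description; false (the default) counts by
   # host name and avoids per-host DNS lookups during the select phase
   grep -A 5 'generate.max.per.host.by.ip' conf/nutch-default.xml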