Hi, these are the crawldb stats:
Statistics for CrawlDb: crawl/crawldb
TOTAL urls: 871597
retry 0:    816227
retry 1:    1
retry 3:    55369
min score:  0.0
avg score:  0.0
max score:  22.0
status 1 (db_unfetched): 2
status 2 (db_fetched):   816179
status 3 (db_gone):      55369
status 5 (db_redir_perm): 47
CrawlDb statistics: done

and this is the top output (while selecting):

top - 13:06:48 up 4 days, 23:42, 2 users, load average: 1.93, 1.86, 1.78
Tasks: 72 total, 2 running, 70 sleeping, 0 stopped, 0 zombie
Cpu(s): 99.0%us, 1.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem:   970596k total,   959372k used,    11224k free,    2592k buffers
Swap: 1951856k total,    15692k used,  1936164k free,  748784k cached
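In case it helps anyone reproducing this: the stats block above comes from the crawldb reader tool, and swap activity can be watched in a second terminal while the select job runs. A minimal sketch, assuming a stock Nutch 0.9 layout with the crawl data under crawl/ (paths are just examples):

   # print crawldb statistics (this is what produced the block above)
   bin/nutch readdb crawl/crawldb -stats

   # watch swap-in/swap-out (the si/so columns) every 5 seconds while selecting
   vmstat 5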
Doğacan Güney wrote:
> Hi,
>
> On 7/26/07, Luca Rondanini <[EMAIL PROTECTED]> wrote:
>
>>Hi Doğacan,
>>You can find the log at http://www.translated.net/hadoop.log.generate
>>
>>It's just the output of the generate step...
>>...maybe you can help me to summarize the log's key points for future
>>readers!!
>
> It's selecting (not partitioning) that takes most of the time, which
> is actually expected, since selecting has to process the entire
> crawldb. Still, more than 1 hour spent in the Selector's map is weird.
> What is the size of your crawldb? Also, have you measured memory
> consumption (perhaps selecting hits swap a lot)?
>
>>Thanks,
>>Luca
>>
>>Luca Rondanini
>>
>>Doğacan Güney wrote:
>>
>>>On 7/25/07, Luca Rondanini <[EMAIL PROTECTED]> wrote:
>>>
>>>>this is my hadoop log (just in case):
>>>>
>>>>2007-07-25 13:19:57,040 INFO crawl.Generator - Generator: starting
>>>>2007-07-25 13:19:57,041 INFO crawl.Generator - Generator: segment:
>>>>/home/semantix/nutch-0.9/crawl/segments/20070725131957
>>>>2007-07-25 13:19:57,041 INFO crawl.Generator - Generator: filtering: true
>>>>2007-07-25 13:19:57,041 INFO crawl.Generator - Generator: topN: 50000
>>>>2007-07-25 13:19:57,095 INFO crawl.Generator - Generator: jobtracker is
>>>>'local', generating exactly one partition.
>>>>2007-07-25 14:42:03,509 INFO crawl.Generator - Generator: Partitioning
>>>>selected urls by host, for politeness.
>>>>2007-07-25 14:42:13,909 INFO crawl.Generator - Generator: done.
>>>
>>>Wow, partitioning 50000 URLs takes 1.5 hours! Looking at the code I
>>>don't see any obvious mistakes, but perhaps I am missing something.
>>>
>>>Could you lower Hadoop's log level to INFO (change it in
>>>log4j.properties from WARN to INFO)? IIRC, at INFO level, Hadoop
>>>shows how much of the maps and reduces are completed, so we may see
>>>what is taking so long. If you can attach a profiler and show us which
>>>method is taking all the time, that would make things *a lot* easier.
>>>
>>>>Luca Rondanini
>>>>
>>>>Emmanuel wrote:
>>>>
>>>>>The partition is slow.
>>>>>
>>>>>Any idea?
>>>>>
>>>>>>On 7/25/07, Emmanuel <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>>>I've got the same problem. It takes ages to generate a list of urls
>>>>>>>to fetch with a DB of 2M urls.
>>>>>>>
>>>>>>>Do you mean that it will be faster if we configure the generation
>>>>>>>by IP?
>>>>>>
>>>>>>No, it should be faster if you partition by host :)
>>>>>>
>>>>>>Which job is slow, select or partition?
>>>>>>
>>>>>>>I've read nutch-default.xml. It said that we have to be careful
>>>>>>>because it can generate a lot of DNS requests to the host.
>>>>>>>Does it mean that I have to configure a local caching DNS on my
>>>>>>>server?
>>>>>>>
>>>>>>>>On 7/25/07, Luca Rondanini <[EMAIL PROTECTED]> wrote:
>>>>>>>>
>>>>>>>>>Hi all,
>>>>>>>>>The generate step of my crawl process is taking more than 2
>>>>>>>>>hours... is it normal?
>>>>>>>>
>>>>>>>>Are you partitioning urls by ip or by host?
>>>>>>>>
>>>>>>>>>this is my stat report:
>>>>>>>>>
>>>>>>>>>CrawlDb statistics start: crawl/crawldb
>>>>>>>>>Statistics for CrawlDb: crawl/crawldb
>>>>>>>>>TOTAL urls: 586860
>>>>>>>>>retry 0:    578159
>>>>>>>>>retry 1:    1983
>>>>>>>>>retry 2:    2017
>>>>>>>>>retry 3:    4701
>>>>>>>>>min score:  0.0
>>>>>>>>>avg score:  0.0
>>>>>>>>>max score:  1.0
>>>>>>>>>status 1 (db_unfetched): 164849
>>>>>>>>>status 2 (db_fetched):   417306
>>>>>>>>>status 3 (db_gone):      4701
>>>>>>>>>status 5 (db_redir_perm): 4
>>>>>>>>>CrawlDb statistics: done
>>>>>>>>>
>>>>>>>>>Luca Rondanini
>>>>>>>>
>>>>>>>>--
>>>>>>>>Doğacan Güney
>>>>>>
>>>>>>--
>>>>>>Doğacan Güney
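For anyone who wants to try the two suggestions above (INFO logging and a profiler), this is roughly what they look like. A sketch, assuming the stock conf/log4j.properties shipped with Nutch 0.9 and that bin/nutch passes NUTCH_OPTS from the environment to the JVM; both the logger name and the script vary, so check your copies:

   # conf/log4j.properties: lower Hadoop's log level so map/reduce progress
   # is printed, i.e. change the Hadoop logger from WARN to INFO:
   #   log4j.logger.org.apache.hadoop=WARN    (before)
   #   log4j.logger.org.apache.hadoop=INFO    (after)

   # a cheap profiler for local mode: the JDK's built-in hprof agent;
   # it writes a sampled CPU profile to java.hprof.txt on JVM exit
   NUTCH_OPTS="-agentlib:hprof=cpu=samples,interval=10,depth=8" \
     bin/nutch generate crawl/crawldb crawl/segments -topN 50000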

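On the by-IP/DNS question in the quoted thread: the nutch-default.xml warning Emmanuel mentions presumably refers to counting hosts by their IP address during selection, which costs a DNS lookup per host. A sketch of how to check the setting, assuming the 0.9 property name generate.max.per.host.by.ip (verify against your copy, and override it in conf/nutch-site.xml if you need to change it):

   # show the property and its description; false (the default) counts by
   # host name and avoids per-host DNS lookups during the select phase
   grep -A 5 'generate.max.per.host.by.ip' conf/nutch-default.xml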