Hi Doğacan,

You can find the log at http://www.translated.net/hadoop.log.generate
It's just the output of the generate step... maybe you can help me summarize the log's key points for future readers! Thanks,

Luca Rondanini

Doğacan Güney wrote:
> On 7/25/07, Luca Rondanini <[EMAIL PROTECTED]> wrote:
>
>> this is my hadoop log (just in case):
>>
>> 2007-07-25 13:19:57,040 INFO crawl.Generator - Generator: starting
>> 2007-07-25 13:19:57,041 INFO crawl.Generator - Generator: segment:
>> /home/semantix/nutch-0.9/crawl/segments/20070725131957
>> 2007-07-25 13:19:57,041 INFO crawl.Generator - Generator: filtering: true
>> 2007-07-25 13:19:57,041 INFO crawl.Generator - Generator: topN: 50000
>> 2007-07-25 13:19:57,095 INFO crawl.Generator - Generator: jobtracker is
>> 'local', generating exactly one partition.
>> 2007-07-25 14:42:03,509 INFO crawl.Generator - Generator: Partitioning
>> selected urls by host, for politeness.
>> 2007-07-25 14:42:13,909 INFO crawl.Generator - Generator: done.
>
> Wow, partitioning 50000 URLs takes 1.5 hours! Looking at the code I
> don't see any obvious mistakes, but perhaps I am missing something.
>
> Could you lower Hadoop's log level to INFO (change Hadoop's log level
> in log4j.properties from WARN to INFO)? IIRC, at INFO level, Hadoop
> shows how far along the maps and reduces are, so we may see what is
> taking so long. If you can attach a profiler and show us which method
> is taking all the time, that would make things *a lot* easier.
>
>> Luca Rondanini
>>
>> Emmanuel wrote:
>> > The partition is slow.
>> >
>> > Any idea?
>> >
>> >> On 7/25/07, Emmanuel <[EMAIL PROTECTED]> wrote:
>> >>
>> >>> I've got the same problem. It takes ages to generate a list of URLs
>> >>> to fetch with a DB of 2M URLs.
>> >>>
>> >>> Do you mean that it will be faster if we configure the generation
>> >>> by IP?
>> >>
>> >> No, it should be faster if you partition by host :)
>> >>
>> >> Which job is slow, select or partition?
>> >>
>> >>> I've read the nutch-default.xml.
>> >>> It says that we have to be careful because it can generate a lot
>> >>> of DNS requests to the host.
>> >>> Does it mean that I have to configure a local caching DNS resolver
>> >>> on my server?
>> >>>
>> >>> > On 7/25/07, Luca Rondanini <[EMAIL PROTECTED]> wrote:
>> >>> >> Hi all,
>> >>> >> The generate step of my crawl process is taking more than 2
>> >>> >> hours... is it normal?
>> >>> >
>> >>> > Are you partitioning urls by ip or by host?
>> >>> >
>> >>> >> this is my stats report:
>> >>> >>
>> >>> >> CrawlDb statistics start: crawl/crawldb
>> >>> >> Statistics for CrawlDb: crawl/crawldb
>> >>> >> TOTAL urls: 586860
>> >>> >> retry 0:    578159
>> >>> >> retry 1:    1983
>> >>> >> retry 2:    2017
>> >>> >> retry 3:    4701
>> >>> >> min score:  0.0
>> >>> >> avg score:  0.0
>> >>> >> max score:  1.0
>> >>> >> status 1 (db_unfetched): 164849
>> >>> >> status 2 (db_fetched):   417306
>> >>> >> status 3 (db_gone):      4701
>> >>> >> status 5 (db_redir_perm): 4
>> >>> >> CrawlDb statistics: done
>> >>> >>
>> >>> >> Luca Rondanini
>> >>> >
>> >>> > --
>> >>> > Doğacan Güney
>> >>
>> >> --
>> >> Doğacan Güney

_______________________________________________
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general
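For future readers: the "partition by host" idea discussed in this thread can be sketched roughly as below. This is a hypothetical, simplified illustration (the class and method names are mine, not Nutch's actual partitioner), showing why every URL from a given host ends up in the same partition, which is what lets the fetcher stay polite toward each host:

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical sketch of host-based URL partitioning; not Nutch's code.
public class HostPartitionSketch {

    // Map a URL to a partition using only its host name, so all URLs
    // from one host land in the same partition (and one fetch list).
    public static int getPartition(String url, int numPartitions) {
        String host;
        try {
            host = new URL(url).getHost();
        } catch (MalformedURLException e) {
            host = ""; // unparsable URLs fall back to a single bucket
        }
        // Mask the sign bit so the modulo result is never negative.
        return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // Two URLs on the same host map to the same partition.
        System.out.println(getPartition("http://example.com/a", 4)
                == getPartition("http://example.com/b/c.html", 4)); // prints true
    }
}
```

Note that this only hashes the host string; partitioning by IP instead (the alternative raised above) would require a DNS lookup per distinct host, which is exactly why a local caching resolver comes up in the discussion.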