Hi,

On 7/26/07, Luca Rondanini <[EMAIL PROTECTED]> wrote:
> Hi Doğacan,
> You can find the log at http://www.translated.net/hadoop.log.generate
>
> It's just the output of the generate step...
> ...maybe you can help me to summarize the log's key-points for future
> readers!!

It's the selecting (not the partitioning) that takes most of the time,
which is actually expected since selecting has to process the entire
crawldb. Still, more than an hour spent in the Selector's map is weird.
What is the size of your crawldb? Also, have you measured memory
consumption (perhaps selecting hits swap a lot)?
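
For a rough check of both (just a sketch, adjust the paths to your
layout):

  # on-disk size of the crawldb (the Selector reads all of it)
  du -sh crawl/crawldb

  # watch swap activity (the si/so columns) while generate runs
  vmstat 5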

>
> Thanks,
> Luca
>
> Luca Rondanini
>
> Doğacan Güney wrote:
> > On 7/25/07, Luca Rondanini <[EMAIL PROTECTED]> wrote:
> >
> >> this is my hadoop log(just in case):
> >>
> >> 2007-07-25 13:19:57,040 INFO  crawl.Generator - Generator: starting
> >> 2007-07-25 13:19:57,041 INFO  crawl.Generator - Generator: segment:
> >> /home/semantix/nutch-0.9/crawl/segments/20070725131957
> >> 2007-07-25 13:19:57,041 INFO  crawl.Generator - Generator: filtering:
> >> true
> >> 2007-07-25 13:19:57,041 INFO  crawl.Generator - Generator: topN: 50000
> >> 2007-07-25 13:19:57,095 INFO  crawl.Generator - Generator: jobtracker is
> >> 'local', generating exactly one partition.
> >> 2007-07-25 14:42:03,509 INFO  crawl.Generator - Generator: Partitioning
> >> selected urls by host, for politeness.
> >> 2007-07-25 14:42:13,909 INFO  crawl.Generator - Generator: done.
> >>
> >
> > Wow, partitioning 50000 URLs takes 1.5 hours! Looking at the code I
> > don't see any obvious mistakes, but perhaps I am missing something.
> >
> > Could you lower hadoop's log level to INFO (change hadoop's log level
> > in log4j.properties from WARN to INFO)? IIRC, at INFO level, hadoop
> > shows how much of the maps and reduces are completed, so we may see
> > what is taking so long. If you can attach a profiler and show us which
> > method is taking all the time, that would make things *a lot* easier.
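> >
> > Something like the following should do (a rough sketch; hprof is a
> > standard Sun JVM option, not Nutch-specific, so adjust to your JVM):
> >
> >   # conf/log4j.properties: make hadoop log map/reduce progress
> >   log4j.logger.org.apache.hadoop=INFO
> >
> >   # crude CPU sampling; in local mode everything runs in one JVM,
> >   # so it can go on the java command line in bin/nutch
> >   -Xrunhprof:cpu=samples,depth=10,file=hprof.txt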
> >
> >
> >>
> >>
> >>
> >> Luca Rondanini
> >>
> >>
> >>
> >> Emmanuel wrote:
> >> > The partition job is slow.
> >> >
> >> > Any ideas?
> >> >
> >> >> On 7/25/07, Emmanuel <[EMAIL PROTECTED]> wrote:
> >> >>
> >> >>> I've got the same problem. It takes ages to generate a list of
> >> >>> URLs to fetch with a DB of 2M URLs.
> >> >>>
> >> >>> Do you mean that it will be faster if we configure the
> >> >>> generation by IP?
> >> >>
> >> >>
> >> >> No, it should be faster if you partition by host :)
> >> >>
> >> >> Which job is slow, select or partition?
> >> >>
> >> >>>
> >> >>> I've read nutch-default.xml. It says that we have to be careful
> >> >>> because it can generate a lot of DNS requests to the hosts.
> >> >>> Does it mean that I have to configure a local caching DNS on my
> >> >>> server?
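> >> >>
> >> >> About the DNS note: IIRC that warning is the one on
> >> >> generate.max.per.host.by.ip in nutch-default.xml. With its default
> >> >> of false, hosts are counted by name and generating does no DNS
> >> >> lookups at all, so a local caching DNS only matters if you turn it
> >> >> on. Roughly, in conf/nutch-site.xml:
> >> >>
> >> >>   <property>
> >> >>     <name>generate.max.per.host.by.ip</name>
> >> >>     <value>false</value>
> >> >>   </property>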
> >> >>>
> >> >>> > On 7/25/07, Luca Rondanini <[EMAIL PROTECTED]> wrote:
> >> >>> >> Hi all,
> >> >>> >> The generate step of my crawl process is taking more than 2
> >> >>> >> hours... is it normal?
> >> >>> >
> >> >>> > Are you partitioning urls by ip or by host?
> >> >>> >
> >> >>> >>
> >> >>> >> this is my stat report:
> >> >>> >>
> >> >>> >> CrawlDb statistics start: crawl/crawldb
> >> >>> >> Statistics for CrawlDb: crawl/crawldb
> >> >>> >> TOTAL urls:     586860
> >> >>> >> retry 0:        578159
> >> >>> >> retry 1:        1983
> >> >>> >> retry 2:        2017
> >> >>> >> retry 3:        4701
> >> >>> >> min score:      0.0
> >> >>> >> avg score:      0.0
> >> >>> >> max score:      1.0
> >> >>> >> status 1 (db_unfetched):        164849
> >> >>> >> status 2 (db_fetched):  417306
> >> >>> >> status 3 (db_gone):     4701
> >> >>> >> status 5 (db_redir_perm):       4
> >> >>> >> CrawlDb statistics: done
> >> >>> >>
> >> >>> >>
> >> >>> >>
> >> >>> >>
> >> >>> >>
> >> >>> >> Luca Rondanini
> >> >>> >>
> >> >>> >>
> >> >>> >
> >> >>> >
> >> >>> > --
> >> >>> > Doğacan Güney
> >> >>> >
> >> >>>
> >> >>
> >> >>
> >> >> --
> >> >> Doğacan Güney
> >> >>
> >> >
> >>
> >
> >
>


-- 
Doğacan Güney