On 7/25/07, Emmanuel <[EMAIL PROTECTED]> wrote:
> I've got the same problem. It takes ages to generate a list of URLs to fetch
> with a DB of 2M URLs.
>
> Do you mean that it will be faster if we configure generation by IP?

No, it should be faster if you partition by host :)
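
The DNS warning in nutch-default.xml is about counting/partitioning hosts by
their resolved IP addresses: in that mode the generator has to do a DNS lookup
per host, which is what makes it slow. Below is a minimal sketch of the
overrides for conf/nutch-site.xml; the property names are what I remember from
nutch-default.xml in 0.9, so double-check them against your own copy:

  <property>
    <name>generate.max.per.host</name>
    <value>100</value>
    <!-- example cap on urls per host in one fetchlist -->
  </property>
  <property>
    <name>generate.max.per.host.by.ip</name>
    <value>false</value>
    <!-- false: hosts are counted by name, so generate does no DNS lookups -->
  </property>

With the by.ip setting left at false the generator only compares host names,
so no local DNS cache should be needed for this step.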

Which job is slow, select or partition?
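
(For reference, the generate invocation I would test with looks roughly like
the line below; the exact options depend on your version, so treat it as a
sketch. -topN bounds how many urls the select job has to sort, so dropping it
to something small is a quick way to see whether selection or partitioning is
the expensive part.)

  bin/nutch generate crawl/crawldb crawl/segments -topN 100000 -numFetchers 4

The select job scans the whole crawldb and sorts candidates by score; the
partition job then splits the selected urls into per-fetcher lists.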

>
> I've read nutch-default.xml. It says that we have to be careful because
> it can generate a lot of DNS requests to the hosts.
> Does it mean that I have to configure a local DNS cache on my server?
>
> > On 7/25/07, Luca Rondanini <[EMAIL PROTECTED]> wrote:
> >> Hi all,
> >> The generate step of my crawl process is taking more than 2 hours... is it
> >> normal?
> >
> > Are you partitioning urls by ip or by host?
> >
> >>
> >> this is my stat report:
> >>
> >> CrawlDb statistics start: crawl/crawldb
> >> Statistics for CrawlDb: crawl/crawldb
> >> TOTAL urls:     586860
> >> retry 0:        578159
> >> retry 1:        1983
> >> retry 2:        2017
> >> retry 3:        4701
> >> min score:      0.0
> >> avg score:      0.0
> >> max score:      1.0
> >> status 1 (db_unfetched):        164849
> >> status 2 (db_fetched):  417306
> >> status 3 (db_gone):     4701
> >> status 5 (db_redir_perm):       4
> >> CrawlDb statistics: done
> >>
> >>
> >>
> >>
> >>
> >> Luca Rondanini
> >>
> >>
> >
> >
> > --
> > Doğacan Güney
> >
>


-- 
Doğacan Güney