The partition job is the slow one.
Any ideas?
On 7/25/07, Emmanuel <[EMAIL PROTECTED]> wrote:
I've got the same problem. It takes ages to generate a list of URLs to
fetch with a DB of 2M URLs.
Do you mean that it will be faster if we configure the generation by IP?
No, it should be faster if you partition by host :)
Which job is slow, select or partition?
I've read nutch-default.xml. It says that we have to be careful because
it can generate a lot of DNS requests to the host.
Does that mean I have to configure a local caching DNS on my server?
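
For illustration only, here is a rough sketch of why partitioning by host is cheaper than partitioning by IP. This is not Nutch's actual code: the function names and the injected `resolve` callable are hypothetical. Hashing the host name needs no network access, while grouping by IP needs one DNS lookup per distinct host, which is where a local caching DNS (or an in-process cache, as below) would help.

```python
from urllib.parse import urlparse
import zlib


def host_of(url):
    """Return the host name of a URL, or an empty string if there is none."""
    return urlparse(url).hostname or ""


def partition_by_host(url, num_partitions):
    # Hash the host name only; no DNS lookup is involved.
    return zlib.crc32(host_of(url).encode("utf-8")) % num_partitions


def make_partition_by_ip(resolve, num_partitions):
    """Build a partition function that groups URLs by resolved IP.

    `resolve` is a hypothetical callable mapping host name -> IP string.
    Each distinct host is resolved once and then served from the cache.
    """
    cache = {}

    def partition(url):
        host = host_of(url)
        if host not in cache:
            cache[host] = resolve(host)  # the expensive step: one DNS request
        return zlib.crc32(cache[host].encode("utf-8")) % num_partitions

    return partition
```

In a real run, `resolve` could be something like `socket.gethostbyname`. Without a cache, every URL would trigger its own lookup, which matches the kind of DNS load that the nutch-default.xml warning describes.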
> On 7/25/07, Luca Rondanini <[EMAIL PROTECTED]> wrote:
>> Hi all,
>> The generate step of my crawl process is taking more than 2 hours... is
>> that normal?
>
> Are you partitioning urls by ip or by host?
>
>>
>> this is my stat report:
>>
>> CrawlDb statistics start: crawl/crawldb
>> Statistics for CrawlDb: crawl/crawldb
>> TOTAL urls: 586860
>> retry 0: 578159
>> retry 1: 1983
>> retry 2: 2017
>> retry 3: 4701
>> min score: 0.0
>> avg score: 0.0
>> max score: 1.0
>> status 1 (db_unfetched): 164849
>> status 2 (db_fetched): 417306
>> status 3 (db_gone): 4701
>> status 5 (db_redir_perm): 4
>> CrawlDb statistics: done
>>
>> Luca Rondanini
>>
>>
>
>
> --
> Doğacan Güney
>
--
Doğacan Güney