I'm glad you got the slowness issue straightened out.
When you import the DMOZ URLs into your Nutch DB, the "-subset" option
isn't really meant to limit the size of your fetch lists, and that
becomes even more apparent once you start re-fetching. You can actually
skip the subset step and let all of the URLs go in, unless you have your
own custom filtering method/requirement.
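For reference, this is roughly what the inject step looks like without
any subset (going from memory of the old tutorials, so treat the exact
paths as placeholders; content.rdf.u8 is the DMOZ dump and dmoz/urls is
just a scratch file):
(Nutch 0.7) bin/nutch inject db -dmozfile content.rdf.u8
(Nutch 0.8+) bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 > dmoz/urls
             bin/nutch inject crawl/crawldb dmoz/urls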
You should use the "-topN" option instead when you generate your
segment. This will create a segment with an exact number of URLs. Below
are examples of creating a segment with 1 million URLs to fetch, one for
each Nutch architecture:
(Nutch 0.7) bin/nutch generate db segments -topN 1000000
(Nutch 0.8+) bin/nutch generate crawl/crawldb crawl/segments -topN 1000000
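After that, the rest of the cycle is the usual fetch and updatedb pair,
then another generate for the next batch. A rough sketch for 0.8+ (the
crawl/ paths are just the conventional layout from the tutorial, and s1
simply picks up whatever segment name generate produced):
s1=`ls -d crawl/segments/* | tail -1`
bin/nutch fetch $s1
bin/nutch updatedb crawl/crawldb $s1
bin/nutch generate crawl/crawldb crawl/segments -topN 1000000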
----- Original Message ----
From: shrinivas patwardhan <[EMAIL PROTECTED]>
To: [email protected]
Sent: Tuesday, January 2, 2007 4:25:13 AM
Subject: Re: fetcher : some doubts
Thank you, Sean Dean. That sounds good; I will try it out.
Tell me if I am right: in the case of a DMOZ index file injected into
the db, I then generate only a few segments by using -subset, fetch
them, and then go on and generate the next set of segments. I hope I am
heading the right way.
And for the previous problem of the searching being slow: it wasn't my
hardware, but my segments were corrupt. I fixed them and the search
runs fine now.
Thanks & Regards
Shrinivas Patwardhan