I'm glad you got the slowness issue straightened out.
When you import the DMOZ URLs into your Nutch DB, the "-subset" option
isn't really meant to limit the size of your fetch lists, and that
becomes even more apparent once you start re-fetching. You can actually
skip the subset step and let all of the URLs go in, unless you have your
own custom filtering method/requirement.
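For reference, this is roughly what the inject step looks like without
any subset (going from memory of the old tutorials, so treat the exact
paths as placeholders; content.rdf.u8 is the DMOZ dump and dmoz/urls is
just a scratch file):
(Nutch 0.7) bin/nutch inject db -dmozfile content.rdf.u8
(Nutch 0.8+) bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 > dmoz/urls
             bin/nutch inject crawl/crawldb dmoz/urls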
You should use the "-topN" option instead when you generate your
segment. This will create a segment with an exact number of URLs. Below
are examples of creating a segment with 1 million URLs to fetch, one for
each Nutch architecture:
(Nutch 0.7) bin/nutch generate db segments -topN 1000000
(Nutch 0.8+) bin/nutch generate crawl/crawldb crawl/segments -topN 1000000
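After that, the rest of the cycle is the usual fetch and updatedb pair,
then another generate for the next batch. A rough sketch for 0.8+ (the
crawl/ paths are just the conventional layout from the tutorial, and s1
simply picks up whatever segment name generate produced):
s1=`ls -d crawl/segments/* | tail -1`
bin/nutch fetch $s1
bin/nutch updatedb crawl/crawldb $s1
bin/nutch generate crawl/crawldb crawl/segments -topN 1000000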
----- Original Message ----
From: shrinivas patwardhan <[EMAIL PROTECTED]>
To: [email protected]
Sent: Tuesday, January 2, 2007 4:25:13 AM
Subject: Re: fetcher : some doubts
Thank you, Sean Dean. That sounds good; I will try it out.
Tell me if I am right: in the case of a DMOZ index file injected into
the db, I then generate only a few segments by using -subset, fetch
them, and then go on and generate the next set of segments. I hope I am
heading the right way.
And for the previous problem of the searching being slow: it wasn't my
hardware, but my segments were corrupt. I fixed them and the search
runs fine now.
Thanks & Regards
Shrinivas Patwardhan