You need to delete the old index before you re-index when working within the
same directory structure.
This is the procedure I follow, which is pretty much what you're doing. It
assumes you already have at least one active segment and index. Edit as needed.
bin/nutch generate crawl/crawldb crawl/segments -topN 1000000
bin/nutch fetch $$
bin/nutch updatedb crawl/crawldb $$
bin/nutch invertlinks crawl/linkdb $$
bin/nutch mergesegs crawl/segments/merged -dir crawl/segments
rm -fdr crawl/indexes/
rm -fdr crawl/segments/2*
mv crawl/segments/merged/2* crawl/segments/
rm -fdr crawl/segments/merged/
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb $$
bin/nutch dedup crawl/indexes
$$ = your current segment. Note that after the merge takes place, it will be a
newly created directory, so the index step has to point at that new one.
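If you would rather not substitute $$ by hand each time, the steps above could
be wrapped in a small script along these lines. Treat it as a rough sketch and
not something tested against your setup: the NUTCH and CRAWL variables and the
latest_segment and recrawl function names are my own inventions, and it assumes
segment directories are named by timestamp (so the newest one sorts last). The
nutch commands themselves are the same ones listed above, with rm -rf standing
in for rm -fdr.

```shell
#!/bin/sh
# Sketch only: NUTCH and CRAWL are hypothetical variable names, not Nutch settings.
NUTCH=${NUTCH:-bin/nutch}
CRAWL=${CRAWL:-crawl}

# Print the newest segment directory under $1. Segment names are assumed to be
# timestamps, so the lexical order from ls matches chronological order.
latest_segment() {
    ls -d "$1"/2* 2>/dev/null | tail -n 1
}

# One generate/fetch/merge/re-index round, stopping at the first failed step.
recrawl() {
    "$NUTCH" generate "$CRAWL/crawldb" "$CRAWL/segments" -topN 1000000 || return 1
    seg=$(latest_segment "$CRAWL/segments")
    "$NUTCH" fetch "$seg" || return 1
    "$NUTCH" updatedb "$CRAWL/crawldb" "$seg" || return 1
    "$NUTCH" invertlinks "$CRAWL/linkdb" "$seg" || return 1
    "$NUTCH" mergesegs "$CRAWL/segments/merged" -dir "$CRAWL/segments" || return 1

    # Drop the old index and the pre-merge segments, then move the merged
    # segment into place.
    rm -rf "$CRAWL/indexes" "$CRAWL"/segments/2*
    mv "$CRAWL"/segments/merged/2* "$CRAWL/segments/" || return 1
    rm -rf "$CRAWL/segments/merged"

    # After the merge, the current segment is the newly created directory.
    seg=$(latest_segment "$CRAWL/segments")
    "$NUTCH" index "$CRAWL/indexes" "$CRAWL/crawldb" "$CRAWL/linkdb" "$seg" || return 1
    "$NUTCH" dedup "$CRAWL/indexes"
}
```

Source the file from your Nutch install directory and call recrawl once per
fetch round. Because each step returns on failure, a bad fetch should stop the
run before the old index and segments are removed.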
----- Original Message ----
From: Justin Hartman <[EMAIL PROTECTED]>
To: [email protected]
Sent: Tuesday, January 2, 2007 4:18:36 AM
Subject: Re: fetcher : some doubts
On 1/2/07, Sean Dean <[EMAIL PROTECTED]> wrote:
> There actually isn't much of a reason to generate "huge" multi-million page
> fetch lists when you can create lots of smaller ones and merge them together.
> This allows for more of a ladder-style approach, and in some cases reduces
> the risk of errors in terms of Hadoop versions (0.8+) with large
> unrecoverable fetches or failed parse-reduce stages.
The problem I am faced with is I'm not sure how to merge my indexes
together. For example I run a fetch of about 200,000 pages in about 3
or 4 different fetches. Once done I run the index command and all goes
very well and my index is built.
That said, if I run a new fetch and then try to index it, I get an error
saying "crawl/indexes" already exists.
How does one actually merge different fetches to the same index
without having to recreate the index each time?
Thanks!
Justin