That depends on your situation and on what exactly you're trying to 
accomplish with Nutch.
 
In my case the goal is to produce the largest possible index, yet still 
keep it updated. I'm basically fetching 1-2 million URL segments and then 
merging them together, so I will always require the segment data (other than 
crawl_generate, which can be safely deleted after the fetch is done).
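
As a rough sketch of that cleanup in Python: this assumes every subdirectory 
of the segments directory is one segment, and it treats the presence of a 
crawl_fetch directory as the sign that the fetch has finished. That check, 
the function name and the dry_run switch are my own assumptions for 
illustration, not something Nutch provides.

```python
import shutil
from pathlib import Path


def remove_crawl_generate(segments_dir, dry_run=True):
    """Delete crawl_generate from every segment that looks fetched.

    Assumption: a segment counts as fetched once it contains a
    crawl_fetch directory. Returns the list of paths selected.
    """
    selected = []
    for segment in sorted(Path(segments_dir).iterdir()):
        target = segment / "crawl_generate"
        fetched = (segment / "crawl_fetch").is_dir()
        if segment.is_dir() and target.is_dir() and fetched:
            selected.append(target)
            # Only touch the disk when the caller asks for it.
            if not dry_run:
                shutil.rmtree(target)
    return selected
```

Running it with dry_run=True first only lists what would be removed, which 
is a cheap way to check the selection before deleting anything.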
 
If you only need, for example, 3 million documents in an index and you don't 
really care what they are, then you could generate a brand new 3 million URL 
segment every time, fetch it, run the database functions and index it, without 
caring about re-indexing the previous segment, since you did everything 
in one operation.


----- Original Message ----
From: Ledio Ago <[EMAIL PROTECTED]>
To: [email protected]
Sent: Friday, January 19, 2007 1:36:57 PM
Subject: RE: Reduce segment size


Quick question:

It won't affect re-crawling, as that depends on the Nutch DB, but it
will prevent you from re-indexing the data that was deleted, as
re-indexing needs those files.
> Why would I want to reindex entries that I've deleted?

I have never tried running Nutch with "just" the index file. It might
work or it might not, but it's something to test (move the other
directories out of the segment directory, but don't delete them).
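
A minimal Python sketch of that test, under the assumption that moving the 
directories to a holding location on the same machine is enough to hide them 
from the searcher. The function names and the holding directory are 
hypothetical; nothing here calls Nutch itself.

```python
import shutil
from pathlib import Path


def set_aside(segment_dir, names, holding_dir):
    """Move the named subdirectories out of a segment, without deleting.

    Returns a dict mapping each original path to its new location so
    the move can be undone with restore().
    """
    segment = Path(segment_dir)
    holding = Path(holding_dir) / segment.name
    holding.mkdir(parents=True, exist_ok=True)
    moved = {}
    for name in names:
        source = segment / name
        if not source.is_dir():
            continue  # nothing to move for this name
        destination = holding / name
        if destination.exists():
            raise FileExistsError(f"{destination} already exists")
        shutil.move(str(source), str(destination))
        moved[source] = destination
    return moved


def restore(moved):
    """Put every directory recorded by set_aside() back in place."""
    for source, destination in moved.items():
        shutil.move(str(destination), str(source))
```

The idea would be to call set_aside() on the segment with, say, ["content"], 
try a few searches, and call restore() with the returned mapping if anything 
breaks.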


----- Original Message ----
From: Ledio Ago <[EMAIL PROTECTED]>
To: [email protected]
Sent: Thursday, January 18, 2007 8:57:15 PM
Subject: Reduce segment size


Hi there!

After a crawl/index cycle a segment directory is created, which usually
contains directories such as content, index, and so on.
Here is what my current segment directory actually contains after a
crawl/index build of 2 million URLs:

/segments/20070114151631> du -sh *
9.6G    content
212M    fetcher
5.0G    index
0       index.done
5.8G    parse_data
3.7G    parse_text

The segment directory is copied to a searcher.  As you can see the
content directory is huge.

My question is, if you just remove this directory, would that affect the
search capability, or later the recrawling and reindexing?
The content directory is so big. Is there a way to avoid copying
that directory to the searcher?

Thanks,
Ledio




Ledio Ago * Sr. Software Engineer * [EMAIL PROTECTED]

w: 415-348-7693 * f: 415-348-7032



LookSmart - Where To Look For What You Need. - Find. Save. Share.

625 Second Street, San Francisco, CA 94107
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
