Hi there,

After running a couple of test cycles of injecting, fetching and indexing, I have a few questions that I can't easily answer. I'd appreciate some help.

The segments/xxxx/fetcher* directories contain the data from the fetched pages. However, once fetching and indexing are done, do I still need to keep the fetchlist? My guess is "no"...

Then, I was wondering what the "dedup" operation really does. From reading the code in DeleteDuplicates.java (as far as I understand it ;-), it seems to me that only the Lucene index documents are being deleted, and not the source document data contained in fetcher*/data files in segments. If that's the case, then what is happening is that the duplicates are still there, they are just removed from the searchable Lucene index. Is this correct?

If so, would it make sense to also remove the page data referred to by the duplicate entries, to save space?

Another question, about merge: what is the purpose of this operation? I mean, I understand that you may want to merge several Lucene indexes from different segments into one huge index; but won't they still refer to the non-merged segment data? This data is needed to generate snippets and to present a cached view, so I need to keep it around anyway; but wouldn't it make more sense to merge this data too, if I decided to merge the Lucene indices?

--
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)




