Andrzej Bialecki wrote:
Then, I was wondering what the "dedup" operation really does. From reading the code in DeleteDuplicates.java (as far as I understand it ;-), it seems to me that only the Lucene index documents are being deleted, and not the source document data contained in fetcher*/data files in segments. If that's the case, then what is happening is that the duplicates are still there, they are just removed from the searchable Lucene index. Is this correct?

Yes, that's correct. Deduplication just marks things deleted in the Lucene indexes.
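
As a rough illustration of what "marks things deleted" means, here is a toy Java model. It is not the actual DeleteDuplicates code: the names (DedupSketch, IndexDoc, contentHash) are hypothetical, and treating pages with the same content hash as duplicates is an assumption. The point is only that a duplicate's index entry gets flagged while the segment data is never touched.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: these names are hypothetical, not Nutch or Lucene APIs.
class DedupSketch {
    // One entry in a pretend search index.
    static final class IndexDoc {
        final String url;
        final String contentHash; // assumption: duplicates share a content hash
        boolean deleted;          // set by dedup; the entry is not physically removed

        IndexDoc(String url, String contentHash) {
            this.url = url;
            this.contentHash = contentHash;
        }
    }

    // Flags every later document whose hash was already seen.
    // Returns the number of entries flagged in this pass.
    static int markDuplicates(List<IndexDoc> index) {
        Set<String> seen = new HashSet<>();
        int marked = 0;
        for (IndexDoc doc : index) {
            if (doc.deleted) {
                continue; // already flagged earlier
            }
            if (!seen.add(doc.contentHash)) {
                doc.deleted = true;
                marked++;
            }
        }
        return marked;
    }

    public static void main(String[] args) {
        List<IndexDoc> index = new ArrayList<>();
        index.add(new IndexDoc("http://a.example/", "h1"));
        index.add(new IndexDoc("http://b.example/", "h1")); // same content as the first
        index.add(new IndexDoc("http://c.example/", "h2"));
        int marked = markDuplicates(index);
        // The index still holds every entry; duplicates are merely flagged.
        System.out.println(marked + " of " + index.size() + " entries marked deleted");
    }
}
```

In this model the search side would simply skip entries with the deleted flag set; nothing here reads or rewrites the fetcher data.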


If so, would it make sense to also remove the page data referred to by the duplicate entries (to save space)?

It might save some disk space to do that, but usually not a huge amount, perhaps 20%, and it wouldn't make anything faster. This data is always assumed to be too big to fit in memory. Performance should be good even if access to it always requires a disk seek. Disk space is very cheap, so there's little incentive to highly optimize this. Also, creating a de-duplicated copy of all of the segment data could temporarily require nearly twice the disk space, which might be a problem.
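
To put rough numbers on that trade-off, here is a back-of-the-envelope sketch. The 20% duplicate share is the estimate above; the 100 GB segment size and the method names are made up for illustration.

```java
// Back-of-the-envelope arithmetic; sizes and names are hypothetical.
class DiskSpaceSketch {
    // Space freed once the de-duplicated copy has replaced the original.
    static double savedGb(double segmentGb, double duplicateFraction) {
        return segmentGb * duplicateFraction;
    }

    // Peak usage while the original and the de-duplicated copy both exist.
    static double peakGb(double segmentGb, double duplicateFraction) {
        return segmentGb + segmentGb * (1.0 - duplicateFraction);
    }

    public static void main(String[] args) {
        double segmentGb = 100.0;  // hypothetical segment data size
        double dupFraction = 0.20; // the "perhaps 20%" estimate
        System.out.printf("saved: %.0f GB, peak during copy: %.0f GB%n",
                savedGb(segmentGb, dupFraction),
                peakGb(segmentGb, dupFraction));
    }
}
```

With these assumed figures the copy frees about 20 GB but needs about 180 GB while it runs, which is the "nearly twice" mentioned above.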


For best performance, index data must fit in memory, so there is an incentive to make it as small as possible. Performing duplicate elimination on the segment indexes and then merging them into larger indexes achieves this.

Another question, about merge: what is the purpose of this operation? I mean, I understand that you may want to merge several Lucene indexes from different segments into one huge index; but won't they still refer to the non-merged segment data? This data is needed to generate snippets and to present a cached view, so I need to keep it around anyway; but wouldn't it make more sense to merge this data too, if I decided to merge the Lucene indices?

Even when indexes are merged, the search code still reads other data from the non-merged segment directories. As mentioned above, there's not a lot of benefit in merging these.
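
A toy sketch of why the segment directories have to stay around after a merge: in this model each index entry carries a pointer (segment name plus document number) back to the segment that holds the page data, and merging only copies the entries that are not marked deleted. The names Entry, merge, and fetchContent, and the pointer layout itself, are assumptions for illustration, not the actual Nutch structures.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: names and layout are hypothetical.
class MergeSketch {
    // An index entry that points back into a segment's page data.
    static final class Entry {
        final String segment;
        final int docNo;
        final boolean deleted;

        Entry(String segment, int docNo, boolean deleted) {
            this.segment = segment;
            this.docNo = docNo;
            this.deleted = deleted;
        }
    }

    // Copies live entries from all indexes into one list.
    // The segment pointers are carried over unchanged.
    static List<Entry> merge(List<List<Entry>> indexes) {
        List<Entry> merged = new ArrayList<>();
        for (List<Entry> index : indexes) {
            for (Entry e : index) {
                if (!e.deleted) {
                    merged.add(e);
                }
            }
        }
        return merged;
    }

    // Snippets and cached pages come from per-segment data, found via the pointer.
    static String fetchContent(Map<String, List<String>> segments, Entry e) {
        return segments.get(e.segment).get(e.docNo);
    }

    public static void main(String[] args) {
        Map<String, List<String>> segments = new HashMap<>();
        segments.put("segA", List.of("page A0", "page A1"));
        segments.put("segB", List.of("page B0"));

        List<Entry> indexA = List.of(new Entry("segA", 0, false), new Entry("segA", 1, true));
        List<Entry> indexB = List.of(new Entry("segB", 0, false));

        for (Entry e : merge(List.of(indexA, indexB))) {
            System.out.println(e.segment + "/" + e.docNo + " -> " + fetchContent(segments, e));
        }
    }
}
```

The merged list is smaller than the two inputs combined, yet every lookup in it still goes through the original, unmerged segments.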


Doug




_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
