Have you tried comparing TermVectors ?
I would expect them, or an adjustment of them, to allow comparison to
focus on "important terms" (e.g. about a 100-200 terms) and then allow
a more reasonable computation.
paul
Le 12 juin 05, à 16:37, Dave Kor a écrit :
Hi,
I would like to poll the community's opinion on good strategies for
identifying
duplicate documents in a lucene index.
You see, I have an index containing roughly 25 million lucene
documents. My task
requires me to work at sentence level so each lucene document actually
contains
exactly one sentence. The issue I have right now is that sometimes,
certain
sentences are duplicated and I'ld like to be able to identify them as
a BitSet
so that I can filter away these duplicates in my search.
Obviously the brute force method of pairwise compares would take
forever. I have
tried grouping sentences using their hashCodes() and then do a
pairwise compare
between sentences that has the same hashCode, but even with a 1GB heap
I ran
out of memory after comparing 200k sentences.
Any other ideas?
Regards
Dave Kor.
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]