One way to do this (depending on your system and index size) is to delete
and re-add every URL you find: for each document, first delete any existing
document with the same url term, then add it again.  That ensures every
document in the index is unique, with no need to worry about sorting,
iteration, doc ids, and the like.

It rebuilds your entire index, but if you have a duplication issue that
needs to be addressed, it's worth it.
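
If you're on a reasonably recent Lucene release, IndexWriter.updateDocument
does the delete-then-add in a single call.  Here's a rough sketch of what I
mean (untested; the index path, analyzer choice, the "contents" field, and
the example URL are just placeholders, only the "url" field comes from your
setup):

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;

public class ReindexUnique {
    public static void main(String[] args) throws Exception {
        try (FSDirectory dir = FSDirectory.open(Paths.get("/path/to/index"));
             IndexWriter writer = new IndexWriter(dir,
                     new IndexWriterConfig(new StandardAnalyzer()))) {

            // Build each document the same way you did when you first
            // indexed it.
            Document doc = new Document();
            doc.add(new StringField("url", "http://example.com/page",
                    Field.Store.YES));
            doc.add(new TextField("contents", "page text here",
                    Field.Store.NO));

            // updateDocument deletes every document whose "url" term
            // matches, then adds the new one, so each url appears at
            // most once in the index.
            writer.updateDocument(new Term("url", "http://example.com/page"),
                    doc);
            writer.commit();
        }
    }
}

On older Lucene versions you can get the same effect by calling
deleteDocuments on the url term yourself before each addDocument.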

Hope this helps.

-- j

On 1/28/06, gekkokid <[EMAIL PROTECTED]> wrote:
>
> Hi, I'm trying to delete duplicate documents from my index. The unique
> identifier is the document's URL (i.e. the field "url").
>
> My initial thought of how to accomplish this is to open the index via a
> reader, sort the documents by their url, and then iterate through them,
> comparing each document with the previous one; if the urls match, I would
> delete the current document.
>
> What other methods that are not too taxing could I try?
>
> How could I sort the documents by url internally? Which classes should I
> be looking at to do this?
>
>
> Thanks,
> _gk
>
