Re: Unique doc ids

Terry Yang Tue, 22 Jan 2008 06:52:35 -0800

Hi Michael,
Your idea is good! But I have a question, and thanks for your help!


How do you plan to store a unique ID for each doc? My understanding is that
we would add a field (e.g. uniqueid) to each doc, and that field would have
the same single token value in every doc.
We can add the unique ID as a payload on that token before indexing. Then we
can use IndexReader.termPositions() to get all the unique IDs and doc IDs.
Can you explain more about how you would store the reverse UID -> ID mapping?
How do you guarantee that a UID maps to the correct dynamic ID? I mean, if a
doc ID is 5 and then for some reason changes to 60, would you still have
UID -> 5 stored in a file or in memory?
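
To make the question concrete, here is a minimal sketch in plain Java (no
Lucene classes) of what I have in mind. All names in it (UidMapSketch,
rebuild, uidFor, docIdFor) are made up for illustration and are not an
existing or proposed Lucene API. It assumes the reverse map is simply rebuilt
from the per-document UIDs whenever doc IDs get renumbered, e.g. after a merge
drops deleted docs. Whether the real design would work like that is exactly
what I am asking.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not Lucene code: keeps ID -> UID as an array indexed
// by doc ID and derives the reverse UID -> ID map from it.
final class UidMapSketch {
    private long[] uidByDocId = new long[0];
    private final Map<Long, Integer> docIdByUid = new HashMap<>();

    // Assumed to be called whenever doc IDs may have changed (e.g. a merge).
    void rebuild(long[] newUidByDocId) {
        uidByDocId = newUidByDocId.clone();
        docIdByUid.clear();
        for (int docId = 0; docId < uidByDocId.length; docId++) {
            docIdByUid.put(uidByDocId[docId], docId);
        }
    }

    // Forward lookup: dynamic doc ID -> stable UID.
    long uidFor(int docId) {
        return uidByDocId[docId];
    }

    // Reverse lookup: stable UID -> current doc ID, or -1 if the UID is unknown.
    int docIdFor(long uid) {
        Integer docId = docIdByUid.get(uid);
        return docId == null ? -1 : docId;
    }

    public static void main(String[] args) {
        UidMapSketch map = new UidMapSketch();
        map.rebuild(new long[] {100L, 101L, 102L});
        System.out.println(map.docIdFor(102L)); // doc ID before the merge

        // A merge drops the doc with UID 100, so the remaining docs shift down.
        map.rebuild(new long[] {101L, 102L});
        System.out.println(map.docIdFor(102L)); // doc ID after the merge
    }
}
```

If the reverse map were not rebuilt at that point, UID 102 would still point
at the old doc ID, which is the stale UID -> 5 case I describe above.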

On 1/22/08, Michael Busch <[EMAIL PROTECTED]> wrote:
> Hi Team,
>
> the question of how to delete with IndexWriter using doc ids is
> currently being discussed on java-user
> (http://www.gossamer-threads.com/lists/lucene/java-user/57228), so I
> thought this is a good time to mention an idea that I recently had. I'm
> planning to work on column-stored fields soon (I used to call them
> per-document payloads). Then we'll have the ability to store metadata
> for each document very efficiently in the index.
>
> This new data structure could be used to store a unique ID for each doc
> in the index. The IndexReader would then get an API that provides a
> mapping from the dynamic doc ids to the new unique ones. We would also
> have to store a reverse mapping (UID -> ID) in the index - we could use
> a VInt list + skip list for that.
>
> Then we should be able to make IndexReaders "read-only" (LUCENE-1030)
> and provide a new API in IndexWriter "delete by UID". This would allow
> "delete by query" as well. The disadvantage is that the index would
> become bigger, but that should still be ok: 8 bytes per doc for the
> ID->UID map (assuming we took long for the UID, which I'd suggest). The
> UID->ID map might even be a bit smaller initially (using VInts and
> VLongs), but might become bigger when the index has lots of deleted
> docs, because then the delta encoding wouldn't be as efficient anymore
> for the UIDs.
>
> If RAM permits, the maps could also be cached in memory (optional,
> configurable). The FieldCache overhaul (LUCENE-831) with column fields
> as source can help here.
>
> After all this is implemented (column fields, UIDs, "read-only"
> IndexReaders, FieldCache overhaul) I'd like to make the column fields
> (and norms) updateable via IndexWriter.
>
> OK, lots of food for thought.
>
> -Michael
>
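
The size argument in the quoted message (delta-encoded UIDs getting less
compact once deletes leave gaps) can be illustrated with a rough sketch. This
is plain Java, not Lucene code, and the class and method names are made up.
It assumes the usual variable-length scheme of 7 data bits per byte with the
high bit as a continuation flag, and it only counts the UID side of the
UID -> ID map. The doc IDs and the skip list are left out, since their layout
is not specified in this thread.

```java
// Hypothetical sketch: estimates the size of a sorted UID list stored as
// variable-length deltas. Not the actual index format.
final class VLongDeltaSketch {
    // Bytes needed for a non-negative value at 7 data bits per byte.
    static int vLongSize(long value) {
        int bytes = 1;
        while ((value & ~0x7FL) != 0) {
            value >>>= 7;
            bytes++;
        }
        return bytes;
    }

    // Total bytes when each UID is stored as the delta from the previous UID.
    static long deltaEncodedSize(long[] sortedUids) {
        long total = 0;
        long previous = 0;
        for (long uid : sortedUids) {
            total += vLongSize(uid - previous);
            previous = uid;
        }
        return total;
    }

    public static void main(String[] args) {
        long[] dense = {1, 2, 3, 4, 5};            // no gaps: every delta is 1
        long[] sparse = {1, 1000, 50000, 3000000}; // gaps left by deleted docs
        System.out.println(deltaEncodedSize(dense));  // 5 bytes for 5 UIDs
        System.out.println(deltaEncodedSize(sparse)); // 10 bytes for 4 UIDs
    }
}
```

With no gaps each UID costs one byte, well below the 8 bytes of a fixed long.
With large gaps the per-UID cost grows, which matches the concern about
indexes with many deleted docs.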
