Unique doc ids

Michael Busch Tue, 22 Jan 2008 03:09:40 -0800

Hi Team,

The question of how to delete with IndexWriter using doc ids is
currently being discussed on java-user
(http://www.gossamer-threads.com/lists/lucene/java-user/57228), so I
thought this is a good time to mention an idea that I recently had. I'm
planning to work on column-stored fields soon (I used to call them
per-document payloads). Then we'll have the ability to store metadata
for each document very efficiently in the index.


This new data structure could be used to store a unique ID for each doc
in the index. The IndexReader would then get an API that provides a
mapping from the dynamic doc ids to the new unique ones. We would also
have to store a reverse mapping (UID -> ID) in the index - we could use
a VInt list + skip list for that.
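
To make the two mappings concrete, here is a minimal in-memory sketch in
plain Java. It is not a Lucene API: the class name UidMap and its methods
are hypothetical, and it assumes that UIDs are handed out in increasing
order and that merges preserve document order, so the UIDs are sorted when
listed in doc id order. Under that assumption the position of a UID in the
sorted list is its current doc id, and a binary search stands in for the
on-disk skip list.

```java
import java.util.Arrays;

/** Hypothetical sketch of the ID -> UID and UID -> ID mappings; not a Lucene API. */
class UidMap {
    // uids[docId] is the unique ID of the document currently at docId.
    // Assumption: the array is strictly increasing (see the note above).
    private final long[] uids;

    UidMap(long[] uidsInDocIdOrder) {
        for (int i = 1; i < uidsInDocIdOrder.length; i++) {
            if (uidsInDocIdOrder[i] <= uidsInDocIdOrder[i - 1]) {
                throw new IllegalArgumentException("UIDs must be strictly increasing");
            }
        }
        this.uids = uidsInDocIdOrder.clone();
    }

    /** ID -> UID: a plain array lookup. */
    public long uidOf(int docId) {
        return uids[docId];
    }

    /** UID -> ID: binary search over the sorted UIDs; returns -1 if the UID is absent. */
    public int docIdOf(long uid) {
        int pos = Arrays.binarySearch(uids, uid);
        return pos >= 0 ? pos : -1;
    }

    public static void main(String[] args) {
        UidMap map = new UidMap(new long[] {10L, 11L, 15L, 42L});
        System.out.println(map.uidOf(2));     // prints 15
        System.out.println(map.docIdOf(42L)); // prints 3
        System.out.println(map.docIdOf(12L)); // prints -1 (no such UID, e.g. a deleted doc)
    }
}
```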

Then we should be able to make IndexReaders "read-only" (LUCENE-1030)
and provide a new API in IndexWriter "delete by UID". This would also
make "delete by query" possible. The disadvantage is that the index would
become bigger, but that should still be ok: 8 bytes per doc for the
ID->UID map (assuming we took long for the UID, which I'd suggest). The
UID->ID map might even be a bit smaller initially (using VInts and
VLongs), but might become bigger when the index has lots of deleted
docs, because then the delta encoding wouldn't be as efficient anymore
for the UIDs.
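
The size argument can be checked with a rough back-of-the-envelope sketch.
The class UidDeltaSize and its methods are hypothetical; it only counts
bytes for a 7-bits-per-byte variable-length encoding of the gaps between
sorted UIDs, and ignores the skip list and any other overhead. The gap of
1000 between surviving UIDs is an arbitrary illustration of an index with
many deletes, not a measurement.

```java
/** Hypothetical size estimate for a delta-encoded UID list; not a Lucene API. */
class UidDeltaSize {

    /** Bytes needed for a non-negative value when each byte carries 7 payload bits. */
    public static int vLongBytes(long value) {
        int bytes = 1;
        while ((value & ~0x7FL) != 0) {
            value >>>= 7;
            bytes++;
        }
        return bytes;
    }

    /** Total bytes to store sorted UIDs as gaps from the previous UID. */
    public static long deltaEncodedBytes(long[] sortedUids) {
        long total = 0;
        long previous = 0;
        for (long uid : sortedUids) {
            total += vLongBytes(uid - previous);
            previous = uid;
        }
        return total;
    }

    public static void main(String[] args) {
        int docs = 1000;
        long[] dense = new long[docs];   // no deletes: UIDs 1, 2, 3, ...
        long[] sparse = new long[docs];  // many deletes: surviving UIDs 1000 apart
        for (int i = 0; i < docs; i++) {
            dense[i] = i + 1;
            sparse[i] = (i + 1) * 1000L;
        }
        System.out.println(8L * docs);                 // fixed 8-byte longs: prints 8000
        System.out.println(deltaEncodedBytes(dense));  // prints 1000
        System.out.println(deltaEncodedBytes(sparse)); // prints 2000
    }
}
```

In this toy case the delta-encoded list doubles in size once the gaps no
longer fit in a single byte, while the fixed-width ID->UID array stays at
8 bytes per doc.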

If RAM permits, the maps could also be cached in memory (optional,
configurable). The FieldCache overhaul (LUCENE-831) with column fields
as source can help here.

After all this is implemented (column fields, UIDs, "read-only"
IndexReaders, FieldCache overhaul) I'd like to make the column fields
(and norms) updateable via IndexWriter.

OK, lots of food for thought.

-Michael
