Yes, Lucene depends on consecutive docids. On the query side, the following things come to mind:
- For sorting, the FieldCache allocates arrays up to maxDoc().
- For deleted documents, it's a BitVector up to maxDoc().
- Some queries, like MatchAllDocsQuery, scan linearly through all docids up to maxDoc(), checking the deleted documents.
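To make the cost of gaps concrete, here is a minimal sketch of structures sized by maxDoc(). It is not Lucene code: the class DenseDocIdSketch and its fields are hypothetical stand-ins for a FieldCache array, the deleted-docs bit vector, and a match-all style scan.

```java
import java.util.BitSet;

// Hypothetical stand-in for per-index search state; not Lucene code.
class DenseDocIdSketch {
    final int maxDoc;        // one greater than the largest docid in use
    final int[] sortValues;  // like a FieldCache array: one slot per docid
    final BitSet deleted;    // like the deleted-docs bit vector: one bit per docid

    DenseDocIdSketch(int maxDoc) {
        this.maxDoc = maxDoc;
        this.sortValues = new int[maxDoc];
        this.deleted = new BitSet(maxDoc);
    }

    // Match-all style scan: visit every docid below maxDoc, skipping deletions.
    int countLiveDocs() {
        int live = 0;
        for (int doc = 0; doc < maxDoc; doc++) {
            if (!deleted.get(doc)) {
                live++;
            }
        }
        return live;
    }

    public static void main(String[] args) {
        DenseDocIdSketch index = new DenseDocIdSketch(5);
        index.deleted.set(2);
        System.out.println(index.countLiveDocs());
    }
}
```

With dense ids, every array slot and every loop iteration corresponds to a document that exists or once existed. If the application chose ids with large gaps, maxDoc would be driven by the largest id rather than by the number of documents, so both the arrays and the scans would grow with the size of the id space.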
Just add a field to every document that will act as the id. If you need more performance you could cache the mapping from external_id -> internal_id.

-Yonik
Now hiring -- http://tinyurl.com/7m67g

On 10/11/05, Shane O'Sullivan <[EMAIL PROTECTED]> wrote:
>
> Hi all,
>
> As far as I understand today, Lucene assigns docIDs to documents according
> to the order in which the documents are added to the index. Hence, docIDs
> are assigned by the engine in a sequential manner, without gaps. This order
> of document identifiers then determines the order of the postings in the
> postings lists, i.e. all postings lists are sorted by docID. It also means
> that the same document appearing in two different indices would probably not
> have the same docID (unless some extreme care was taken to insert documents
> in the same order).
>
> There are situations where the application wants to determine the docID for
> the index, i.e. to control the ordering of occurrences in the postings
> lists. This is useful to ensure, for example, that a document has a stable
> and consistent document identifier regardless of insertion order to an
> index.
>
> In either case, the application would want to pass into the index the
> numeric identifier of the document. However, such identifiers may not be
> sequential, i.e. it's possible that there would be a document with docID M
> without there being any document whose docID is M-1.
>
> Q1. How difficult would it be to change Lucene to accept the docIDs from the
> application, and not care about any possible gaps those ids may have?
> One possible problem is that since the Doc Ids could become very large, and
> are non-sequential, creating a single array for them all would not be
> feasible.
>
> Q2. Does Lucene's search code depend on the fact that document IDs are
> sequential?
>
> Thanks
>
> Shane
>
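For the suggestion of caching the mapping from external_id to internal_id, a rough sketch might look like the following. It is deliberately independent of the Lucene API: ExternalIdCache is a hypothetical class, and the storedIds array is an assumed stand-in for the values read from the id field, where storedIds[doc] holds the external id of internal docid doc.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical cache from application-assigned ids to internal docids; not Lucene code.
class ExternalIdCache {
    private final Map<Long, Integer> externalToInternal = new HashMap<>();

    // storedIds[doc] is assumed to be the external id stored in the id field
    // of the document whose internal docid is doc.
    ExternalIdCache(long[] storedIds) {
        for (int doc = 0; doc < storedIds.length; doc++) {
            externalToInternal.put(storedIds[doc], doc);
        }
    }

    // Returns the internal docid for an external id, or -1 if it is unknown.
    int internalId(long externalId) {
        Integer doc = externalToInternal.get(externalId);
        return doc == null ? -1 : doc;
    }

    public static void main(String[] args) {
        long[] storedIds = {9000000017L, 42L, 1000003L};
        ExternalIdCache cache = new ExternalIdCache(storedIds);
        System.out.println(cache.internalId(42L));
        System.out.println(cache.internalId(7L));
    }
}
```

The external ids can be as large and as sparse as the application likes, because only the map keys carry them; the index itself still sees dense internal docids. Since internal docids can change when segments are merged and deleted documents are removed, a cache like this would need to be rebuilt whenever the index is reopened.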
