Thanks Michael and Erick,
Yes, we have our own unique IDs for the docs - we use them for other
purposes, but here we need to deal with Lucene IDs, because they are
what matters for ParallelReader.
OK, then I will implement the algorithm for updating the small tag
index as I intended, and since sequential assignment is not guaranteed
by the API, I will take care to check that docIDs are still assigned
sequentially in each new Lucene release.
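Something like this sketch is the check I have in mind (against the
Lucene 2.x API; RAMDirectory and the field name are just for
illustration, and the usual org.apache.lucene.* imports are assumed):

    // Add docs to a fresh index and assert their docIDs come out 0, 1, 2, ...
    RAMDirectory dir = new RAMDirectory();
    IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(), true);
    for (int i = 0; i < 10; i++) {
        Document doc = new Document();
        doc.add(new Field("seq", Integer.toString(i),
                          Field.Store.YES, Field.Index.UN_TOKENIZED));
        writer.addDocument(doc);
    }
    writer.close();
    IndexReader reader = IndexReader.open(dir);
    for (int i = 0; i < reader.maxDoc(); i++) {
        if (!Integer.toString(i).equals(reader.document(i).get("seq"))) {
            throw new IllegalStateException("docIDs are no longer sequential!");
        }
    }
    reader.close();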
I will also take care of deletions: before starting the update process
for the small index I will first record which docs are deleted, then
undelete them all, and only then run the update algorithm. When it
finishes, if any deletions are present it will mean that some exception
occurred; the same holds if the number of docs is less than the initial
count (then a merge has collapsed the deletions away). In either case I
will have to restart the process. If everything is OK, I will delete
again the docs I recorded at the beginning.
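In code the guard would look roughly like this (a sketch against the
Lucene 2.x reader API; smallIndexDir is illustrative and
updateSmallIndex() is just a placeholder for my actual update step):

    // (assumes the usual java.util and org.apache.lucene.index imports)
    IndexReader reader = IndexReader.open(smallIndexDir);
    List<Integer> deleted = new ArrayList<Integer>();
    for (int i = 0; i < reader.maxDoc(); i++) {
        if (reader.isDeleted(i)) {
            deleted.add(Integer.valueOf(i));   // remember the deleted docIDs
        }
    }
    int initialMaxDoc = reader.maxDoc();
    reader.undeleteAll();                      // bring the deleted docs back first
    reader.close();

    updateSmallIndex();                        // placeholder: the actual update

    reader = IndexReader.open(smallIndexDir);
    if (reader.numDocs() != reader.maxDoc()        // deletions appeared -> exception
            || reader.maxDoc() < initialMaxDoc) {  // docs collapsed away in a merge
        reader.close();
        // restart the whole process
    } else {
        for (Integer docId : deleted) {
            // re-delete by the old docIDs; the checks above are what
            // keep those IDs still valid at this point
            reader.deleteDocument(docId.intValue());
        }
        reader.close();
    }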
Thanks Once Again,
Ivan
Michael McCandless wrote:
Yes, docIDs are currently sequentially assigned, starting with 0.
BUT: on hitting an exception (say in your analyzer) it will usually
use up a docID (and then immediately mark it as deleted).
Also, this behavior isn't "promised" in the API, ie it could in theory
(though I think it unlikely) change in a future release of Lucene.
And remember when a merge completes (or, optimize), any deleted docs
will "collapse down" all docIDs after them.
Mike
Ivan Vasilev wrote:
Hi Lucene Guys,
I have a question that is simple but important for me. I did not find
the answer in the javadoc, so I am asking here.
When adding Document-s via IndexWriter.addDocument(doc), do the
documents obtain Lucene IDs in the order in which they are added to the
IndexWriter? I mean, will the first added doc get Lucene ID 0, the
second Lucene ID 1, etc.?
Below I describe why I am asking this.
We plan to split our index into two separate indexes that will be read
through the ParallelReader class. This is because one of them will
contain field(s) that are both indexed and stored, and it will be
changed frequently. So, for the ParallelReader to always return correct
data when documents in the small index are changed, the Lucene IDs of
these docs have to remain the same.
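Roughly, the intended setup is this (a sketch; the directory variables
are just illustrative):

    // ParallelReader matches documents purely by Lucene docID, so docID N
    // in the small index must describe the same document as docID N in the
    // big one - which is why the IDs must stay aligned.
    ParallelReader parallel = new ParallelReader();
    parallel.add(IndexReader.open(bigIndexDir));    // large, rarely changed index
    parallel.add(IndexReader.open(smallIndexDir));  // small, frequently changed index
    IndexSearcher searcher = new IndexSearcher(parallel);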
To do this, Karl Wettin suggests a solution described in LUCENE-879
<https://issues.apache.org/jira/browse/LUCENE-879>. I do not like this
solution because it involves changing the Lucene source code, so after
each refactoring I will potentially have problems. It is also tied to
optimizing the index, so it will not be noticeably faster than the
approach I prefer, which is:
1. Read the whole index and reconstruct the documents, including the
indexed data, by using the TermDocs and TermEnum classes (see the
sketch after this list);
2. Change the needed documents;
3. Index the documents into a new index that will replace the initial one.
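A rough sketch of step 1 (Lucene 2.x API; oldIndexDir is illustrative).
Note that TermEnum walks terms in sorted order, so this recovers which
terms each doc contains, but not their original order - TermPositions
would be needed to restore positions:

    // (assumes the usual java.util and org.apache.lucene.index imports)
    IndexReader reader = IndexReader.open(oldIndexDir);
    // docID -> (field -> reconstructed text)
    Map<Integer, Map<String, StringBuilder>> docs =
        new HashMap<Integer, Map<String, StringBuilder>>();
    TermEnum terms = reader.terms();
    TermDocs termDocs = reader.termDocs();
    while (terms.next()) {
        Term term = terms.term();
        termDocs.seek(term);
        while (termDocs.next()) {
            Map<String, StringBuilder> fields = docs.get(termDocs.doc());
            if (fields == null) {
                fields = new HashMap<String, StringBuilder>();
                docs.put(termDocs.doc(), fields);
            }
            StringBuilder text = fields.get(term.field());
            if (text == null) {
                text = new StringBuilder();
                fields.put(term.field(), text);
            }
            for (int i = 0; i < termDocs.freq(); i++) {
                text.append(term.text()).append(' ');
            }
        }
    }
    // ...then build a Document per entry, apply the changes (step 2), and
    // addDocument() each one into a writer on the new index (step 3).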
I can even simplify this algorithm (and speed it up) if all the fields
are always stored - then I can read just the stored data, reconstruct
the content of the docs from it, and re-index them into the new index.
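The simplified variant would be just this (a sketch; the directory and
analyzer variables are illustrative, and changeIfNeeded() is a
placeholder for my modifications):

    IndexReader reader = IndexReader.open(oldIndexDir);
    IndexWriter writer = new IndexWriter(newIndexDir, analyzer, true);
    for (int i = 0; i < reader.maxDoc(); i++) {
        if (reader.isDeleted(i)) {
            continue;                       // skip deleted docs
        }
        Document doc = reader.document(i);  // rebuilt from stored fields only;
                                            // re-adding re-analyzes the stored text
        changeIfNeeded(doc);                // placeholder: apply the changes
        writer.addDocument(doc);            // docIDs are reassigned 0, 1, 2, ...
    }
    writer.close();
    reader.close();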
But in any case, everything in my approaches will depend on this: are
Lucene IDs assigned in the same order the docs are added to the
IndexWriter?
Thanks in Advance,
Ivan