You will want to have one Lucene field which contains this composite key - it could be the un-tokenized concatenation of all of the subkeys, for example - and then one Term would hold the full composite key, and the updateDocument technique would work fine.
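Something like this, for example (an untested sketch - "pk", "id1", and "id2" are made-up names, and the separator just needs to be a character that can never occur inside your key values):

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    void indexRow(IndexWriter writer, String id1, String id2, Document doc)
            throws IOException {
        // Concatenate the subkeys into one composite key.
        String pk = id1 + "\u0000" + id2;

        // NOT_ANALYZED, so the whole composite key is indexed as a single term.
        doc.add(new Field("pk", pk, Field.Store.NO, Field.Index.NOT_ANALYZED));

        // Deletes every document whose "pk" term equals this composite key
        // (including ones added earlier in the same uncommitted batch), then
        // adds the new document.
        writer.updateDocument(new Term("pk", pk), doc);
    }

Because the writer buffers the deletes itself, later rows in the same batch replace earlier rows with the same key, without any pre-checking pass or intermediate commits.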
-jake

On Mon, Nov 16, 2009 at 11:09 AM, java8964 java8964 <java8...@hotmail.com> wrote:

> But can IndexWriter.updateDocument(Term, Document) handle the composite key
> case?
>
> If my primary key contains field1 and field2, can I use one Term to include
> both field1 and field2?
>
> Thanks
>
> > Date: Mon, 16 Nov 2009 09:44:35 -0800
> > Subject: Re: What is the best way to handle the primary key case during
> > lucene indexing
> > From: jake.man...@gmail.com
> > To: java-user@lucene.apache.org
> >
> > The usual way to do this is to use:
> >
> >     IndexWriter.updateDocument(Term, Document)
> >
> > This method deletes all documents containing the given Term (this would be
> > your primary key), and then adds the Document you want to add. This is the
> > traditional way to do updates, and it is fast.
> >
> > -jake
> >
> > On Mon, Nov 16, 2009 at 9:15 AM, java8964 java8964 <java8...@hotmail.com> wrote:
> >
> > > Hi,
> > >
> > > In our application, we allow the user to define a primary key for the
> > > document. We are using Lucene 2.9. When we index the data coming from
> > > the client, if the metadata defines a primary key, we have to do a
> > > search/update for every row based on that key.
> > >
> > > Here are our current problems:
> > >
> > > 1) If the metadata coming from the client defines a primary key (which
> > > can contain one or more fields), then for the data supplied by the
> > > client we have to make sure that a later row overrides any previous row
> > > with the same primary key.
> > > 2) To do that, we first have to loop through the data to check whether
> > > any later row shares a PK with an earlier one, building a map in memory
> > > so that the latest row overrides the previous ones. This is a very
> > > expensive operation.
> > > 3) Even then, for every row that survives the filtering step, we still
> > > have to search the current index to see whether any data with the same
> > > PK exists, so we have to do a remove before we add the new data to the
> > > index.
> > >
> > > I want to know whether anyone has had a PK requirement like this with
> > > Lucene. What is the best way to index data in this case?
> > >
> > > First, is it possible to remove step 2 above? The problem with Lucene
> > > is that when we add a document to the index, we cannot search for it
> > > before we commit, and we only commit once, when the whole data file is
> > > finished. So we have to loop through the data once to check whether any
> > > rows in the file share the same PK. I am wondering whether the index
> > > writer, before it commits anything, could merge on the PK as new
> > > documents are added: if a document with the same PK has already been
> > > added, just remove it and let the newly added document take its place.
> > > If we can do this, the whole pre-checking step can be removed.
> > >
> > > Second, for step 3, if searching the existing index is unavoidable,
> > > what is the fastest way to search by the PK? Of course we have already
> > > indexed all the PK fields. When we add new data, we have to search every
> > > row of the existing index by the PK fields to see whether the key
> > > exists. If it does, we remove it and add the new one.
> > > We construct the query from the PK fields at run time, then search row
> > > by row. This is also very bad for indexing performance.
> > >
> > > Here is what I am thinking:
> > > 1) Can I use IndexReader.terms(Term)? I heard it is much faster than a
> > > query search. Is that right?
> > > 2) Currently we do the search row by row. Should I do it in batches?
> > > For example, I could combine 100 PK searches into one search using a
> > > BooleanQuery, so one search gives me back all the documents among those
> > > 100 PKs that are in the index. Then I can remove them from the index
> > > using the result set. In that case I would only need 1/100 of the search
> > > requests, which in theory should be much faster than row by row.
> > >
> > > Please let me know any feedback. If you have ever dealt with PK data
> > > support, please share some thoughts and experience.
> > >
> > > Thanks for your kind help.
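PS, on the two ideas at the end of your mail: yes, a raw term lookup via
IndexReader.termDocs() is cheaper than running a full Query for a single-term
primary-key check. A rough sketch (untested; assumes the composite key is
indexed un-tokenized in a "pk" field, as in the example above):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;

    // Returns true if some document with this composite key is in the index.
    static boolean pkExists(IndexReader reader, String pk) throws IOException {
        TermDocs td = reader.termDocs(new Term("pk", pk));
        try {
            return td.next();
        } finally {
            td.close();
        }
    }

For batched deletes, IndexWriter.deleteDocuments(Term[]) takes an array of
terms in one call, so you would not need to build a BooleanQuery by hand. But
note that with updateDocument the whole search-and-remove step disappears, so
you may not need either approach.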