Sorry, forgot to add "then re-add the documents in question". On Mon, Nov 16, 2009 at 12:45 PM, Erick Erickson <erickerick...@gmail.com>wrote:
> What is the form of the unique key? I'm a bit confused here by your > comment: > "which can contain one or multi fields". > > But it seems like IndexWriter.deleteDocuments should work here. It's easy > if your PKs are single terms, there's even a deleteDocuments(Term[]) form. > But this really *requires* that your PKs are single terms in a field. If > your PKs > are some sort of composite field, perhaps the iw.DeleteDocuments(Query[]) > would help where each query is enough to uniquely identify your document. > > Best > Erick > > > On Mon, Nov 16, 2009 at 12:15 PM, java8964 java8964 > <java8...@hotmail.com>wrote: > >> >> Hi, >> >> In our application, we will allow the user to create a primary key defined >> in the document. We are using lucene 2.9. >> In this case, when we index the data coming from the client, if the >> metadata contains the primary key defined, >> we have to do the search/update for every row based on the primary key. >> >> Here is our current problems: >> >> 1) If the meta data coming from client defined a primary key (which can >> contain one or multi fields), >> then for the data supplied from the client, we have to make sure that >> later row will override the previous row, if they have the same primary key >> as the data. >> 2) To do the above, we have to loop through the data first, to check if >> any later rows containing the same PK as the previous rows, so we will build >> the MAP in the memory to override the previous one by the latest ones. >> This is a very expensive operation. >> 3) Even in this case, for every row after the above filter steps, we still >> have to search the current index to see if any data with the same PK exist >> or not. So we have to do the remove before we add the new data in the index. >> >> I want to know if anyone has the same requirement like this PK using the >> lucene? What is the best way to index data in this case? >> >> First, I am thinking if it is possible to remove the above step2? >> the problem for the lucene is that when we add document in the index, we >> can NOT search it before commit it. >> But we only commit once when the whole data file is finished. So we have >> to loop through the data once to check to see if any data sharing the same >> PK in the data file. >> I am wondering if there is a way in the index writer, before it commits >> anything, when we add the new document into it, it can do the merging of the >> PK data? What I mean is that if the same PK data already exist in any >> previous added document, just remove it and let the new added data >> containing the same PK data take the place? If we can do this, then the >> whole pre checking data step can be removed. >> >> Second, for the above step 3, if the searching the existing index is NOT >> avoidable, what is the fast way to search by the PK? Of course we already >> indexed all the PK fields. When we add new data, we have to search every row >> of existing index by the PK fields, to see if it exist or not. If it does, >> remove it and add the new one. >> We constructor the query by the PK fields at run time, then search it row >> by row. This is also very bad as the indexing the data for performance. >> >> Here is what I am thinking? >> 1) Can I use the Indexreader.term(terms)? I heard it is much faster than >> the query searching? Is that right? >> 2) Currently we are do the search row by row? Should I do it in batching? >> Like I will combine 100 PK search into one search, using Boolean term? So >> one search will give me back all the data in this 100 PK which are in the >> index. Then I can remove them from the index using the result set. In this >> case, I only need to do 1/100 search requests as before? This will much >> faster than row by row in theory. >> >> >> Please let me know any feedbacks? If you ever dealed with PK data support, >> please share some thougths and experience. >> >> Thanks for your kind help. >> >> _________________________________________________________________ >> Hotmail: Free, trusted and rich email service. >> http://clk.atdmt.com/GBL/go/171222984/direct/01/ >> > >