Re: What is the best way to handle the primary key case during lucene indexing

Erick Erickson Mon, 16 Nov 2009 09:49:09 -0800

Sorry, forgot to add "then re-add the documents in question".

On Mon, Nov 16, 2009 at 12:45 PM, Erick Erickson <erickerick...@gmail.com> wrote:


> What is the form of the unique key? I'm a bit confused here by your
> comment:
> "which can contain one or multi fields".
>
> But it seems like IndexWriter.deleteDocuments should work here. It's easy
> if your PKs are single terms; there's even a deleteDocuments(Term[]) form.
> But this really *requires* that your PKs are single terms in a field. If
> your PKs are some sort of composite field, perhaps
> IndexWriter.deleteDocuments(Query[]) would help, where each query is
> enough to uniquely identify your document.
>
> Best
> Erick
>
>
> On Mon, Nov 16, 2009 at 12:15 PM, java8964 java8964
> <java8...@hotmail.com> wrote:
>
>>
>> Hi,
>>
>> In our application, we allow the user to define a primary key in the
>> document metadata. We are using Lucene 2.9.
>> In this case, when we index the data coming from the client, if the
>> metadata defines a primary key,
>> we have to do a search/update for every row based on the primary key.
>>
>> Here are our current problems:
>>
>> 1) If the metadata coming from the client defines a primary key (which can
>> contain one or multi fields),
>>    then for the data supplied by the client, we have to make sure that
>> a later row overrides a previous row if they have the same primary key.
>> 2) To do the above, we have to loop through the data first to check whether
>> any later rows contain the same PK as previous rows, so we build
>> a map in memory to override the previous rows with the latest ones.
>> This is a very expensive operation.
>> 3) Even then, for every row left after the above filter step, we still
>> have to search the current index to see whether any data with the same PK
>> exists. So we have to do the remove before we add the new data to the index.
>>
>> I want to know if anyone has had a similar PK requirement when using
>> Lucene. What is the best way to index data in this case?
>>
>> First, I am wondering whether it is possible to remove step 2 above.
>> The problem with Lucene is that when we add a document to the index, we
>> can NOT search for it before committing it.
>> But we only commit once, when the whole data file is finished. So we have
>> to loop through the data once to check whether any rows in the data file
>> share the same PK.
>> I am wondering if there is a way for the index writer, before it commits
>> anything, to merge the PK data when we add a new document to it. What I
>> mean is that if the same PK already exists in any previously added
>> document, it would just remove that document and let the newly added
>> document with the same PK take its place. If we can do this, then the
>> whole pre-checking step can be removed.
>>
>> Second, for step 3 above, if searching the existing index is NOT
>> avoidable, what is the fastest way to search by the PK? Of course we
>> already index all the PK fields. When we add new data, we have to search
>> the existing index by the PK fields for every row to see whether it
>> exists. If it does, we remove it and add the new one.
>> We construct the query from the PK fields at run time, then search row
>> by row. This is also very bad for indexing performance.
>>
>> Here is what I am thinking:
>> 1) Can I use IndexReader.term(terms)? I heard it is much faster than
>> query searching. Is that right?
>> 2) Currently we do the search row by row. Should I do it in batches?
>> For example, I would combine 100 PK searches into one search using a
>> Boolean query. Then one search would give me back all the data for those
>> 100 PKs that is in the index, and I can remove it from the index using
>> the result set. In this case, I only need to make 1/100 as many search
>> requests as before. This should be much faster than row by row in theory.
>>
>>
>> Please let me know if you have any feedback. If you have ever dealt with
>> PK data support, please share your thoughts and experience.
>>
>> Thanks for your kind help.
>>
>>
>
>
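
Steps 1 and 2 of the original question (a later row overrides an earlier row with the same PK) and the "single term" requirement in the reply can be illustrated without touching the index at all. The following is only a minimal sketch in plain Java, not something proposed in the thread: the class PkDedup, its methods, and the "|" separator are all hypothetical names chosen for illustration. It collapses a multi-field PK into one string (an alternative to the Query[] suggestion for composite keys) and keeps only the last row per key in a single pass.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
class PkDedup {

    /**
     * Joins the PK field values of one row into a single key string.
     * Backslashes and the '|' separator are escaped so that different
     * value combinations cannot produce the same key.
     */
    static String compositeKey(Map<String, String> row, List<String> pkFields) {
        StringBuilder sb = new StringBuilder();
        for (String field : pkFields) {
            String value = row.get(field);
            if (value == null) {
                value = ""; // assumption: a missing PK value is treated as empty
            }
            sb.append(value.replace("\\", "\\\\").replace("|", "\\|")).append('|');
        }
        return sb.toString();
    }

    /**
     * One pass over the rows: a later row replaces an earlier row that has
     * the same composite key. Iteration order is the order in which each
     * key was first seen.
     */
    static Map<String, Map<String, String>> lastRowPerKey(
            List<Map<String, String>> rows, List<String> pkFields) {
        Map<String, Map<String, String>> latest = new LinkedHashMap<>();
        for (Map<String, String> row : rows) {
            latest.put(compositeKey(row, pkFields), row);
        }
        return latest;
    }

    /** Small convenience for building a row from name/value pairs. */
    static Map<String, String> row(String... keyValues) {
        Map<String, String> r = new HashMap<>();
        for (int i = 0; i + 1 < keyValues.length; i += 2) {
            r.put(keyValues[i], keyValues[i + 1]);
        }
        return r;
    }

    public static void main(String[] args) {
        List<String> pkFields = Arrays.asList("region", "id");
        List<Map<String, String>> rows = new ArrayList<>();
        rows.add(row("region", "eu", "id", "1", "name", "old"));
        rows.add(row("region", "us", "id", "1", "name", "other"));
        rows.add(row("region", "eu", "id", "1", "name", "new"));

        // Print each surviving key with the "name" of the row that won.
        for (Map.Entry<String, Map<String, String>> e
                : lastRowPerKey(rows, pkFields).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().get("name"));
        }
    }
}
```

If such a key were also stored as a single untokenized field in each document (an assumption about the schema, not something stated in the thread), the keys of the returned map would be the candidates for the deleteDocuments(Term[]) call mentioned above, and the values would be the rows to re-add afterwards.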
