You will want to have one Lucene field which contains this composite key - it could be the un-tokenized concatenation of all of the subkeys, for example - and then one Term would hold the full composite key, and the updateDocument technique would work fine.
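Something like this, for example (an untested sketch - "pk", "id1", and "id2" are made-up names, and the separator just needs to be a character that can never occur inside your key values):

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    void indexRow(IndexWriter writer, String id1, String id2, Document doc)
            throws IOException {
        // Concatenate the subkeys into one composite key.
        String pk = id1 + "\u0000" + id2;

        // NOT_ANALYZED, so the whole composite key is indexed as a single term.
        doc.add(new Field("pk", pk, Field.Store.NO, Field.Index.NOT_ANALYZED));

        // Deletes every document whose "pk" term equals this composite key
        // (including ones added earlier in the same uncommitted batch), then
        // adds the new document.
        writer.updateDocument(new Term("pk", pk), doc);
    }

Because the writer buffers the deletes itself, later rows in the same batch replace earlier rows with the same key, without any pre-checking pass or intermediate commits.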
-jake

On Mon, Nov 16, 2009 at 11:09 AM, java8964 java8964 <java8...@hotmail.com> wrote:

> But can IndexWriter.updateDocument(Term, Document) handle the composite key
> case?
>
> If my primary key contains field1 and field2, can I use one Term to include
> both field1 and field2?
>
> Thanks
>
> > Date: Mon, 16 Nov 2009 09:44:35 -0800
> > Subject: Re: What is the best way to handle the primary key case during
> > lucene indexing
> > From: jake.man...@gmail.com
> > To: java-user@lucene.apache.org
> >
> > The usual way to do this is to use:
> >
> >     IndexWriter.updateDocument(Term, Document)
> >
> > This method deletes all documents containing the given Term (this would be
> > your primary key), and then adds the Document you want to add. This is the
> > traditional way to do updates, and it is fast.
> >
> > -jake
> >
> > On Mon, Nov 16, 2009 at 9:15 AM, java8964 java8964 <java8...@hotmail.com> wrote:
> >
> > > Hi,
> > >
> > > In our application, we allow the user to define a primary key for the
> > > document. We are using Lucene 2.9. When we index the data coming from
> > > the client, if the metadata defines a primary key, we have to do a
> > > search/update for every row based on that key.
> > >
> > > Here are our current problems:
> > >
> > > 1) If the metadata coming from the client defines a primary key (which
> > > can contain one or more fields), then for the data supplied by the
> > > client we have to make sure that a later row overrides any previous row
> > > with the same primary key.
> > > 2) To do that, we first have to loop through the data to check whether
> > > any later row shares a PK with an earlier one, building a map in memory
> > > so that the latest row overrides the previous ones. This is a very
> > > expensive operation.
> > > 3) Even then, for every row that survives the filtering step, we still
> > > have to search the current index to see whether any data with the same
> > > PK exists, so we have to do a remove before we add the new data to the
> > > index.
> > >
> > > I want to know whether anyone has had a PK requirement like this with
> > > Lucene. What is the best way to index data in this case?
> > >
> > > First, is it possible to remove step 2 above? The problem with Lucene
> > > is that when we add a document to the index, we cannot search for it
> > > before we commit, and we only commit once, when the whole data file is
> > > finished. So we have to loop through the data once to check whether any
> > > rows in the file share the same PK. I am wondering whether the index
> > > writer, before it commits anything, could merge on the PK as new
> > > documents are added: if a document with the same PK has already been
> > > added, just remove it and let the newly added document take its place.
> > > If we can do this, the whole pre-checking step can be removed.
> > >
> > > Second, for step 3, if searching the existing index is unavoidable,
> > > what is the fastest way to search by the PK? Of course we have already
> > > indexed all the PK fields. When we add new data, we have to search every
> > > row of the existing index by the PK fields to see whether the key
> > > exists. If it does, we remove it and add the new one.
> > > We construct the query from the PK fields at run time, then search row
> > > by row. This is also very bad for indexing performance.
> > >
> > > Here is what I am thinking:
> > > 1) Can I use IndexReader.terms(Term)? I heard it is much faster than a
> > > query search. Is that right?
> > > 2) Currently we do the search row by row. Should I do it in batches?
> > > For example, I could combine 100 PK searches into one search using a
> > > BooleanQuery, so one search gives me back all the documents among those
> > > 100 PKs that are in the index. Then I can remove them from the index
> > > using the result set. In that case I would only need 1/100 of the search
> > > requests, which in theory should be much faster than row by row.
> > >
> > > Please let me know any feedback. If you have ever dealt with PK data
> > > support, please share some thoughts and experience.
> > >
> > > Thanks for your kind help.
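PS, on the two ideas at the end of your mail: yes, a raw term lookup via
IndexReader.termDocs() is cheaper than running a full Query for a single-term
primary-key check. A rough sketch (untested; assumes the composite key is
indexed un-tokenized in a "pk" field, as in the example above):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;

    // Returns true if some document with this composite key is in the index.
    static boolean pkExists(IndexReader reader, String pk) throws IOException {
        TermDocs td = reader.termDocs(new Term("pk", pk));
        try {
            return td.next();
        } finally {
            td.close();
        }
    }

For batched deletes, IndexWriter.deleteDocuments(Term[]) takes an array of
terms in one call, so you would not need to build a BooleanQuery by hand. But
note that with updateDocument the whole search-and-remove step disappears, so
you may not need either approach.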