Re: updating existing data in index vs inserting new data in index

Erick Erickson Thu, 07 Jul 2011 07:52:35 -0700

Let me re-state a few things to see if I've got it right:

> your schema.xml file has an entry like <uniqueKey>order_id</uniqueKey>, right?


> given this definition, any document added with an order_id that already
   exists in the Solr index will be replaced, i.e. you should have one and
   only one document with a given order_id.

> case matters. Check via the admin page ("schema browser") to see if you have
   two fields, order_id and ORDER_ID.

> How are you checking that your docs are duplicates? If you do a search on
   order_id, you should get back one and only one document (assuming the
   definition above). A document that's deleted or replaced will just be
   marked as deleted; the data won't be purged from the index. It won't show
   in search results, but it will show if you use lower-level ways to access
   the data.

> Whenever you change your schema, it's best to clean the index, restart the
   server and re-index from scratch. Solr won't retroactively remove duplicate
   <uniqueKey> entries.

> On the admin stats page (admin/stats) you should see maxDoc and numDocs. The
   difference between these should be the number of deleted documents.

> Solr doesn't "manage" unique keys. All that happens is that when a new doc
   is added, Solr replaces any pre-existing document that has the same value
   in the field *you've* defined as the <uniqueKey>...
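
To make the replace-on-add behavior concrete, here is a minimal Python sketch.
It is a toy model under my own assumptions, not Solr code: the ToyIndex class,
its method names and the sample values are made up for illustration. It only
mirrors the points above: an add with an existing key marks the old document
as deleted, searches skip deleted documents, and the gap between the two
counters is the number of deleted documents.

```python
class ToyIndex:
    """Toy model of replace-on-add by unique key. Hypothetical, not Solr's API."""

    def __init__(self, unique_key):
        self.unique_key = unique_key  # field name; lookups are case-sensitive
        self.slots = []               # every document ever written
        self.deleted = set()          # slot numbers marked as deleted
        self.live = {}                # key value -> slot of the live document

    def add(self, doc):
        key = doc[self.unique_key]    # "ORDER_ID" would not match "order_id"
        if key in self.live:
            # Same key already indexed: mark the old copy deleted, keep its data.
            self.deleted.add(self.live[key])
        self.slots.append(dict(doc))
        self.live[key] = len(self.slots) - 1

    @property
    def max_doc(self):
        # Counts every slot, including documents marked as deleted.
        return len(self.slots)

    @property
    def num_docs(self):
        # Counts only live (searchable) documents.
        return len(self.slots) - len(self.deleted)

    def search(self, field, value):
        # Deleted documents never show up in search results.
        return [doc for slot, doc in enumerate(self.slots)
                if slot not in self.deleted and doc.get(field) == value]


index = ToyIndex(unique_key="order_id")
index.add({"order_id": 1001, "campaign_nm": "spring"})
index.add({"order_id": 1001, "campaign_nm": "summer"})  # same key: replaces

print(index.search("order_id", 1001))   # only the newer document is returned
print(index.num_docs, index.max_doc)    # live count vs. total slots
```

In this sketch the search returns a single document for order_id 1001, and
max_doc minus num_docs is 1, which is the replaced copy still sitting in the
index. If your real index returns two documents for one order_id, the key
field is probably not what you think it is.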

Hope this helps
Erick

On Thu, Jul 7, 2011 at 10:16 AM, Mark juszczec <mark.juszc...@gmail.com> wrote:
> Bob
>
> No, I don't.  Let me look into that and post my results.
>
> Mark
>
>
> On Thu, Jul 7, 2011 at 10:14 AM, Bob Sandiford <bob.sandif...@sirsidynix.com> wrote:
>
>> Hi, Mark.
>>
>> I haven't used DIH myself - so I'll need to leave comments on your set up
>> to others who have done so.
>>
>> Another question - after your initial index create (and after each delta),
>> do you run a 'commit'?  Do you run an 'optimize'?  (Without the optimize,
>> 'deleted' records still show up in query results...)
>>
>> Bob Sandiford | Lead Software Engineer | SirsiDynix
>> P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
>> www.sirsidynix.com
>>
>>
>> > -----Original Message-----
>> > From: Mark juszczec [mailto:mark.juszc...@gmail.com]
>> > Sent: Thursday, July 07, 2011 10:04 AM
>> > To: solr-user@lucene.apache.org
>> > Subject: Re: updating existing data in index vs inserting new data in
>> > index
>> >
>> > Bob
>> >
>> > Thanks very much for the reply!
>> >
>> > I am using a unique integer called order_id as the Solr index key.
>> >
>> > My query, deltaQuery and deltaImportQuery are below:
>> >
>> > <entity name="item1"
>> >   pk="ORDER_ID"
>> >   query="select 1 as TABLE_ID , orders.order_id,
>> > orders.order_booked_ind,
>> > orders.order_dt, orders.cancel_dt,     orders.account_manager_id,
>> > orders.of_header_id, orders.order_status_lov_id, orders.order_type_id,
>> > orders.approved_discount_pct, orders.campaign_nm,
>> > orders.approved_by_cd,orders.advertiser_id, orders.agency_id from
>> > orders"
>> >
>> >   deltaImportQuery="select 1 as TABLE_ID, orders.order_id,
>> > orders.order_booked_ind, orders.order_dt, orders.cancel_dt,
>> > orders.account_manager_id, orders.of_header_id,
>> > orders.order_status_lov_id,
>> > orders.order_type_id, orders.approved_discount_pct, orders.campaign_nm,
>> > orders.approved_by_cd,orders.advertiser_id, orders.agency_id from orders
>> > where orders.order_id = '${dataimporter.delta.ORDER_ID}'"
>> >
>> >   deltaQuery="select orders.order_id from orders where orders.change_dt
>> > >
>> > to_date('${dataimporter.last_index_time}','YYYY-MM-DD HH24:MI:SS')" >
>> >         </entity>
>> >
>> > The test I am running is two part:
>> >
>> > 1.  After I do a full import of the index, I insert a brand new record
>> > (with
>> > a never existed before order_id) in the database.  The delta import
>> > picks
>> > this up just fine.
>> >
>> > 2.  After the full import, I modify a record with an order_id that
>> > already
>> > shows up in the index.  I have verified there is only one record with
>> > this
>> > order_id in both the index and the db before I do the delta update.
>> >
>> > I guess the question is, am I screwing myself up by defining my own Solr
>> > index key?  I want to, ultimately, be able to search on ORDER_ID in the
>> > Solr
>> > index.  However, the docs say (I think) a field does not have to be the
>> > Solr
>> > primary key in order to be searchable.  Would I be better off letting
>> > Solr
>> > manage the keys?
>> >
>> > Mark
>> >
>> > On Thu, Jul 7, 2011 at 9:24 AM, Bob Sandiford
>> > <bob.sandif...@sirsidynix.com> wrote:
>> >
>> > > What are you using as the unique id in your Solr index?  It sounds
>> > like you
>> > > may have one value as your Solr index unique id, which bears no
>> > resemblance
>> > > to a unique[1] id derived from your data...
>> > >
>> > > Or - another way to put it - what is it that makes these two records
>> > in
>> > > your Solr index 'the same', and what are the unique id's for those two
>> > > entries in the Solr index?  How are those id's related to your
>> > original
>> > > data?
>> > >
>> > > [1] not only unique, but immutable.  I.E. if you update a row in your
>> > > database, the unique id derived from that row has to be the same as it
>> > would
>> > > have been before the update.  Otherwise, there's nothing for Solr to
>> > > recognize as a duplicate entry, and do a 'delete' and 'insert' instead
>> > of
>> > > just an 'insert'.
>> > >
>> > > Bob Sandiford | Lead Software Engineer | SirsiDynix
>> > > P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
>> > > www.sirsidynix.com
>> > >
>> > >
>> > > > -----Original Message-----
>> > > > From: Mark juszczec [mailto:mark.juszc...@gmail.com]
>> > > > Sent: Thursday, July 07, 2011 9:15 AM
>> > > > To: solr-user@lucene.apache.org
>> > > > Subject: updating existing data in index vs inserting new data in
>> > index
>> > > >
>> > > > Hello all
>> > > >
>> > > > I'm using Solr 3.2 and am confused about updating existing data in
>> > an
>> > > > index.
>> > > >
>> > > > According to the DataImportHandler Wiki:
>> > > >
>> > > > *"delta-import* : For incremental imports and change detection run
>> > the
>> > > > command `http://<host>:<port>/solr/dataimport?command=delta-import .
>> > It
>> > > > supports the same clean, commit, optimize and debug parameters as
>> > > > full-import command."
>> > > >
>> > > > I know delta-import will find new data in the database and insert it
>> > > > into
>> > > > the index.  My problem is how it handles updates where I've got a
>> > record
>> > > > that exists in the index and the database, the database record is
>> > > > changed
>> > > > and I want to incorporate those changes in the existing record in
>> > the
>> > > > index.
>> > > >  IOW I don't want to insert it again.
>> > > >
>> > > > I've tried this and wound up with 2 records with the same key in the
>> > > > index.
>> > > >  The first contains the original db values found when the index was
>> > > > created,
>> > > > the 2nd contains the db values after the record was changed.
>> > > >
>> > > > I've also found this
>> > > >
>> > > > http://search.lucidimagination.com/search/out?u=http%3A%2F%2Flucene.472066.n3.nabble.com%2FDelta-import-with-solrj-client-tp1085763p1086173.html
>> > > >
>> > > > the subject is 'Delta-import with solrj client'
>> > > >
>> > > > "Greetings. I have a *solrj* client for fetching data from database.
>> > I
>> > > > am
>> > > > using *delta*-*import* for fetching data. If a column is changed in
>> > > > database
>> > > > using timestamp with *delta*-*import* i get the latest column
>> > indexed
>> > > > but
>> > > > there are *duplicate* values in the index similar to the column but
>> > the
>> > > > data
>> > > > is older. This works with cleaning the index but i want to update
>> > the
>> > > > index
>> > > > without cleaning it. Is there a way to just update the index with
>> > the
>> > > > updated column without having *duplicate* values. Appreciate for any
>> > > > feedback.
>> > > >
>> > > > Hando"
>> > > >
>> > > > There are 2 responses:
>> > > >
>> > > > "Short answer is no, there isn't a way. *Solr* doesn't have the
>> > concept
>> > > > of
>> > > > 'Update' to an indexed document. You need to add the full document
>> > (all
>> > > > 'columns') each time any one field changes. If doing that in your
>> > > > DataImportHandler logic is difficult you may need to write a
>> > separate
>> > > > Update
>> > > > Service that does:
>> > > >
>> > > > 1) Read UniqueID, UpdatedColumn(s)  from database
>> > > > 2) Using UniqueID Retrieve document from *Solr*
>> > > > 3) Add/Update field(s) with updated column(s)
>> > > > 4) Add document back to *Solr*
>> > > >
>> > > > Although, if you use DIH to do a full *import*, using the same query
>> > in
>> > > > your *Delta*-*Import* to get the whole document shouldn't be that
>> > > > difficult."
>> > > >
>> > > > and
>> > > >
>> > > > "Hi,
>> > > >
>> > > > Make sure you use a proper "ID" field, which does *not* change even
>> > if
>> > > > the
>> > > > content in the database changes. In this way, when your
>> > > > *delta*-*import* fetches
>> > > > changed rows to index, they will update the existing rows in your
>> > index.
>> > > > "
>> > > >
>> > > > I have an ID field that doesn't change.  It is the primary key field
>> > > > from
>> > > > the database table I am trying to index and I have verified it is
>> > > > unique.
>> > > >
>> > > > So, does Solr allow updates (not inserts) of existing records?  Is
>> > > > anyone
>> > > > able to do this?
>> > > >
>> > > > Mark
>> > >
>> > >
>>
>>
>
