I'd restart Solr after changing the schema.xml. The delta import does NOT
require a restart or anything like that.

The fact that two records are displayed is not what I'd expect. Solr
absolutely handles the replace via <uniqueKey>, so I suspect that you're
not actually doing what you think you are. A little-known aid for debugging
DIH is solr/admin/dataimport.jsp, which might give you some joy.
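
To make the replace behavior concrete, here is a toy model in Python. It is
not Solr code and every name in it (ToyIndex, add, search) is made up for
illustration. It only assumes what was described earlier in this thread: a
doc added with an existing key replaces the old one, the old copy is merely
marked deleted, and with no key defined nothing gets replaced.

```python
# Toy model of replace-by-uniqueKey semantics. NOT Solr code; all names
# here are hypothetical and exist only to illustrate the behavior.
class ToyIndex:
    def __init__(self, unique_key=None):
        self.unique_key = unique_key  # None models a commented-out <uniqueKey>
        self.docs = []                # every doc ever added; None = deleted
        self.live = {}                # key value -> position of the live doc

    def add(self, doc):
        if self.unique_key is not None:
            key = doc[self.unique_key]
            if key in self.live:
                # Mark the old copy as deleted instead of purging it.
                self.docs[self.live[key]] = None
            self.live[key] = len(self.docs)
        self.docs.append(doc)

    @property
    def max_doc(self):
        # Counts deleted docs too.
        return len(self.docs)

    @property
    def num_docs(self):
        # Counts only docs that are still live.
        return sum(1 for d in self.docs if d is not None)

    def search(self, **terms):
        # Deleted docs never show up in search results.
        return [d for d in self.docs
                if d is not None
                and all(d.get(f) == v for f, v in terms.items())]


old = {"TABLE_ID": 1, "ORDER_ID": 674659, "campaign_nm": "original"}
new = {"TABLE_ID": 1, "ORDER_ID": 674659, "campaign_nm": "changed"}

keyed = ToyIndex(unique_key="ORDER_ID")
keyed.add(old)
keyed.add(new)
keyed_hits = keyed.search(TABLE_ID=1, ORDER_ID=674659)

unkeyed = ToyIndex()  # no unique key defined
unkeyed.add(old)
unkeyed.add(new)
unkeyed_hits = unkeyed.search(TABLE_ID=1, ORDER_ID=674659)
```

In this sketch the keyed index returns one hit (the changed doc) and the
difference between max_doc and num_docs is 1, while the unkeyed index
returns both copies, which is the symptom you're describing.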

But, to summarize: this should work fine for DIH as far as Solr is concerned,
assuming that <uniqueKey> is properly defined. For your query above that
returns two documents, can you paste the entire response with &fl=* attached?
I'm guessing that the data in your index isn't what you're expecting...
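
If it's easier than typing the URL by hand, something like the following
builds the request with fl=* attached. Treat it as a sketch: the host, port
and /solr/select path are assumptions about a default single-core setup, and
build_select_url is just a helper name I made up.

```python
from urllib.parse import urlencode


def build_select_url(base_url, query, fields="*"):
    """Build a select URL that asks for the given stored fields."""
    params = {"q": query, "fl": fields}
    # urlencode escapes the colons, spaces and the asterisk for us.
    return f"{base_url}/select?{urlencode(params)}"


# Assumed base URL; adjust host, port and core path to your install.
url = build_select_url("http://localhost:8983/solr",
                       "TABLE_ID:1 AND ORDER_ID:674659")
print(url)
```

Paste the resulting URL into a browser (or fetch it with curl) and post the
full XML response.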

Also, you might want to get a copy of Luke and examine your index; there's a
wealth of information there.


Best
Erick


On Thu, Jul 7, 2011 at 11:12 AM, Mark juszczec <mark.juszc...@gmail.com> wrote:
> Erick
>
> I used to, but now I find I must have commented it out in a fit of rage ;-)
>
> This could be the whole problem.
>
> I have verified via admin schema browser that the field is ORDER_ID and will
> double check I refer to it in upper case in the appropriate places in the
> Solr config scheme.
>
> Curiously, the admin schema browser display for ORDER_ID says "hasDeletions:
> false"  - which seems the opposite of what I want.  I want to be able to
> delete duplicates.  Or am I interpreting this field wrong?
>
> In order to check for duplicates, I am going to use the admin browser to
> enter the following in the Make A Query box:
>
> TABLE_ID:1 AND ORDER_ID:674659
>
> When I click search and view the results, 2 records are displayed.  One has
> the original values, one has the changed values.  I haven't examined the xml
> (via view source) too closely and the next time I run I will look for
> something indicating one of the records is inactive.
>
> When you say "change your schema" do you mean via a delta import or by
> modifying the config files or both?  FWIW, I am deleting the index on the
> file system, doing a full import, modifying the data in the database and
> then doing a delta import.
>
> I am not restarting Solr at all in this process.
>
> I understand Solr does not perform key management.  You described exactly
> what I meant.  Sorry for any confusion.
>
> Mark
>
> On Thu, Jul 7, 2011 at 10:52 AM, Erick Erickson 
> <erickerick...@gmail.com>wrote:
>
>> Let me re-state a few things to see if I've got it right:
>>
>> > your schema.xml file has an entry like <uniqueKey>order_id</uniqueKey>,
>>   right?
>>
>> > given this definition, any document added with an order_id that already
>>   exists in the Solr index will be replaced. i.e. you should have one and
>>   only one document with a given order_id.
>>
>> > case matters. Check via the admin page ("schema browser") to see if you
>>   have two fields, order_id and ORDER_ID.
>>
>> > How are you checking that your docs are duplicates? If you do a search on
>>   order_id, you should get back one and only one document (assuming the
>>   definition above). A document that's deleted will just be marked as
>>   deleted, the data won't be purged from the index. It won't show in search
>>   results, but it will show if you use lower-level ways to access the data.
>>
>> > Whenever you change your schema, it's best to clean the index, restart
>>   the server and re-index from scratch. Solr won't retroactively remove
>>   duplicate <uniqueKey> entries.
>>
>> > On the admin/stats page you should see maxDoc and numDocs. The
>>   difference between these should be the number of deleted documents.
>>
>> > Solr doesn't "manage" unique keys. All that happens is Solr will replace
>>   any pre-existing documents where *you've* defined the <uniqueKey> when a
>>   new doc is added...
>>
>> Hope this helps
>> Erick
>>
>> On Thu, Jul 7, 2011 at 10:16 AM, Mark juszczec <mark.juszc...@gmail.com>
>> wrote:
>> > Bob
>> >
>> > No, I don't.  Let me look into that and post my results.
>> >
>> > Mark
>> >
>> >
>> > On Thu, Jul 7, 2011 at 10:14 AM, Bob Sandiford <
>> bob.sandif...@sirsidynix.com
>> >> wrote:
>> >
>> >> Hi, Mark.
>> >>
>> >> I haven't used DIH myself - so I'll need to leave comments on your set
>> up
>> >> to others who have done so.
>> >>
>> >> Another question - after your initial index create (and after each
>> delta),
>> >> do you run a 'commit'?  Do you run an 'optimize'?  (Without the
>> optimize,
>> >> 'deleted' records still show up in query results...)
>> >>
>> >> Bob Sandiford | Lead Software Engineer | SirsiDynix
>> >> P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
>> >> www.sirsidynix.com
>> >>
>> >>
>> >> > -----Original Message-----
>> >> > From: Mark juszczec [mailto:mark.juszc...@gmail.com]
>> >> > Sent: Thursday, July 07, 2011 10:04 AM
>> >> > To: solr-user@lucene.apache.org
>> >> > Subject: Re: updating existing data in index vs inserting new data in
>> >> > index
>> >> >
>> >> > Bob
>> >> >
>> >> > Thanks very much for the reply!
>> >> >
>> >> > I am using a unique integer called order_id as the Solr index key.
>> >> >
>> >> > My query, deltaQuery and deltaImportQuery are below:
>> >> >
>> >> > <entity name="item1"
>> >> >   pk="ORDER_ID"
>> >> >   query="select 1 as TABLE_ID, orders.order_id, orders.order_booked_ind,
>> >> >     orders.order_dt, orders.cancel_dt, orders.account_manager_id,
>> >> >     orders.of_header_id, orders.order_status_lov_id, orders.order_type_id,
>> >> >     orders.approved_discount_pct, orders.campaign_nm, orders.approved_by_cd,
>> >> >     orders.advertiser_id, orders.agency_id
>> >> >     from orders"
>> >> >
>> >> >   deltaImportQuery="select 1 as TABLE_ID, orders.order_id,
>> >> >     orders.order_booked_ind, orders.order_dt, orders.cancel_dt,
>> >> >     orders.account_manager_id, orders.of_header_id,
>> >> >     orders.order_status_lov_id, orders.order_type_id,
>> >> >     orders.approved_discount_pct, orders.campaign_nm, orders.approved_by_cd,
>> >> >     orders.advertiser_id, orders.agency_id
>> >> >     from orders
>> >> >     where orders.order_id = '${dataimporter.delta.ORDER_ID}'"
>> >> >
>> >> >   deltaQuery="select orders.order_id from orders
>> >> >     where orders.change_dt >
>> >> >     to_date('${dataimporter.last_index_time}','YYYY-MM-DD HH24:MI:SS')">
>> >> > </entity>
>> >> >
>> >> > The test I am running is two part:
>> >> >
>> >> > 1.  After I do a full import of the index, I insert a brand new record
>> >> > (with
>> >> > a never existed before order_id) in the database.  The delta import
>> >> > picks
>> >> > this up just fine.
>> >> >
>> >> > 2.  After the full import, I modify a record with an order_id that
>> >> > already
>> >> > shows up in the index.  I have verified there is only one record with
>> >> > this
>> >> > order_id in both the index and the db before I do the delta update.
>> >> >
>> >> > I guess the question is, am I screwing myself up by defining my own
>> Solr
>> >> > index key?  I want to, ultimately, be able to search on ORDER_ID in
>> the
>> >> > Solr
>> >> > index.  However, the docs say (I think) a field does not have to be
>> the
>> >> > Solr
>> >> > primary key in order to be searchable.  Would I be better off letting
>> >> > Solr
>> >> > manage the keys?
>> >> >
>> >> > Mark
>> >> >
>> >> > On Thu, Jul 7, 2011 at 9:24 AM, Bob Sandiford
>> >> > <bob.sandif...@sirsidynix.com>wrote:
>> >> >
>> >> > > What are you using as the unique id in your Solr index?  It sounds
>> >> > like you
>> >> > > may have one value as your Solr index unique id, which bears no
>> >> > resemblance
>> >> > > to a unique[1] id derived from your data...
>> >> > >
>> >> > > Or - another way to put it - what is it that makes these two records
>> >> > in
>> >> > > your Solr index 'the same', and what are the unique id's for those
>> two
>> >> > > entries in the Solr index?  How are those id's related to your
>> >> > original
>> >> > > data?
>> >> > >
>> >> > > [1] not only unique, but immutable.  I.E. if you update a row in
>> your
>> >> > > database, the unique id derived from that row has to be the same as
>> it
>> >> > would
>> >> > > have been before the update.  Otherwise, there's nothing for Solr to
>> >> > > recognize as a duplicate entry, and do a 'delete' and 'insert'
>> instead
>> >> > of
>> >> > > just an 'insert'.
>> >> > >
>> >> > > Bob Sandiford | Lead Software Engineer | SirsiDynix
>> >> > > P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
>> >> > > www.sirsidynix.com
>> >> > >
>> >> > >
>> >> > > > -----Original Message-----
>> >> > > > From: Mark juszczec [mailto:mark.juszc...@gmail.com]
>> >> > > > Sent: Thursday, July 07, 2011 9:15 AM
>> >> > > > To: solr-user@lucene.apache.org
>> >> > > > Subject: updating existing data in index vs inserting new data in
>> >> > index
>> >> > > >
>> >> > > > Hello all
>> >> > > >
>> >> > > > I'm using Solr 3.2 and am confused about updating existing data in
>> >> > an
>> >> > > > index.
>> >> > > >
>> >> > > > According to the DataImportHandler Wiki:
>> >> > > >
>> >> > > > *"delta-import* : For incremental imports and change detection run
>> >> > the
>> >> > > > command `http://<host>:<port>/solr/dataimport?command=delta-import
>> .
>> >> > It
>> >> > > > supports the same clean, commit, optimize and debug parameters as
>> >> > > > full-import command."
>> >> > > >
>> >> > > > I know delta-import will find new data in the database and insert
>> it
>> >> > > > into
>> >> > > > the index.  My problem is how it handles updates where I've got a
>> >> > record
>> >> > > > that exists in the index and the database, the database record is
>> >> > > > changed
>> >> > > > and I want to incorporate those changes in the existing record in
>> >> > the
>> >> > > > index.
>> >> > > >  IOW I don't want to insert it again.
>> >> > > >
>> >> > > > I've tried this and wound up with 2 records with the same key in
>> the
>> >> > > > index.
>> >> > > >  The first contains the original db values found when the index
>> was
>> >> > > > created,
>> >> > > > the 2nd contains the db values after the record was changed.
>> >> > > >
>> >> > > > I've also found this
>> >> > > >
>> >> >
>> http://search.lucidimagination.com/search/out?u=http%3A%2F%2Flucene.4720
>> >> > > > 66.n3.nabble.com%2FDelta-import-with-solrj-client-
>> >> > tp1085763p1086173.html
>> >> > > > the
>> >> > > > subject is 'Delta-import with solrj client'
>> >> > > >
>> >> > > > "Greetings. I have a *solrj* client for fetching data from
>> database.
>> >> > I
>> >> > > > am
>> >> > > > using *delta*-*import* for fetching data. If a column is changed
>> in
>> >> > > > database
>> >> > > > using timestamp with *delta*-*import* i get the latest column
>> >> > indexed
>> >> > > > but
>> >> > > > there are *duplicate* values in the index similar to the column
>> but
>> >> > the
>> >> > > > data
>> >> > > > is older. This works with cleaning the index but i want to update
>> >> > the
>> >> > > > index
>> >> > > > without cleaning it. Is there a way to just update the index with
>> >> > the
>> >> > > > updated column without having *duplicate* values. Appreciate for
>> any
>> >> > > > feedback.
>> >> > > >
>> >> > > > Hando"
>> >> > > >
>> >> > > > There are 2 responses:
>> >> > > >
>> >> > > > "Short answer is no, there isn't a way. *Solr* doesn't have the
>> >> > concept
>> >> > > > of
>> >> > > > 'Update' to an indexed document. You need to add the full document
>> >> > (all
>> >> > > > 'columns') each time any one field changes. If doing that in your
>> >> > > > DataImportHandler logic is difficult you may need to write a
>> >> > separate
>> >> > > > Update
>> >> > > > Service that does:
>> >> > > >
>> >> > > > 1) Read UniqueID, UpdatedColumn(s)  from database
>> >> > > > 2) Using UniqueID Retrieve document from *Solr*
>> >> > > > 3) Add/Update field(s) with updated column(s)
>> >> > > > 4) Add document back to *Solr*
>> >> > > >
>> >> > > > Although, if you use DIH to do a full *import*, using the same
>> query
>> >> > in
>> >> > > > your *Delta*-*Import* to get the whole document shouldn't be that
>> >> > > > difficult."
>> >> > > >
>> >> > > > and
>> >> > > >
>> >> > > > "Hi,
>> >> > > >
>> >> > > > Make sure you use a proper "ID" field, which does *not* change
>> even
>> >> > if
>> >> > > > the
>> >> > > > content in the database changes. In this way, when your
>> >> > > > *delta*-*import* fetches
>> >> > > > changed rows to index, they will update the existing rows in your
>> >> > index.
>> >> > > > "
>> >> > > >
>> >> > > > I have an ID field that doesn't change.  It is the primary key
>> field
>> >> > > > from
>> >> > > > the database table I am trying to index and I have verified it is
>> >> > > > unique.
>> >> > > >
>> >> > > > So, does Solr allow updates (not inserts) of existing records?  Is
>> >> > > > anyone
>> >> > > > able to do this?
>> >> > > >
>> >> > > > Mark
>> >> > >
>> >> > >
>> >>
>> >>
>> >
>>
>
