Best practice advice needed!

2008-09-25 Thread sundar shankar
Hi,
  We have an index of courses (about 4 million docs in prod) and a nightly job that picks up newly added courses and updates the index accordingly. Another enterprise system shares the same table and can delete data from it too.

I just want to know the best practice for finding deleted records and removing them from my index. Unfortunately for us, we don't maintain a history of the deleted records, and that's a big pain.

Please advise on what might be the best way to implement this?

-Sundar


Re: Best practice advice needed!

2008-09-25 Thread Fuad Efendi
I am guessing your enterprise system deletes/updates tables in an RDBMS,
and your Solr instance indexes that data. In addition, you have a front
end interacting with both Solr and the RDBMS. At the front-end level,
when a search sent to Solr returns primary keys, you can check the
database for those primary keys before returning the output to end users.
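
A minimal sketch of that query-time check with SolrJ and JDBC; the course
table, the id field, and the connection strings are illustrative
assumptions, not anything from your actual setup:

    import java.sql.*;
    import java.util.*;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrDocument;

    public class FilteredSearch {
        public static List<SolrDocument> search(String userQuery) throws Exception {
            CommonsHttpSolrServer solr =
                new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrQuery q = new SolrQuery(userQuery);
            q.setRows(20);

            Connection conn = DriverManager.getConnection(
                "jdbc:mysql://dbhost/courses", "user", "pass");
            PreparedStatement ps =
                conn.prepareStatement("SELECT 1 FROM course WHERE id = ?");

            // Keep only the hits whose primary key still exists in the table.
            List<SolrDocument> live = new ArrayList<SolrDocument>();
            for (SolrDocument doc : solr.query(q).getResults()) {
                ps.setString(1, (String) doc.getFieldValue("id"));
                ResultSet rs = ps.executeQuery();
                if (rs.next()) {
                    live.add(doc); // row still present -- safe to show
                }
                rs.close();
            }
            ps.close();
            conn.close();
            return live;
        }
    }

One SELECT per hit is fine for a page of 20 results; batching the keys
into a single IN (...) query would cut the round trips.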


To remove records from an index, the best-performing setup is master-slave
Solr instances: remove data from the master Solr and commit/synchronize
with the slave nightly (when traffic is lowest). Solr won't be in sync
with the database, but you can always retrieve PKs from Solr, check the
database for those PKs, and 'filter' the output...
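
As a rough sketch (the host name and key are made up), the master-side
delete is just:

    // Delete by primary key on the master; the slave picks it up on the
    // next nightly sync.
    CommonsHttpSolrServer master =
        new CommonsHttpSolrServer("http://master:8983/solr");
    master.deleteById("12345"); // PK of the row removed from the table
    master.commit();            // deletes only become visible after a commit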


--
Thanks,

Fuad Efendi
416-993-2060(cell)
Tokenizer Inc.
==
http://www.linkedin.com/in/liferay



Re: Best practice advice needed!

2008-09-25 Thread Erick Erickson
How long does it take to build the entire index? Can you just rebuild it
from scratch every night? That would be the simplest.
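
A hedged sketch of such a nightly rebuild with SolrJ (table, columns, and
connection details are illustrative, not your actual schema). Because Solr
only exposes changes at commit time, the delete-all and the re-adds become
visible to searchers together:

    import java.sql.*;
    import java.util.*;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class NightlyRebuild {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer solr =
                new CommonsHttpSolrServer("http://localhost:8983/solr");
            solr.deleteByQuery("*:*"); // staged; invisible until the commit below

            Connection conn = DriverManager.getConnection(
                "jdbc:mysql://dbhost/courses", "user", "pass");
            Statement st = conn.createStatement();
            ResultSet rs =
                st.executeQuery("SELECT id, name, description FROM course");

            Collection<SolrInputDocument> batch =
                new ArrayList<SolrInputDocument>();
            while (rs.next()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", rs.getString("id"));
                doc.addField("name", rs.getString("name"));
                doc.addField("description", rs.getString("description"));
                batch.add(doc);
                if (batch.size() == 1000) { // index in batches of 1000
                    solr.add(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) solr.add(batch);
            rs.close(); st.close(); conn.close();

            solr.commit();   // searchers switch to the new index at once
            solr.optimize(); // optional: merge segments after the rebuild
        }
    }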

Best
Erick



Re: Best practice advice needed!

2008-09-25 Thread Walter Underwood
Filtering at the front end will cause the result counts to be wrong, and
the deleted docs will stay in the search index forever.

Some approaches for incremental update:

* full sweep garbage collection: fetch every ID in the Solr index and
check whether it exists in the source DB, then delete the ones that
don't (see the sketch after this list).

* mark for deletion: change the DB to leave the record but flag it
as deleted in a boolean row, then delete from Solr all deleted
items in the source DB. The items marked for deletion can be
deleted from the source DB at a later time.

* indexer scratchpad DB: a database used by the indexing code which
shows all the IDs currently in the index, usually with a last-modified
time. This is similar to the full sweep, but may be much faster with
a dedicated DB. This can get arbitrarily fancy. Web spiders work like this.
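
A sketch of the full-sweep collector, assuming SolrJ, a course table, and
an id primary key (all names illustrative). Staged deletes stay invisible
until the final commit, so the paging remains stable while the sweep runs:

    import java.sql.*;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrDocumentList;

    public class FullSweep {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer solr =
                new CommonsHttpSolrServer("http://localhost:8983/solr");
            Connection conn = DriverManager.getConnection(
                "jdbc:mysql://dbhost/courses", "user", "pass");
            PreparedStatement ps =
                conn.prepareStatement("SELECT 1 FROM course WHERE id = ?");

            int start = 0, rows = 1000;
            while (true) {
                SolrQuery q = new SolrQuery("*:*");
                q.setFields("id"); // only the primary key is needed
                q.setStart(start);
                q.setRows(rows);
                SolrDocumentList page = solr.query(q).getResults();
                if (page.isEmpty()) break;
                for (SolrDocument doc : page) {
                    String id = (String) doc.getFieldValue("id");
                    ps.setString(1, id);
                    ResultSet rs = ps.executeQuery();
                    if (!rs.next()) solr.deleteById(id); // orphan: gone from DB
                    rs.close();
                }
                start += rows;
            }
            solr.commit(); // all orphan deletes become visible at once
            ps.close();
            conn.close();
        }
    }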

wunder

On 9/25/08 10:08 AM, Fuad Efendi [EMAIL PROTECTED] wrote:

 I am guessing your enterprise system deletes/updates tables in an RDBMS,
 and your Solr instance indexes that data. In addition, you have a front
 end interacting with both Solr and the RDBMS. At the front-end level,
 when a search sent to Solr returns primary keys, you can check the
 database for those primary keys before returning the output to end users.



Re: Best practice advice needed!

2008-09-25 Thread Walter Underwood
That should be "flag it in a boolean column." --wunder





RE: Best practice advice needed!

2008-09-25 Thread sundar shankar
Great, thanks.




RE: Best practice advice needed!

2008-09-25 Thread sundar shankar
Hi Fuad,
Since I don't have too much data (4 million docs), I don't have a
master-slave setup yet. How big a change would that be?




Re: Best practice advice needed!

2008-09-25 Thread Fuad Efendi
About web spiders: I simply use a last-modified timestamp field in Solr
and expire items after 30 days. If an item was updated (timestamp
changed), it won't be deleted. If I delete it from the database, it will
be deleted from Solr within 30 days. Spiders don't need 'transactional'
updates.
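
In Solr terms that expiry is a single delete-by-query with date math; a
sketch, assuming each document carries a last_modified date field that
the indexer refreshes on every update:

    // Everything not touched in the last 30 days is considered expired.
    CommonsHttpSolrServer solr =
        new CommonsHttpSolrServer("http://localhost:8983/solr");
    solr.deleteByQuery("last_modified:[* TO NOW-30DAYS]");
    solr.commit();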


Recently I moved from MySQL to HBase; its row::column structure is
physically sorted and column-oriented. Solr lazily follows the database
updates; it's a very specific case...


