Best practice advice needed!
Hi,

We have an index of courses (about 4 million docs in prod) and a nightly job that picks up newly added courses and updates the index accordingly. Another Enterprise system shares the same table and can delete data from it too. I want to know the best practice for finding deleted records and removing them from my index. Unfortunately for us, we don't maintain a history of the deleted records, and that's a big bane. Please advise on what might be the best way to implement this?

-Sundar
Re: Best practice advice needed!
I am guessing your Enterprise system deletes/updates tables in an RDBMS, and your SOLR instance indexes that data. Additionally, you have a front-end interacting with SOLR and with the RDBMS.

At the front-end level, when a search sent to SOLR returns primary keys for data, you can check those primary keys against your database before committing output to end users.

To remove records from an index... the best-performing setup is Master-Slave SOLR instances: remove data from the Master SOLR, and commit/synchronize with the Slave nightly (when traffic is lowest). SOLR won't be in sync with the database, but you can always retrieve PKs from SOLR, check the database for those PKs, and 'filter' the output...

--
Thanks,
Fuad Efendi
416-993-2060 (cell)
Tokenizer Inc.
http://www.linkedin.com/in/liferay

Quoting sundar shankar [EMAIL PROTECTED]: [...]
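Fuad's front-end filtering idea can be sketched roughly like this. The function names and the shape of the SOLR hits are hypothetical stand-ins; with real systems, `db_lookup` would be an SQL query such as `SELECT id FROM courses WHERE id IN (...)`, and the hits would come from a SOLR query returning primary keys.

```python
def filter_stale_hits(solr_hits, db_lookup):
    """Keep only SOLR hits whose primary key still exists in the database.

    solr_hits: list of dicts with a 'pk' field, as returned by SOLR.
    db_lookup: callable taking a set of PKs and returning the subset
               that still exists in the source database.
    """
    pks = {hit["pk"] for hit in solr_hits}
    live = db_lookup(pks)
    # Filter out hits whose rows were deleted upstream
    return [hit for hit in solr_hits if hit["pk"] in live]


# Usage with a stand-in database that only still contains courses 1 and 3:
hits = [{"pk": 1, "title": "Algebra"},
        {"pk": 2, "title": "Biology"},
        {"pk": 3, "title": "Chemistry"}]
existing = {1, 3}
fresh = filter_stale_hits(hits, lambda pks: pks & existing)
# 'fresh' omits the course with pk=2, which was deleted upstream
```

Note this hides stale docs from users but, as Walter points out below in the thread, it does not fix result counts or remove the docs from the index.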
Re: Best practice advice needed!
How long does it take to build the entire index? Can you just rebuild it from scratch every night? That would be the simplest.

Best,
Erick

On Thu, Sep 25, 2008 at 12:48 PM, sundar shankar [EMAIL PROTECTED] wrote: [...]
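A minimal sketch of the full-rebuild approach: re-index every row from the source table into a fresh index, so rows deleted upstream disappear for free. The index here is just a dict keyed by primary key; with real SOLR you would post all documents and issue a single commit at the end, so searchers keep seeing the old index until the rebuild completes.

```python
def rebuild_index(fetch_all_rows):
    """Build a fresh {pk: doc} index from a full scan of the source table."""
    return {row["id"]: row for row in fetch_all_rows()}


# Usage: yesterday's index held pks 1 and 2; course 2 was since deleted
# upstream and course 3 added, so a full rebuild reflects both changes.
rows = [{"id": 1, "title": "Algebra"},
        {"id": 3, "title": "Chemistry"}]
new_index = rebuild_index(lambda: rows)
# new_index contains pks 1 and 3 only; the deleted course is simply absent
```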
Re: Best practice advice needed!
This will cause the result counts to be wrong, and the deleted docs will stay in the search index forever. Some approaches for incremental update:

* full sweep garbage collection: fetch every ID in the Solr DB and check whether it exists in the source DB, then delete the ones that don't exist.

* mark for deletion: change the DB to leave the record but flag it as deleted in a boolean row, then delete from Solr all deleted items in the source DB. The items marked for deletion can be deleted from the source DB at a later time.

* indexer scratchpad DB: a database used by the indexing code which shows all the IDs currently in the index, usually with a last-modified time. This is similar to the full sweep, but may be much faster with a dedicated DB.

This can get arbitrarily fancy. Web spiders work like this.

wunder

On 9/25/08 10:08 AM, Fuad Efendi [EMAIL PROTECTED] wrote: [...]
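The full-sweep garbage collection above reduces to a set difference. This sketch assumes you can enumerate IDs from both sides (e.g. paging through SOLR requesting only the ID field, and `SELECT id FROM courses` on the database); the names are illustrative, not real APIs.

```python
def ids_to_purge(solr_ids, db_ids):
    """Return the IDs present in the index but gone from the source DB."""
    return set(solr_ids) - set(db_ids)


# Usage: the index still holds 1, 2 and 3, but course 2 was deleted upstream.
stale = ids_to_purge([1, 2, 3], [1, 3])
# each ID in 'stale' would then be removed from SOLR with a delete-by-id
```

For 4 million docs the enumeration itself is the cost; that is why the scratchpad-DB variant, which keeps the index-side ID list in a dedicated database, may be much faster.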
Re: Best practice advice needed!
That should be flag it in a boolean column.

--wunder

On 9/25/08 11:51 AM, Walter Underwood [EMAIL PROTECTED] wrote: [...]
RE: Best practice advice needed!
Great, thanks.

Date: Thu, 25 Sep 2008 11:54:32 -0700
Subject: Re: Best practice advice needed!
From: [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
[...]
RE: Best practice advice needed!
Hi Fuad,

Since I don't have too much data (4 million docs), I don't have a master-slave setup yet. How big a change would that be?

Date: Thu, 25 Sep 2008 10:08:51 -0700
From: [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Subject: Re: Best practice advice needed!
[...]
Re: Best practice advice needed!
About web spiders: I simply use a last-modified timestamp field in SOLR, and I expire items after 30 days. If an item was updated (timestamp changed), it won't be deleted. If I delete it from the database, it will be deleted from SOLR within 30 days. Spiders don't need 'transactional' updates.

Recently I moved from MySQL to HBase; its row::column structure is a physically sorted, column-oriented structure. SOLR lazily follows database updates; it's a very specific case...

Quoting Walter Underwood [EMAIL PROTECTED]: [...]
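The expiry scheme above can be sketched as follows: every re-index refreshes a last-modified timestamp, and a periodic pass deletes anything not touched within 30 days. With real SOLR this would be a delete-by-query on a date field; here the index is modeled as a plain dict, and the field names are assumptions for illustration.

```python
import datetime

MAX_AGE = datetime.timedelta(days=30)

def expired_ids(index, now):
    """index maps doc id -> last-modified datetime; return ids past MAX_AGE."""
    return {doc_id for doc_id, seen in index.items() if now - seen > MAX_AGE}


# Usage: course 2 was deleted from the database, so it stopped being
# re-indexed and its timestamp went stale; courses 1 and 3 stay fresh.
now = datetime.datetime(2008, 9, 25)
index = {1: now,
         2: now - datetime.timedelta(days=45),
         3: now}
stale = expired_ids(index, now)
# only the id whose timestamp is older than 30 days is expired
```

The trade-off is the same one Fuad states: deletions are eventually consistent, lagging the database by up to the expiry window.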