Re: solr as the data store
Problems:

1) If you get the schema wrong, it is painful to live with. You may need to extract all the data and reindex with your new schema. To ease this, I wrote an XSL script that massaged the default Solr XML output into the Solr XML input format. Extracting is really slow, and this process took days.

2) If you get a corrupted Lucene index, restoring from an old one will throw away all of your intervening updates.

I really don't recommend this course. If you can archive your input data on tape, you may be very happy you did so. If you must use a DB, MySQL has a table format that archives into ZIP format files. (It does not make indexes; all searches are table scans.)

Lance

On Wed, Jan 28, 2009 at 12:37 PM, Ian Connor wrote:
> Hi All,
>
> Is anyone using Solr (and thus the lucene index) as their database store?
>
> Up to now, we have been using a database to build Solr from. However, given
> that lucene already keeps the stored data intact, and that rebuilding from
> solr to solr can be very fast, the need for the separate database does not
> seem so necessary.
>
> It seems totally possible to maintain just the solr shards and treat them as
> the database (backups, redundancy, etc. are already built right in). The idea
> that we would need to rebuild from scratch seems unlikely, and the speed
> boost by using solr shards for data massaging and reindexing seems very
> appealing.
>
> Has anyone else thought about this or done this and run into problems that
> caused them to go back to a separate database model? Is there a critical
> need you can think of that is missing?
>
> --
> Regards,
>
> Ian Connor

--
Lance Norskog
goks...@gmail.com
650-922-8831 (US)
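For readers who want to see what the output-to-input conversion in point 1 involves, here is a minimal sketch in Python rather than XSL. It is not Lance's script. It assumes the standard Solr XML response layout (`<doc>` elements whose children carry a `name` attribute, with multi-valued fields wrapped in `<arr>`) and emits an `<add>` update message. The function name `response_to_add` is made up for illustration, and only stored fields can be recovered this way.

```python
import xml.etree.ElementTree as ET


def response_to_add(response_xml: str) -> str:
    """Convert a Solr XML query response into an <add> update message.

    Assumption: every child of <doc> has a name attribute, and
    multi-valued fields are wrapped in <arr name="..."> elements.
    """
    root = ET.fromstring(response_xml)
    add = ET.Element("add")
    for doc in root.iter("doc"):
        out_doc = ET.SubElement(add, "doc")
        for child in doc:
            name = child.get("name")
            # An <arr> holds one child element per value; anything else
            # (<str>, <int>, <date>, ...) is a single value.
            values = list(child) if child.tag == "arr" else [child]
            for value in values:
                field = ET.SubElement(out_doc, "field", name=name)
                field.text = value.text
    return ET.tostring(add, encoding="unicode")
```

Feeding it one page of query results at a time and posting the returned string to the update handler would reproduce the extract-and-reindex loop described above. The slow part Lance mentions, the extraction itself, is unchanged by this.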
Re: solr as the data store
We've been using a Lucene index as the main data store for ActiveMath. The indexing process takes the XML fragments apart and stores them in an organized way, including storage of the relationships both ways.

The difference between SQL and Lucene in this case? Pure Java was the major reason back then. The performance of Lucene stayed top as well (compared to XML databases).

As of now, because of 2.0, we have had to split out the storage of the fragments themselves, keeping the rest in Lucene, because the functionality to reliably read and write fields without ever having them loaded as single strings has been missing. Maybe it's back in 2.3...

Our fragments' sizes vary from 20 bytes to 2 MBytes... about 25k of them is normal.

I'm looking forward to one day moving it all to Solr, which would finally take care of it all in terms of index update and read management, adding a Luke-like web access.

Scalability of Lucene has always been top.
Joins are not there... I could get along without them.
Summaries are also not really there... but again, we could get along without them.

paul
Re: solr as the data store
The other option was actually CouchDB. It was very nice, but the benefits were not compelling compared to the pure simplicity of just having Solr. With replication now so simple to set up, it really does seem to solve all the problems we are looking to solve with a redundant distributed storage solution.

On Thu, Jan 29, 2009 at 12:50 AM, Neal Richter wrote:
> You might examine what the Apache CouchDB people have done.
>
> It's a document oriented DB that is able to use JSON structured
> documents combined with Lucene indexing of the documents with a
> RESTful HTTP interface.
>
> It's a stretch, and written in Erlang.. but perhaps there is some
> inspiration to be had for 'solr as the data store'.
>
> - Neal Richter

--
Regards,

Ian Connor
Re: solr as the data store
You might examine what the Apache CouchDB people have done.

It's a document-oriented DB that is able to use JSON-structured documents combined with Lucene indexing of the documents, with a RESTful HTTP interface.

It's a stretch, and written in Erlang... but perhaps there is some inspiration to be had for 'solr as the data store'.

- Neal Richter
Re: solr as the data store
But do note that there's also no requirement that all documents have the same fields. So you could consider storing a special "meta document" that had *no* fields in common with any other document that records whatever information you want about the current state of the index.

Best
Erick

On Wed, Jan 28, 2009 at 5:15 PM, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:
> There is no existing internal field like that.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
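Erick's "meta document" idea can be sketched as follows. This is only an illustration: the field names `meta_last_indexed_at` and `meta_doc_count`, the key value `index_meta`, and the function name are all hypothetical, and the schema would need to declare these fields (or matching dynamic fields). One departure from the "no fields in common" suggestion is labeled in the code: if the schema has a required unique key, the meta document has to carry it too, which also lets each new version overwrite the previous one.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def build_meta_document(last_indexed: datetime, doc_count: int) -> str:
    """Build a Solr <add> message holding one index-state document."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    fields = {
        # Assumption: the schema's unique key is called "id". Reusing a
        # fixed key means re-posting replaces the old meta document.
        "id": "index_meta",
        # Hypothetical field names; timestamp is written in UTC.
        "meta_last_indexed_at": last_indexed.astimezone(timezone.utc).strftime(
            "%Y-%m-%dT%H:%M:%SZ"
        ),
        "meta_doc_count": str(doc_count),
    }
    for name, value in fields.items():
        ET.SubElement(doc, "field", name=name).text = value
    return ET.tostring(add, encoding="unicode")
```

Posting this message after each indexing batch would leave a single record in the index that can be queried by its key after a restore to see how far the restored copy had got.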
Re: solr as the data store
There is no existing internal field like that.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message -----
> From: Ian Connor
> To: solr-user@lucene.apache.org
> Sent: Wednesday, January 28, 2009 4:59:28 PM
> Subject: Re: solr as the data store
>
> Is there an internal field to know when the last document hit the index or
> is this best to build your own "created_at" type field to know when you need
> to rebuild from?
Re: solr as the data store
My plan is that, with backups, the recovery will only need to be incremental.

Is there an internal field that records when the last document hit the index, or is it best to build your own "created_at" type field to know which point you need to rebuild from?

After the backup is restored, this field could be read, and then the restore from that time could kick in.

On Wed, Jan 28, 2009 at 4:34 PM, Feak, Todd wrote:
> Although the idea that you will need to rebuild from scratch is
> unlikely, you might want to fully understand the cost of recovery if you
> *do* have to.
>
> If it's incredibly expensive (time or money), you need to keep that in
> mind.
>
> -Todd

--
Regards,

Ian Connor
1 Leighton St #723
Cambridge, MA 02141
Call Center Phone: +1 (714) 239 3875 (24 hrs)
Fax: +1(770) 818 5697
Skype: ian.connor
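The incremental recovery described here can be sketched as a small filter. This is an illustration under stated assumptions, not something Solr provides: every document and every archived input record is assumed to carry a `created_at` value, the records are plain dicts, and the function name `records_to_replay` is made up. The comparison is inclusive so that records sharing the newest timestamp are sent again, which is harmless as long as re-adding a document with the same unique key overwrites it.

```python
from datetime import datetime
from typing import Dict, Iterable, List


def records_to_replay(
    restored_docs: Iterable[Dict],
    source_records: Iterable[Dict],
    field: str = "created_at",
) -> List[Dict]:
    """Return the source records that must be re-indexed after a restore."""
    timestamps = [doc[field] for doc in restored_docs]
    if not timestamps:
        # Nothing survived the restore: everything has to be replayed.
        return list(source_records)
    high_water = max(timestamps)
    # Inclusive comparison: replay ties with the newest restored document.
    return [rec for rec in source_records if rec[field] >= high_water]


# Small demonstration with made-up data.
restored = [
    {"id": "a", "created_at": datetime(2009, 1, 28, 12, 0)},
    {"id": "b", "created_at": datetime(2009, 1, 28, 13, 0)},
]
source = restored + [{"id": "c", "created_at": datetime(2009, 1, 28, 14, 0)}]
to_replay = records_to_replay(restored, source)
```

Note that this only works if the input records are still available somewhere outside the index, which is the point of the advice elsewhere in the thread about archiving input data.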
RE: solr as the data store
Although the idea that you will need to rebuild from scratch is unlikely, you might want to fully understand the cost of recovery if you *do* have to.

If it's incredibly expensive (time or money), you need to keep that in mind.

-Todd
Re: solr as the data store
This is perfectly fine. Of course, you lose any relational model. If you don't have or don't need one, why not?

It used to be the case that backups of live Lucene indices were hard, so people preferred having an RDBMS be the primary data source, the one they know how to back up and maintain well.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
Re: solr as the data store
One thing to keep in mind is that things like joins are impossible in solr, but easy in a database. So if you ever need to do stuff like run reports, you're probably better off with a database to query on - unless you cover your bases very well in the solr index.

Thanks for your time!

Matthew Runo
Software Engineer, Zappos.com
mr...@zappos.com - 702-943-7833
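One common reading of "cover your bases very well in the solr index" is to do the join before indexing, copying the related record's fields onto each document. The sketch below shows that idea with made-up order and customer records. All field names and the function name `denormalize` are hypothetical, and nothing in the thread says this is what Matthew had in mind.

```python
from typing import Dict, List


def denormalize(orders: List[Dict], customers: List[Dict]) -> List[Dict]:
    """Flatten orders and their customers into one document per order."""
    customers_by_id = {c["id"]: c for c in customers}
    docs = []
    for order in orders:
        # A missing customer yields None values rather than an error.
        customer = customers_by_id.get(order["customer_id"], {})
        docs.append(
            {
                "id": f"order-{order['id']}",
                "order_total": order["total"],
                "customer_name": customer.get("name"),
                "customer_city": customer.get("city"),
            }
        )
    return docs
```

The trade-off is the one raised above: a report that needs a relationship nobody flattened in advance means reindexing, whereas a database could answer it with a new query.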