Re: solr as the data store

2009-02-01 Thread Lance Norskog
Problems:

1) If you get the schema wrong it is painful to live with. You may need to
extract all data and reindex with your new schema. To ease this I wrote an
XSL script that massaged the default Solr XML output into the Solr XML input
format. Extracting is really slow and this process took days.

2) If you get a corrupted Lucene index, restoring from an old one will throw
away all of your intervening updates.
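
For problem 1, the extract-and-reindex conversion can be sketched in a few lines. This is not the XSL script mentioned above, only a rough Python equivalent. It assumes the standard XML response layout (`<doc>` elements whose typed children carry a `name` attribute, with `<arr>` wrapping multi-valued fields) and the `<add><doc><field>` update format. The function name `response_to_add` is made up for illustration.

```python
import xml.etree.ElementTree as ET


def response_to_add(response_xml: str) -> str:
    """Convert a Solr XML query response into an <add> update message."""
    root = ET.fromstring(response_xml)
    add = ET.Element("add")
    for doc in root.iter("doc"):
        out_doc = ET.SubElement(add, "doc")
        for child in doc:
            name = child.get("name")
            # Multi-valued fields arrive as <arr name="..."> wrapping typed values;
            # single-valued fields are the typed element itself.
            values = list(child) if child.tag == "arr" else [child]
            for value in values:
                field = ET.SubElement(out_doc, "field", {"name": name})
                field.text = value.text or ""
    return ET.tostring(add, encoding="unicode")
```

Only stored fields come back in a query response, so anything that was indexed but not stored is lost in this round trip.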

I really don't recommend this course. If you can archive your input data on
tape, you may be very happy you did so. If you must use a DB, MySQL has a
table format (the ARCHIVE storage engine) that stores rows in zlib-compressed
files. (It does not build indexes; all searches are table scans.)

Lance

On Wed, Jan 28, 2009 at 12:37 PM, Ian Connor wrote:

> Hi All,
>
> Is anyone using Solr (and thus the Lucene index) as their database store?
>
> Up to now, we have been using a database to build Solr from. However, given
> that Lucene already keeps the stored data intact, and that rebuilding from
> Solr to Solr can be very fast, the need for the separate database does not
> seem so necessary.
>
> It seems totally possible to maintain just the Solr shards and treat them
> as the database (backups, redundancy, etc. are already built right in). The
> idea that we would need to rebuild from scratch seems unlikely, and the
> speed boost from using Solr shards for data massaging and reindexing seems
> very appealing.
>
> Has anyone else thought about this, or done this and run into problems that
> caused them to go back to a separate database model? Is there a critical
> need you think is missing?
>
> --
> Regards,
>
> Ian Connor
>



-- 
Lance Norskog
goks...@gmail.com
650-922-8831 (US)


Re: solr as the data store

2009-01-30 Thread Paul Libbrecht
We've been using a Lucene index as the main data store for ActiveMath.
The indexing process takes the XML fragments apart and stores them in
an organized way, including the relationships in both directions.


The difference between SQL and Lucene in this case? Pure Java was the
major reason back then. Lucene's performance has also stayed on top
(compared to XML databases).


As of now, because of 2.0, we have had to split out the storage of the
fragments themselves and keep the rest in Lucene, because we have been
missing the ability to reliably read and write fields without ever
having them loaded as single strings. Maybe it's back in 2.3...


Our fragments vary in size from 20 bytes to 2 MB... about 25k of them
is normal.


I'm looking forward to moving it all to Solr one day, which would
finally take care of everything in terms of index update and read
management, and add Luke-like web access.


Lucene's scalability has always been top-notch.
Joins are not there... I could get along without them.
Summaries are also not really there... but again, we could get along
without them.


paul








Re: solr as the data store

2009-01-30 Thread Ian Connor
The other option was actually CouchDB. It was very nice, but the benefits
were not compelling compared to the pure simplicity of just having Solr.

With replication now so simple to set up, it really does seem to meet all
the needs we have for a redundant, distributed storage solution.




-- 
Regards,

Ian Connor


Re: solr as the data store

2009-01-28 Thread Neal Richter
You might examine what the Apache CouchDB people have done.

It's a document-oriented DB that is able to use JSON-structured
documents combined with Lucene indexing of the documents with a
RESTful HTTP interface.

It's a stretch, and written in Erlang... but perhaps there is some
inspiration to be had for 'solr as the data store'.

- Neal Richter


Re: solr as the data store

2009-01-28 Thread Erick Erickson
But do note that there's also no requirement that all documents
have the same fields. So you could consider storing a special
"meta document" that had *no* fields in common with any other
document that records whatever information you want about the
current state of the index.
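
A minimal sketch of such a meta document, built as a Solr XML `<add>` message. The document id `index_meta` and the field name `meta_last_indexed_at` are hypothetical, and the sketch assumes the schema declares `id` as its unique key (so the meta document has to carry it too) and accepts the extra date field, for example through a dynamic field.

```python
from datetime import datetime, timezone
import xml.etree.ElementTree as ET

META_ID = "index_meta"  # hypothetical id reserved for the meta document


def build_meta_doc(last_indexed_at: datetime) -> str:
    """Build an <add> message recording when the index was last updated."""
    # Solr date fields use ISO 8601 in UTC with a trailing Z.
    stamp = last_indexed_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    # "meta_last_indexed_at" is an assumed field name used by no other document.
    for name, value in (("id", META_ID), ("meta_last_indexed_at", stamp)):
        ET.SubElement(doc, "field", {"name": name}).text = value
    return ET.tostring(add, encoding="unicode")
```

Re-adding a document with the same unique key replaces the previous copy, so the index only ever holds one meta document.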

Best
Erick

On Wed, Jan 28, 2009 at 5:15 PM, Otis Gospodnetic <
otis_gospodne...@yahoo.com> wrote:

> There is no existing internal field like that.


Re: solr as the data store

2009-01-28 Thread Otis Gospodnetic
There is no existing internal field like that.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch






Re: solr as the data store

2009-01-28 Thread Ian Connor
My plan is that, with backups, recovery will only need to be incremental.

Is there an internal field that records when the last document hit the index,
or is it best to build your own "created_at"-type field to know the point
you need to rebuild from?

After the backup is restored, this field could be read, and the restore
could then pick up from that time.
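
A rough sketch of that incremental restore, assuming every document carries a `created_at` field. The base URL, the field name, and both function names are placeholders. `q`, `sort`, `rows`, and `fl` are the usual Solr query parameters.

```python
from urllib.parse import urlencode


def latest_doc_query(base_url: str, field: str = "created_at") -> str:
    """Build a select URL that returns only the newest value of `field`."""
    params = {"q": "*:*", "sort": f"{field} desc", "rows": 1, "fl": field}
    return f"{base_url}/select?{urlencode(params)}"


def records_to_replay(records, high_water_mark: str, field: str = "created_at"):
    """Pick the archived records at or after the restored index's newest timestamp.

    Timestamps are assumed to be UTC strings in one fixed ISO 8601 layout,
    so plain string comparison orders them correctly.
    """
    # >= rather than > so records sharing the boundary second are re-sent;
    # this assumes re-adding a document with the same unique key overwrites it.
    return [r for r in records if r[field] >= high_water_mark]
```

The first function finds the high-water mark in the restored backup; the second selects what has to be fed back in from whatever archive of the input data exists.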




-- 
Regards,

Ian Connor
1 Leighton St #723
Cambridge, MA 02141
Call Center Phone: +1 (714) 239 3875 (24 hrs)
Fax: +1(770) 818 5697
Skype: ian.connor


RE: solr as the data store

2009-01-28 Thread Feak, Todd
Although the idea that you will need to rebuild from scratch is
unlikely, you might want to fully understand the cost of recovery if you
*do* have to.

If it's incredibly expensive (in time or money), you need to keep that in
mind.
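
A back-of-the-envelope way to put a number on the time cost. The function name and the figures in the example are made up; plug in your own document count and measured throughput.

```python
def rebuild_hours(num_docs: int, docs_per_second: float) -> float:
    """Estimate wall-clock hours for a full extract-and-reindex pass."""
    if docs_per_second <= 0:
        raise ValueError("docs_per_second must be positive")
    return num_docs / docs_per_second / 3600.0
```

For instance, 50 million documents at a sustained 200 documents per second works out to roughly 69 hours, close to three days.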

-Todd



Re: solr as the data store

2009-01-28 Thread Otis Gospodnetic
This is perfectly fine. Of course, you lose any relational model. If you
don't have or don't need one, why not?

It used to be the case that backups of live Lucene indices were hard, so people
preferred having an RDBMS be the primary data source, the one they know how to
back up and maintain well.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch






Re: solr as the data store

2009-01-28 Thread Matthew Runo
One thing to keep in mind is that things like joins are impossible in
Solr, but easy in a database. So if you ever need to do things like run
reports, you're probably better off with a database to query,
unless you cover your bases very well in the Solr index.
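
One way to "cover your bases" is to resolve the join at index time and store the result flat. A minimal sketch, with invented order and customer records and invented field names:

```python
def denormalize(orders, customers):
    """Flatten an order-to-customer join into one document per order."""
    customers_by_id = {c["id"]: c for c in customers}
    docs = []
    for order in orders:
        customer = customers_by_id[order["customer_id"]]
        docs.append({
            "id": f"order-{order['id']}",
            "total": order["total"],
            # Customer fields are copied onto every order document, so a
            # report can filter or facet on them without a join.
            "customer_name": customer["name"],
            "customer_city": customer["city"],
        })
    return docs
```

The trade-off is that a change to a customer means re-indexing every order document that copied its fields.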


Thanks for your time!

Matthew Runo
Software Engineer, Zappos.com
mr...@zappos.com - 702-943-7833
