GData

2006-04-25 Thread jason rutherglen
http://jeremy.zawodny.com/blog/archives/006687.html

Here is a good blog entry with a talk on GData from someone who worked on it.  
The only thing I think Solr needs is faster replication, which could perhaps be 
done with a direct replication model, preferably transferring the segment 
files over HTTP instead of rsync, with rsync reserved for syncing the optimized 
index.  The only other thing GData does is versioning of the documents.



Re: GData

2006-04-25 Thread Yonik Seeley
On 4/25/06, jason rutherglen <[EMAIL PROTECTED]> wrote:
> Here is a good blog entry with a talk on GData from someone who worked on it. 
>  The only thing I think Solr needs is faster replication, which perhaps can 
> be done faster using a direct replication model, preferably over HTTP of the 
> segments files instead of rsync?

rsync should be very fast if you configure it to not checksum the
files, and just go by timestamp and size.  It will only transfer the
changed segments.  We get very good performance with this model.
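
As a rough illustration of the timestamp-and-size comparison described above,
here is a minimal Python sketch. It is not how rsync is implemented; it only
shows why skipping checksums is cheap: each file needs a stat call, not a full
read. The function name and the example file names are assumptions made for
illustration.

```python
import os


def changed_files(src_dir, dst_dir):
    """Return names of files in src_dir that look new or changed in dst_dir.

    A file counts as changed when it is missing from dst_dir or when its
    size or modification time (to the second) differs. File contents are
    never read, so no checksum is computed.
    """
    changed = []
    for name in sorted(os.listdir(src_dir)):
        src = os.path.join(src_dir, name)
        dst = os.path.join(dst_dir, name)
        if not os.path.isfile(src):
            continue  # this sketch ignores subdirectories
        if not os.path.exists(dst):
            changed.append(name)
            continue
        s, d = os.stat(src), os.stat(dst)
        if s.st_size != d.st_size or int(s.st_mtime) != int(d.st_mtime):
            changed.append(name)
    return changed
```

Because Lucene writes new segments as new files and leaves existing segment
files untouched, a comparison like this would normally flag only the segments
added since the last sync.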

> Reserving rsync for the optimized index sync.  The only other thing GData
> does is versioning of the documents.

Hmmm, that might require some thought...  I guess it depends on what
GData allows you to do with the different versions.

-Yonik


Re: GData

2006-04-25 Thread jason rutherglen
Also, they have created what look like fine-grained, date-based queries for use 
with the Calendar application.  Perhaps a predefined, out-of-the-box way of 
handling date-range queries in Solr would be useful.  
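
To make the date-range idea concrete, here is a small Python sketch that
builds a range query in Lucene's `field:[start TO end]` syntax. The field name
`event_date` and the choice of ISO 8601 UTC timestamps as the indexed format
are assumptions for illustration, not something this thread settles.

```python
from datetime import datetime


def date_range_query(field, start, end):
    """Build an inclusive range query string over a date field.

    Assumes the field is indexed with ISO 8601 UTC timestamps, so the
    datetimes passed in are treated as already being in UTC.
    """
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return "{}:[{} TO {}]".format(field, start.strftime(fmt), end.strftime(fmt))


# Example: everything on 25 April 2006 (hypothetical field name).
query = date_range_query(
    "event_date", datetime(2006, 4, 25), datetime(2006, 4, 25, 23, 59, 59)
)
```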



Re: GData

2006-04-25 Thread Erik Hatcher

Anyone here an old timer Apple Newton user?

I've been really getting jazzed on the ideas I'm getting thanks to  
Solr and contemplating Ruby integration.  I've been re-reading my  
dusty "Programming for the Newton" (using Windows!) book.  The  
discussion of the Newton "soup" data storage mechanism is very much  
on track with what I'd like to implement from the Ruby side of things  
using Solr as the "soups" storage.   I think more needs to be done  
with Solr than just faster replication to enable a flexible schema  
scenario.  Back to the Newton analogy, each application registers its  
own schema but everything fits into a common storage system allowing  
a unified querying mechanism.  Merging queries/data across soups is  
not done except at the application level, but I can see in the Solr  
case that custom handlers can facilitate this sort of thing to free  
the client from having to deal with the massive amount of data.


I've been mulling over the idea of having a single Solr instance  
morph into a system that can handle multiple client-defined schemas  
(why not?  Lucene itself can handle it) rather than a static XML file  
and allow the schemas themselves to be retrievable (yes, I know it  
already is).  I'm still talking about a single Lucene index, but with  
each Document given a "soup" name field and filters automatically  
available to single out a specific soup.
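
A toy Python sketch of the single-index, soup-field idea follows. It uses an
in-memory list in place of a Lucene index, and the class and method names are
invented for illustration; the point is only that every document carries a
"soup" field and that a filter on that field singles out one soup.

```python
class SoupIndex:
    """In-memory stand-in for one index shared by several soups."""

    def __init__(self):
        self.docs = []

    def add(self, soup, fields):
        """Store a document, tagging it with the soup it belongs to."""
        doc = dict(fields)
        doc["soup"] = soup
        self.docs.append(doc)

    def search(self, soup=None, **criteria):
        """Return documents matching all criteria, optionally within one soup."""
        hits = []
        for doc in self.docs:
            if soup is not None and doc["soup"] != soup:
                continue  # the automatic soup filter
            if all(doc.get(key) == value for key, value in criteria.items()):
                hits.append(doc)
        return hits
```

Leaving `soup` unset searches across all soups at once, which is the unified
querying case; passing it restricts the search to one application's data.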


Make sense?  I think the GData thing fits with the loosely defined  
schema scenario as well.


Thoughts?

I was going to wait until my thoughts were more gelled on this topic,  
but the GData thread brought me out of my cave earlier.


Erik





[no subject]

2006-04-25 Thread jason rutherglen
http://code.google.com/apis/gdata/protocol.html#Optimistic-concurrency

The versioning is for updates only.  
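
For readers unfamiliar with optimistic concurrency, here is a minimal Python
sketch of the general technique: an update succeeds only if the caller names
the version it last read. The integer version numbers, class names, and
exception are assumptions for illustration and do not reproduce the GData wire
protocol.

```python
class VersionConflict(Exception):
    """Raised when an update is based on a stale version."""


class VersionedStore:
    def __init__(self):
        self._entries = {}  # entry id -> (version, data)

    def insert(self, entry_id, data):
        self._entries[entry_id] = (1, data)
        return 1

    def get(self, entry_id):
        return self._entries[entry_id]

    def update(self, entry_id, expected_version, data):
        """Replace the entry only if it is still at expected_version."""
        current_version, _ = self._entries[entry_id]
        if current_version != expected_version:
            raise VersionConflict(
                "entry %s is at version %d, not %d"
                % (entry_id, current_version, expected_version)
            )
        new_version = current_version + 1
        self._entries[entry_id] = (new_version, data)
        return new_version
```

A client that loses the race gets a conflict, re-reads the entry, and retries,
so no locks are held between the read and the write.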



Re: GData

2006-04-25 Thread Chris Hostetter

: I've been mulling over the idea of having a single Solr instance
: morph into a system that can handle multiple client-defined schemas
: (why not?  Lucene itself can handle it) rather than a static XML file
: and allow the schemas themselves to be retrievable (yes, I know it
: already is).  I'm still talking about a single Lucene index, but with
: each Document given a "soup" name field and filters automatically
: available to single out a specific soup.

Given the flexibility of dynamicFields, I think we're 99% of the way there
-- all we'd need is support for  and
then you could define a "soup" schema with nothing but dynamic fields
(one per datatype/stored/index triple you care about) and a few common
fields for partitioning and generic text searching.
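
To show how a schema made mostly of dynamic fields could resolve field names,
here is a small Python sketch. The suffix patterns (`*_i`, `*_s`, and so on),
the type names, and the first-match rule are assumptions for illustration, not
a copy of any shipped Solr schema or of Solr's matching logic.

```python
# Hypothetical dynamic field patterns, one per datatype the soup cares about.
DYNAMIC_FIELDS = [
    ("*_i", "integer"),
    ("*_s", "string"),
    ("*_t", "text"),
    ("*_dt", "date"),
]

# A few common fields shared by every soup.
STATIC_FIELDS = {"id": "string", "soup": "string", "text": "text"}


def resolve_field_type(name, static_fields, dynamic_fields):
    """Return the type for a field name, or None if nothing matches.

    Explicitly declared fields win; otherwise the first glob pattern with
    a leading or trailing '*' that matches the name decides the type.
    """
    if name in static_fields:
        return static_fields[name]
    for pattern, field_type in dynamic_fields:
        if pattern.startswith("*") and name.endswith(pattern[1:]):
            return field_type
        if pattern.endswith("*") and name.startswith(pattern[:-1]):
            return field_type
    return None
```

With a table like this, a client can introduce a new field such as `price_i`
without any schema change, because the suffix alone determines how it is
indexed.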


-Hoss



Re: GData

2006-04-25 Thread jason rutherglen
Ok, if Google is using the GData architecture to store the GCalendar data, 
assuming they are, how long do you think a write takes to show up on the 
GCalendar web site?  I think in this case something other than rsync may be a 
better option.



Re: GData

2006-04-25 Thread Ian Holsman
I noticed you guys have created a 'gdata-lucene' server in the SoC project.
Are you planning on doing this via Solr, or is it something brand new?

--i



--
[EMAIL PROTECTED] -- blog: http://feh.holsman.net/ -- PH: ++61-3-9818-0132

If everything seems under control, you're not going fast enough. -
Mario Andretti


Re: GData

2006-04-25 Thread Yonik Seeley
On 4/25/06, jason rutherglen <[EMAIL PROTECTED]> wrote:
> Ok, if Google is using the GData architecture to store the GCalendar data, 
> assuming they are, how long do you think a write takes to show up on the 
> GCalendar web site?  I think in this case something other than rsync may be a 
> better option.

rsync is just used as a replication transport, and I don't think it's
the limiting factor.

Opening a new IndexSearcher in Lucene is a relatively expensive
operation, esp when you factor in populating the fieldCache and field
norms.  You shouldn't be doing it too often (once a minute maybe).
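
One way to respect that limit is to throttle reopening, sketched below in
Python. The holder class, the `open_searcher` callback, and the 60-second
default are assumptions for illustration; the callback stands in for whatever
expensive work (opening the index, warming caches) a real reopen involves.

```python
import time


class ThrottledSearcherHolder:
    """Hands out a searcher, reopening it at most once per interval."""

    def __init__(self, open_searcher, min_interval_seconds=60.0,
                 clock=time.monotonic):
        self._open = open_searcher
        self._min_interval = min_interval_seconds
        self._clock = clock
        self._searcher = open_searcher()
        self._opened_at = clock()
        self._dirty = False

    def mark_updated(self):
        """Record that the index changed since the searcher was opened."""
        self._dirty = True

    def get(self):
        """Return the current searcher, reopening only if allowed and needed."""
        now = self._clock()
        if self._dirty and now - self._opened_at >= self._min_interval:
            self._searcher = self._open()
            self._opened_at = now
            self._dirty = False
        return self._searcher
```

The trade-off is visible in the code: an update made just after a reopen stays
invisible to searches for up to the full interval.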

If updates need to be immediately visible in conjunction with a high
update rate, a database is a better solution.

For Solr, I'd solve GData for the single-server case first, then go
about figuring out replication requirements.





--
-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server


Re: GData

2006-04-25 Thread Doug Cutting

Ian Holsman wrote:

> I noticed you guys have created a 'gdata-lucene' server in the SoC project.
> Are you planning on doing this via Solr, or is it something brand new?


We decided that doing this via Solr would probably make it more 
complicated.  A simple, standalone GData server built using just 
Lucene is what we had in mind for the SoC project.  This could then 
become a Lucene contrib module.


Doug


Re: GData

2006-04-25 Thread jason rutherglen
I would be curious, then, how the Google architecture works, given that it 
seems to combine search and database concepts and the Adam Bosworth talk 
seems to imply a replicated, redundant architecture like Solr's.  Is a faster 
method of loading or updating the IndexSearcher something that makes sense for 
Lucene?  Or should we just assume the Google architecture is a lot more complex?



Re: GData

2006-04-25 Thread Doug Cutting

jason rutherglen wrote:

> Is a faster method of loading or updating the IndexSearcher something that 
> makes sense for Lucene?


Yes.  Folks have developed incrementally updateable IndexSearchers 
before, but none is yet part of Lucene.



> Or just assume the Google architecture is a lot more complex.


That's probably a safe assumption.  Their architecture is designed to 
support real-time things like calendars, email, etc.  Search engines, 
Lucene's normal domain, are not usually real-time, but have indexing delays.


Doug


Re: GData

2006-04-25 Thread jason rutherglen
I tried to find the answer to this Nutch question in the docs and mailing list, 
so sorry if it's a bit naive.  Assuming Nutch distributes the index over many 
machines, does it use NutchFS as the Directory for IndexSearcher, or does it 
use RemoteMultiSearcher?  

> support real-time things like calendars, email, etc.  Search engines, 
> Lucene's normal domain, are not usually real-time, but have indexing delays.

True, however it may be an interesting direction to go in.  They seem to make 
the information nearly immediately searchable.  Surely we can do the same.



Re: GData

2006-04-25 Thread jason rutherglen
Ah ok, think I found it: org.apache.nutch.indexer.FsDirectory no?

Couldn't this be used in Solr and distribute all the data rather than 
master/slave it?



Re: GData

2006-04-25 Thread Doug Cutting

jason rutherglen wrote:

> Ah ok, think I found it: org.apache.nutch.indexer.FsDirectory no?
>
> Couldn't this be used in Solr and distribute all the data rather than 
> master/slave it?


It's possible to search a Lucene index that lives in Hadoop's DFS, but 
not recommended.  It's very slow.  It's much faster to copy the index to 
a local drive.


The rsync approach, of only transmitting index diffs, is a very 
efficient way to distribute an index.  In particular, it supports 
scaling the number of *readers* well.


For read/write stuff (e.g. a calendar) such scaling might not be 
paramount.  Rather, you might be happy to route all requests for a 
particular calendar to a particular server.  The index/database could 
still be somehow replicated/synced, in case that server dies, but a 
single server can probably handle all requests for a particular 
index/database.  And keeping things coherent is much simpler in this case.
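
The routing idea in the last paragraph can be sketched in a few lines of
Python. The server names and the use of an MD5 hash are assumptions for
illustration; any stable mapping from calendar ID to server would serve the
same purpose.

```python
import hashlib


def server_for(calendar_id, servers):
    """Pick the one server that owns all requests for this calendar.

    Uses a stable hash so every front end maps the same calendar ID to
    the same server without any shared state.
    """
    digest = hashlib.md5(calendar_id.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


# Hypothetical server pool.
SERVERS = ["index-a", "index-b", "index-c"]
owner = server_for("calendar-42", SERVERS)
```

Since reads and writes for one calendar always land on the same server, that
server's index is the only copy that has to be current, which is what makes
coherence simple. Note that plain modulo hashing reassigns most calendars when
the pool size changes, so a real deployment would need a plan for that.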


Doug