Re: [PROPOSAL] index server project

2006-11-21 Thread Bob Carpenter

Doug Cutting wrote:
It seems that Nutch and Solr would benefit from a shared index serving 
infrastructure.  

> ...
An RPC mechanism would be used to communicate between nodes (probably 
Hadoop's).  The system would be configured with a single master node 
that keeps track of where indexes are located, and a number of slave 
nodes that would maintain, search and replicate indexes.  Clients would 
talk to the master to find out which indexes to search or update, then 
they'll talk directly to slaves to perform searches and updates.

...
Does this make sense?  Does it sound like it would be useful to Solr? To 
Nutch?  To others?  Who would be interested and able to work on it?


Is there any way this could be generalized so that resources
other than Lucene indexes could be packaged up and distributed?


The reason I ask is that we have customers who are using
Lucene and SOLR and would like to pass other bits of their
applications around in the same way, including things we've
built from indexed data like spelling checkers, background
models for statistically interesting phrase detectors, statistical
models for topic/tag classifiers that get retrained as users
add more tags, language identifiers, etc.

From what I understand of Doug's proposal as well as
what I've seen in SOLR, there's not much that's actually
Lucene-specific about all this client/master/slave synching
other than that the data's a Lucene index.

I imagine this could be done with a generalization of the
kinds of callbacks found in SOLR, or by making what gets
passed around configurable in the proposed index server
project.

I'd be happy to test and help with API-level design/doc;  I
don't know much about distribution mechanics, though, which
is why I'm so interested in this high level abstraction.

- Bob Carpenter
  Alias-i


Re: [PROPOSAL] index server project

2006-11-06 Thread Stefan Groschupf

Hi,

Do people think we are already at a stage where we can set up some
basic infrastructure, like a mailing list and wiki, and move the
discussion to the new mailing list?  Maybe set up an incubator project?


I would be happy to help with such basic tasks.

Stefan



Am 31.10.2006 um 22:03 schrieb Yonik Seeley:


On 10/30/06, Doug Cutting <[EMAIL PROTECTED]> wrote:

Yonik Seeley wrote:
> On 10/18/06, Doug Cutting <[EMAIL PROTECTED]> wrote:
>> We assume that, within an index, a file with a given name is written
>> only once.
>
> Is this necessary, and will we need the lockless patch (that avoids
> renaming or rewriting *any* files), or is Lucene's current index
> behavior sufficient?

It's not strictly required, but it would make index synchronization a
lot simpler. Yes, I was assuming the lockless patch would be committed
to Lucene before this project gets very far.  Something more than that
would be required in order to keep old versions, but this could be as
simple as a Directory subclass that refuses to remove files for a time.


Or a snapshot (hard links) mechanism.
Lucene would also need a way to open a specific index version (rather
than just the latest), but I guess that could also be hacked into
Directory by hiding later "segments" files (assumes lockless is
committed).

> It's unfortunate the master needs to be involved on every document add.


That should not normally be the case.


Ahh... I had assumed that "id" in the following method was document id:

 IndexLocation getUpdateableIndex(String id);

I see now it's index id.

But what is index id exactly?  Looking at the example API you laid
down, it must be a single physical index (as opposed to a logical
index).  In which case, is it entirely up to the client to manage
multi-shard indices?  For example, if we had a "photo" index broken
up into 3 shards, each shard would have a separate index id and it
would be up to the client to know this, and to query across the
different "photo0", "photo1", "photo2" indices.  The master would
have no clue those indices were related.  Hmmm, that doesn't work
very well for deletes though.

It seems like there should be the concept of a logical index, that is
composed of multiple shards, and each shard has multiple copies.

Or were you thinking that a cluster would only contain a single
logical index, and hence all different index ids are simply different
shards of that single logical index?  That would seem to be consistent
with ClientToMasterProtocol .getSearchableIndexes() lacking an id
argument.


I was not imagining a real-time system, where the next query after a
document is added would always include that document.  Is that a
requirement?  That's harder.


Not real-time, but it would be nice if we kept it close to what Lucene
can currently provide.
Most people seem fine with a latency of minutes.

At this point I'm mostly trying to see if this functionality would meet
the needs of Solr, Nutch and others.



It depends on the project scope and how extensible things are.
It seems like the master would be a WAR, capable of running stand-alone.

What about index servers (slaves)?  Would this project include just
the interfaces to be implemented by Solr/Nutch nodes, some common
implementation code behind the interfaces in the form of a library, or
also complete standalone WARs?

I'd need to be able to extend the ClientToSlave protocol to add
additional methods for Solr (for passing in extra parameters and
returning various extra data such as facets, highlighting, etc).

Must we include a notion of document identity and/or document version in
the mechanism? Would that facilitate updates and coherency?


It doesn't need to be in the interfaces I don't think, so it depends
on the scope of the index server implementations.

-Yonik



~~~
101tec Inc.
search tech for web 2.1
Menlo Park, California
http://www.101tec.com





Re: [PROPOSAL] index server project

2006-10-31 Thread Yonik Seeley

On 10/30/06, Doug Cutting <[EMAIL PROTECTED]> wrote:

Yonik Seeley wrote:
> On 10/18/06, Doug Cutting <[EMAIL PROTECTED]> wrote:
>> We assume that, within an index, a file with a given name is written
>> only once.
>
> Is this necessary, and will we need the lockless patch (that avoids
> renaming or rewriting *any* files), or is Lucene's current index
> behavior sufficient?

It's not strictly required, but it would make index synchronization a
lot simpler. Yes, I was assuming the lockless patch would be committed
to Lucene before this project gets very far.  Something more than that
would be required in order to keep old versions, but this could be as
simple as a Directory subclass that refuses to remove files for a time.


Or a snapshot (hard links) mechanism.
Lucene would also need a way to open a specific index version (rather
than just the latest), but I guess that could also be hacked into
Directory by hiding later "segments" files (assumes lockless is
committed).


> It's unfortunate the master needs to be involved on every document add.

That should not normally be the case.


Ahh... I had assumed that "id" in the following method was document id:
 IndexLocation getUpdateableIndex(String id);

I see now it's index id.

But what is index id exactly?  Looking at the example API you laid
down, it must be a single physical index (as opposed to a logical
index).  In which case, is it entirely up to the client to manage
multi-shard indices?  For example, if we had a "photo" index broken
up into 3 shards, each shard would have a separate index id and it
would be up to the client to know this, and to query across the
different "photo0", "photo1", "photo2" indices.  The master would
have no clue those indices were related.  Hmmm, that doesn't work
very well for deletes though.

It seems like there should be the concept of a logical index, that is
composed of multiple shards, and each shard has multiple copies.

Or were you thinking that a cluster would only contain a single
logical index, and hence all different index ids are simply different
shards of that single logical index?  That would seem to be consistent
with ClientToMasterProtocol .getSearchableIndexes() lacking an id
argument.
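
The logical-index idea above can be pictured as a small data model: one logical name maps to shards, and each shard maps to the slaves holding a copy. This is only a sketch under stated assumptions; `LogicalIndexSketch`, its methods and the host names are hypothetical and not part of the proposed protocols.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model, not part of the proposed protocols:
// one logical index -> several shards -> several copies of each shard.
class LogicalIndexSketch {
    private final String name;
    // shard id (e.g. "photo0") -> slaves holding a copy of that shard
    private final Map<String, List<InetSocketAddress>> shards = new LinkedHashMap<>();

    LogicalIndexSketch(String name) {
        this.name = name;
    }

    String name() {
        return name;
    }

    void addCopy(String shardId, InetSocketAddress slave) {
        shards.computeIfAbsent(shardId, k -> new ArrayList<>()).add(slave);
    }

    // A search needs exactly one copy of every shard; this sketch simply
    // takes the first registered copy and does no load balancing.
    List<InetSocketAddress> searchTargets() {
        List<InetSocketAddress> targets = new ArrayList<>();
        for (List<InetSocketAddress> copies : shards.values()) {
            targets.add(copies.get(0));
        }
        return targets;
    }

    // The shard ids that make up this logical index, in registration order.
    List<String> shardIds() {
        return new ArrayList<>(shards.keySet());
    }
}
```

With a structure like this on the master, a delete against "photo" could be fanned out to every shard without the client having to know the names "photo0", "photo1" and "photo2".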


I was not imagining a real-time system, where the next query after a
document is added would always include that document.  Is that a
requirement?  That's harder.


Not real-time, but it would be nice if we kept it close to what Lucene
can currently provide.
Most people seem fine with a latency of minutes.


At this point I'm mostly trying to see if this functionality would meet
the needs of Solr, Nutch and others.



It depends on the project scope and how extensible things are.
It seems like the master would be a WAR, capable of running stand-alone.
What about index servers (slaves)?  Would this project include just
the interfaces to be implemented by Solr/Nutch nodes, some common
implementation code behind the interfaces in the form of a library, or
also complete standalone WARs?

I'd need to be able to extend the ClientToSlave protocol to add
additional methods for Solr (for passing in extra parameters and
returning various extra data such as facets, highlighting, etc).


Must we include a notion of document identity and/or document version in
the mechanism? Would that facilitate updates and coherency?


It doesn't need to be in the interfaces I don't think, so it depends
on the scope of the index server implementations.

-Yonik


Re: [PROPOSAL] index server project

2006-10-30 Thread Doug Cutting

Yonik Seeley wrote:

On 10/18/06, Doug Cutting <[EMAIL PROTECTED]> wrote:

We assume that, within an index, a file with a given name is written
only once.


Is this necessary, and will we need the lockless patch (that avoids
renaming or rewriting *any* files), or is Lucene's current index
behavior sufficient?


It's not strictly required, but it would make index synchronization a 
lot simpler.  Yes, I was assuming the lockless patch would be committed 
to Lucene before this project gets very far.  Something more than that 
would be required in order to keep old versions, but this could be as 
simple as a Directory subclass that refuses to remove files for a time.
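
The "refuses to remove files for a time" idea can be sketched independently of Lucene. The class below is hypothetical and is not Lucene's Directory API: it only records delete requests and reports which files are past a grace period. The clock is passed in explicitly so the behavior is easy to check.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not Lucene's Directory API: delete requests are
// recorded and only honored after a grace period has passed.
class DelayedDeleter {
    private final long graceMillis;
    // file name -> time at which deletion was first requested
    private final Map<String, Long> pending = new HashMap<>();

    DelayedDeleter(long graceMillis) {
        this.graceMillis = graceMillis;
    }

    // Called at the point where a file would normally be removed immediately.
    void requestDelete(String fileName, long nowMillis) {
        pending.putIfAbsent(fileName, nowMillis);
    }

    // Returns the files whose grace period has expired and forgets them;
    // the caller would then physically delete those files.
    List<String> expire(long nowMillis) {
        List<String> expired = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = pending.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis - entry.getValue() >= graceMillis) {
                expired.add(entry.getKey());
                it.remove();
            }
        }
        return expired;
    }
}
```

A Directory subclass could delegate its delete call to something like `requestDelete` and run `expire` periodically, so that older index versions stay readable while slaves are still copying them.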



The search side seems straightforward enough, but I haven't totally
figured out how the update side should work.


The master should be out of the loop as much as possible.  One approach 
is that clients randomly assign documents to indexes and send the 
updates directly to the indexing node.  Alternately, clients might index 
locally, then ship the updates to a node packaged as an index.  That was 
the intent of the addIndex method.



One potential problem is a document overwrite implemented as a delete
then an add.
More than one client doing this for the same document could result in
0 or 2 documents, instead of 1.  I guess clients will just need to be
relatively coordinated in their activities.


Good point.  Either the two clients must coordinate, to make sure that 
they're not updating the same document at the same time, or use a 
strategy where updates are routed to the slave that contained the old 
version of the document.  That would require a broadcast query to figure 
out which slave that is.
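
The overwrite race can be reproduced with a toy model. The sketch below assumes a single in-memory index in which an overwrite is "delete by id, then add"; `OverwriteRaceSketch` is a hypothetical name, and only the two-copy outcome is shown.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of one index in which an overwrite is "delete by id, then add".
class OverwriteRaceSketch {
    private final List<String> docs = new ArrayList<>();

    void delete(String id) {
        docs.removeIf(id::equals);
    }

    void add(String id) {
        docs.add(id);
    }

    int copies(String id) {
        int n = 0;
        for (String d : docs) {
            if (d.equals(id)) {
                n++;
            }
        }
        return n;
    }

    // Two clients overwrite the same document one after the other:
    // exactly one copy remains.
    static int serial() {
        OverwriteRaceSketch index = new OverwriteRaceSketch();
        index.add("doc1");       // the old version
        index.delete("doc1");    // client A
        index.add("doc1");       // client A
        index.delete("doc1");    // client B
        index.add("doc1");       // client B
        return index.copies("doc1");
    }

    // Both deletes happen before either add: two copies remain.
    static int interleaved() {
        OverwriteRaceSketch index = new OverwriteRaceSketch();
        index.add("doc1");       // the old version
        index.delete("doc1");    // client A
        index.delete("doc1");    // client B
        index.add("doc1");       // client A
        index.add("doc1");       // client B
        return index.copies("doc1");
    }
}
```

Routing both updates to the slave that holds the old version does not remove the race by itself, but it lets that one slave serialize the two delete/add pairs.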



It's unfortunate the master needs to be involved on every document add.


That should not normally be the case.  Clients can cache the set of 
writable index locations and directly submit new documents without 
involving the master.



If deletes were broadcast, and documents could go to any partition,
that would be one way around it (with the downside of a less powerful
master that could implement certain distribution policies).
Another way to lessen the master-in-the-middle cost is to make sure
one can aggregate small requests:
   IndexLocation[] getUpdateableIndex(String[] id);


I'd assumed that the updateable version of an index does not move around 
very often.  Perhaps a lease mechanism is required.  For example, a call 
to getUpdateableIndex might be valid for ten minutes.
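
A lease of this kind could be kept on the client side roughly as follows. This is a sketch under assumptions: the master call is stood in for by a plain function, locations are plain strings rather than IndexLocation objects, and `LeasedLocationCache` is a hypothetical name.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical client-side cache: a location returned by the master is
// treated as a lease and reused until it expires.
class LeasedLocationCache {
    private static final class Lease {
        final String location;
        final long expiresAtMillis;

        Lease(String location, long expiresAtMillis) {
            this.location = location;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Lease> leases = new HashMap<>();
    private final Function<String, String> master; // stands in for getUpdateableIndex(id)
    private final long leaseMillis;
    int masterCalls = 0; // exposed only so the effect of caching can be observed

    LeasedLocationCache(Function<String, String> master, long leaseMillis) {
        this.master = master;
        this.leaseMillis = leaseMillis;
    }

    // Returns the cached location while the lease is valid; otherwise asks
    // the master again and starts a new lease.
    String updateableIndex(String indexId, long nowMillis) {
        Lease lease = leases.get(indexId);
        if (lease == null || nowMillis >= lease.expiresAtMillis) {
            masterCalls++;
            lease = new Lease(master.apply(indexId), nowMillis + leaseMillis);
            leases.put(indexId, lease);
        }
        return lease.location;
    }
}
```

With a ten-minute lease (600,000 ms), a client adding documents continuously would contact the master about once every ten minutes per index rather than once per document.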


We might consider a delete() on the master interface too.  That way it
could

1) hide the delete policy (broadcast or direct-to-server-that-has-doc)
2) potentially do some batching of deletes
3) simply do the delete locally if there is a single index partition
and this is a combination master/searcher


I'm reluctant to put any frequently-made call on the master.  I'd prefer 
to keep the master only involved at an executive level, with all 
per-document and per-query traffic going directly from client to slave.



It seems like the master might want to be involved in commits too, or
maybe we just rely on the slave to master heartbeat to kick off
immediately after a commit so that index replication can be initiated?


I like the latter approach.  New versions are only published as 
frequently as clients poll the master for updated IndexLocations. 
Clients keep a cache of both readable and updatable index locations that 
are periodically refreshed.
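
The heartbeat-driven approach can be sketched from the master's side. The model below is a simplification and an assumption throughout: versions are plain integers keyed by index id, the reply is a list of ids rather than IndexLocation objects, and `HeartbeatSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical master-side bookkeeping driven only by slave heartbeats.
class HeartbeatSketch {
    // index id -> newest version any slave has reported so far
    private final Map<String, Integer> latest = new HashMap<>();

    // A slave reports the versions it can currently serve (id -> version).
    // The master records anything newer, then answers with the ids for
    // which the reporting slave is behind and should replicate.
    List<String> heartbeat(Map<String, Integer> searchable) {
        for (Map.Entry<String, Integer> e : searchable.entrySet()) {
            latest.merge(e.getKey(), e.getValue(), Integer::max);
        }
        List<String> stale = new ArrayList<>();
        for (Map.Entry<String, Integer> e : searchable.entrySet()) {
            if (e.getValue() < latest.get(e.getKey())) {
                stale.add(e.getKey());
            }
        }
        return stale;
    }

    // What a polling client would be told is the current version of an
    // index, or -1 if no slave has reported it yet.
    int publishedVersion(String indexId) {
        return latest.getOrDefault(indexId, -1);
    }
}
```

In this model a commit becomes visible to the master at the committing slave's next heartbeat, and to clients at their next poll, which matches the "not real-time" expectation discussed in this thread.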


I was not imagining a real-time system, where the next query after a 
document is added would always include that document.  Is that a 
requirement?  That's harder.


At this point I'm mostly trying to see if this functionality would meet 
the needs of Solr, Nutch and others.


Must we include a notion of document identity and/or document version in 
the mechanism?  Would that facilitate updates and coherency?


In Nutch a typical case is that you have a bunch of URLs with content 
that may-or-may-not have been previously indexed.  The approach I'm 
currently leaning towards is that we'd broadcast the deletions of all of 
these to all slaves, then add them to randomly assigned indexes. 
In Nutch multiple clients would naturally be coordinated, since each url 
is represented only once in each update cycle.
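
The broadcast-delete-then-add cycle described above can be sketched with in-memory stand-ins for the slaves. Everything here is an assumption made for illustration: each slave is modeled as a set of URLs, and `BroadcastUpdateSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Hypothetical model: each slave is just the set of URLs it has indexed.
class BroadcastUpdateSketch {
    private final List<Set<String>> slaves = new ArrayList<>();
    private final Random random;

    BroadcastUpdateSketch(int slaveCount, long seed) {
        for (int i = 0; i < slaveCount; i++) {
            slaves.add(new HashSet<>());
        }
        this.random = new Random(seed);
    }

    // One update cycle for a batch of URLs that may or may not have been
    // indexed before.
    void update(List<String> urls) {
        // Step 1: broadcast the deletions to every slave.
        for (Set<String> slave : slaves) {
            slave.removeAll(urls);
        }
        // Step 2: add each URL to a randomly chosen slave.
        for (String url : urls) {
            slaves.get(random.nextInt(slaves.size())).add(url);
        }
    }

    // Number of slaves currently holding the given URL.
    int copies(String url) {
        int n = 0;
        for (Set<String> slave : slaves) {
            if (slave.contains(url)) {
                n++;
            }
        }
        return n;
    }
}
```

Because each URL appears only once per cycle and the deletions are applied before the adds, repeated cycles leave exactly one copy of each URL, whichever slave it lands on.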


Doug


Re: [PROPOSAL] index server project

2006-10-21 Thread Yonik Seeley

On 10/18/06, Doug Cutting <[EMAIL PROTECTED]> wrote:

We assume that, within an index, a file with a given name is written
only once.


Is this necessary, and will we need the lockless patch (that avoids
renaming or rewriting *any* files), or is Lucene's current index
behavior sufficient?

I like the explicit index version and keeping the last few versions
around.  The whole idea of a master seems to lessen the amount of
manual configuration in large clusters too.
The search side seems straightforward enough, but I haven't totally
figured out how the update side should work.


Deletions could be broadcast to all slaves.  That would probably be fast
enough.


Hmmm, that does allow one to move documents around the cluster and
more easily resize things.

One potential problem is a document overwrite implemented as a delete
then an add.
More than one client doing this for the same document could result in
0 or 2 documents, instead of 1.  I guess clients will just need to be
relatively coordinated in their activities.


 Alternately, indexes could be partitioned by a hash of each
document's unique id, permitting deletions to be routed to the
appropriate slave.


A hash is nice, but then you can't resize the number of partitions
your index is split into.
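
The resize problem is easy to see with plain modulo hashing. The sketch below assumes routing by `hashCode()` modulo the partition count; `HashPartitionSketch` and the sample ids are hypothetical.

```java
// Hypothetical routing by hash of the document's unique id.
class HashPartitionSketch {
    // Partition that owns a document id when there are `partitions` partitions.
    static int partitionFor(String docId, int partitions) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(docId.hashCode(), partitions);
    }

    // Counts how many of the given ids would live on a different partition
    // if the partition count changed from oldCount to newCount.
    static int moved(String[] docIds, int oldCount, int newCount) {
        int moved = 0;
        for (String id : docIds) {
            if (partitionFor(id, oldCount) != partitionFor(id, newCount)) {
                moved++;
            }
        }
        return moved;
    }
}
```

Going from 3 to 4 partitions changes the owner of some ids, so a delete routed with the new count can miss documents that were indexed with the old count unless they are moved first.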

It's unfortunate the master needs to be involved on every document add.
If deletes were broadcast, and documents could go to any partition,
that would be one way around it (with the downside of a less powerful
master that could implement certain distribution policies).
Another way to lessen the master-in-the-middle cost is to make sure
one can aggregate small requests:
   IndexLocation[] getUpdateableIndex(String[] id);

We might consider a delete() on the master interface too.  That way it could
1) hide the delete policy (broadcast or direct-to-server-that-has-doc)
2) potentially do some batching of deletes
3) simply do the delete locally if there is a single index partition
and this is a combination master/searcher

It seems like the master might want to be involved in commits too, or
maybe we just rely on the slave to master heartbeat to kick off
immediately after a commit so that index replication can be initiated?


Does this make sense?  Does it sound like it would be useful to Solr?
To Nutch?  To others?  Who would be interested and able to work on it?


Still interested, and able :-)

-Yonik


Re: [PROPOSAL] index server project

2006-10-20 Thread Stefan Groschupf

Hi,

The major goal is scale, right? A distributed server provides more oomph
than a single-node server can.


Another important goal from my point of view would be index  
management, like index updates during production.


Stefan 


Re: [Fwd: [PROPOSAL] index server project]

2006-10-20 Thread Otis Gospodnetic
Damn Y! mail shortcut.
The link to the project is in my Lucene group:  http://www.simpy.com/group/363

Otis

----- Original Message -----
From: Alexandru Popescu <[EMAIL PROTECTED]>
To: general@lucene.apache.org
Sent: Thursday, October 19, 2006 10:19:00 AM
Subject: Re: [Fwd: [PROPOSAL] index server project]

I am not sure this is (somehow) related, but I think I have noticed
some project on a Sun contest (it was the big prize winner). I cannot
retrieve it now, but hopefully somebody else will.

./alex
--
.w( the_mindstorm )p.


On 10/19/06, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
> Hi Doug,
>
> we discussed the need of such a tool several times internally and
> developed some workarounds for nutch, so I would be definitely
> interested to contribute to such a project.
> Having a separated project that depends on hadoop would be the best
> case for our usecases.
>
> Best,
> Stefan
>
>
>
> Am 18.10.2006 um 23:35 schrieb Doug Cutting:
>
> > FYI, I just pitched a new project you might be interested in on
> > [EMAIL PROTECTED]  Dunno if you subscribe to that list, so I'm
> > spamming you.  If it sounds interesting, please reply there.  My
> > management at Y! is interested in this, so I'm 'in'.
> >
> > Doug
> >
> >  Original Message 
> > Subject: [PROPOSAL] index server project
> > Date: Wed, 18 Oct 2006 14:17:30 -0700
> > From: Doug Cutting <[EMAIL PROTECTED]>
> > Reply-To: general@lucene.apache.org
> > To: general@lucene.apache.org
> >
> > It seems that Nutch and Solr would benefit from a shared index serving
> > infrastructure.  Other Lucene-based projects might also benefit from
> > this.  So perhaps we should start a new project to build such a thing.
> > This could start either in java/contrib, or as a separate sub-project,
> > depending on interest.
> >
> > Here are some quick ideas about how this might work.
> >
> > An RPC mechanism would be used to communicate between nodes (probably
> > Hadoop's).  The system would be configured with a single master node
> > that keeps track of where indexes are located, and a number of slave
> > nodes that would maintain, search and replicate indexes.  Clients
> > would
> > talk to the master to find out which indexes to search or update, then
> > they'll talk directly to slaves to perform searches and updates.
> >
> > Following is an outline of how this might look.
> >
> > We assume that, within an index, a file with a given name is written
> > only once.  Index versions are sets of files, and a new version of an
> > index is likely to share most files with the prior version.  Versions
> > are numbered.  An index server should keep old versions of each index
> > for a while, not immediately removing old files.
> >
> > public class IndexVersion {
> >   String Id;   // unique name of the index
> >   int version; // the version of the index
> > }
> >
> > public class IndexLocation {
> >   IndexVersion indexVersion;
> >   InetSocketAddress location;
> > }
> >
> > public interface ClientToMasterProtocol {
> >   IndexLocation[] getSearchableIndexes();
> >   IndexLocation getUpdateableIndex(String id);
> > }
> >
> > public interface ClientToSlaveProtocol {
> >   // normal update
> >   void addDocument(String index, Document doc);
> >   int[] removeDocuments(String index, Term term);
> >   void commitVersion(String index);
> >
> >   // batch update
> >   void addIndex(String index, IndexLocation indexToAdd);
> >
> >   // search
> >   SearchResults search(IndexVersion i, Query query, Sort sort, int n);
> > }
> >
> > public interface SlaveToMasterProtocol {
> >   // sends currently searchable indexes
> >   // receives updated indexes that we should replicate/update
> >   public IndexLocation[] heartbeat(IndexVersion[] searchableIndexes);
> > }
> >
> > public interface SlaveToSlaveProtocol {
> >   String[] getFileSet(IndexVersion indexVersion);
> >   byte[] getFileContent(IndexVersion indexVersion, String file);
> >   // based on experience in Hadoop, we probably wouldn't really use
> >   // RPC to send file content, but rather HTTP.
> > }
> >
> > The master thus maintains the set of indexes that are available for
> > search, keeps track of which slave should handle changes to an
> > index and
> > initiates index synchronization between slaves.  The master can be
> > configured to replicate indexes a specified number of times.
> >
> > The client library c

Re: [Fwd: [PROPOSAL] index server project]

2006-10-20 Thread Otis Gospodnetic
That's distributed indexed, built on top of Sun Grid.  The project won a $50K 
prize.


----- Original Message -----
From: Alexandru Popescu <[EMAIL PROTECTED]>
To: general@lucene.apache.org
Sent: Thursday, October 19, 2006 10:19:00 AM
Subject: Re: [Fwd: [PROPOSAL] index server project]

I am not sure this is (somehow) related, but I think I have noticed
some project on a Sun contest (it was the big prize winner). I cannot
retrieve it now, but hopefully somebody else will.

./alex
--
.w( the_mindstorm )p.


On 10/19/06, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
> Hi Doug,
>
> we discussed the need of such a tool several times internally and
> developed some workarounds for nutch, so I would be definitely
> interested to contribute to such a project.
> Having a separated project that depends on hadoop would be the best
> case for our usecases.
>
> Best,
> Stefan
>
>
>
> Am 18.10.2006 um 23:35 schrieb Doug Cutting:
>
> > FYI, I just pitched a new project you might be interested in on
> > [EMAIL PROTECTED]  Dunno if you subscribe to that list, so I'm
> > spamming you.  If it sounds interesting, please reply there.  My
> > management at Y! is interested in this, so I'm 'in'.
> >
> > Doug
> >
> >  Original Message 
> > Subject: [PROPOSAL] index server project
> > Date: Wed, 18 Oct 2006 14:17:30 -0700
> > From: Doug Cutting <[EMAIL PROTECTED]>
> > Reply-To: general@lucene.apache.org
> > To: general@lucene.apache.org
> >
> > It seems that Nutch and Solr would benefit from a shared index serving
> > infrastructure.  Other Lucene-based projects might also benefit from
> > this.  So perhaps we should start a new project to build such a thing.
> > This could start either in java/contrib, or as a separate sub-project,
> > depending on interest.
> >
> > Here are some quick ideas about how this might work.
> >
> > An RPC mechanism would be used to communicate between nodes (probably
> > Hadoop's).  The system would be configured with a single master node
> > that keeps track of where indexes are located, and a number of slave
> > nodes that would maintain, search and replicate indexes.  Clients
> > would
> > talk to the master to find out which indexes to search or update, then
> > they'll talk directly to slaves to perform searches and updates.
> >
> > Following is an outline of how this might look.
> >
> > We assume that, within an index, a file with a given name is written
> > only once.  Index versions are sets of files, and a new version of an
> > index is likely to share most files with the prior version.  Versions
> > are numbered.  An index server should keep old versions of each index
> > for a while, not immediately removing old files.
> >
> > public class IndexVersion {
> >   String Id;   // unique name of the index
> >   int version; // the version of the index
> > }
> >
> > public class IndexLocation {
> >   IndexVersion indexVersion;
> >   InetSocketAddress location;
> > }
> >
> > public interface ClientToMasterProtocol {
> >   IndexLocation[] getSearchableIndexes();
> >   IndexLocation getUpdateableIndex(String id);
> > }
> >
> > public interface ClientToSlaveProtocol {
> >   // normal update
> >   void addDocument(String index, Document doc);
> >   int[] removeDocuments(String index, Term term);
> >   void commitVersion(String index);
> >
> >   // batch update
> >   void addIndex(String index, IndexLocation indexToAdd);
> >
> >   // search
> >   SearchResults search(IndexVersion i, Query query, Sort sort, int n);
> > }
> >
> > public interface SlaveToMasterProtocol {
> >   // sends currently searchable indexes
> >   // receives updated indexes that we should replicate/update
> >   public IndexLocation[] heartbeat(IndexVersion[] searchableIndexes);
> > }
> >
> > public interface SlaveToSlaveProtocol {
> >   String[] getFileSet(IndexVersion indexVersion);
> >   byte[] getFileContent(IndexVersion indexVersion, String file);
> >   // based on experience in Hadoop, we probably wouldn't really use
> >   // RPC to send file content, but rather HTTP.
> > }
> >
> > The master thus maintains the set of indexes that are available for
> > search, keeps track of which slave should handle changes to an
> > index and
> > initiates index synchronization between slaves.  The master can be
> > configured to replicate indexes a specified number of times.
> >
> > The client library can cac

Re: [PROPOSAL] index server project

2006-10-19 Thread Yonik Seeley

On 10/19/06, Steven Parkes <[EMAIL PROTECTED]> wrote:

You mention partitioning of indexes, though mostly around delete. What
about scalability of corpus size?


Definitely in scope.  Solr already has scalability of search volume
via searchers behind a load balancer all getting their index from a
master.  The problem comes when an index is too big to get decent
latency for a single query, and that's when you need to partition the
index into "shards", to use Google terminology.


Would partitioning be effective for
that, too?


Yes, to a certain extent.  At some point you run into network
bandwidth issues if you go deep into rankings.


What about scalability of ingest rate?


As it relates to indexing, I think nutch already has that base covered.


What are you thinking, in terms of size? Is this a 10 node thing?


I'm personally interested in perhaps 10 to 20 index shards, with
multiple replicas of each shard for HA and query load scalability.


A 1000
node thing? More? Bigger is cool, but raises a lot of issues.


Should be possible, but I won't personally be looking for that.  I
think scaling effectively will be partially in the hands of the client
and how it chooses to merge results from shards.
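
Client-side merging of shard results might look like the sketch below. It assumes that scores from different shards are directly comparable, which is not guaranteed in practice; `ShardMergeSketch` and `Hit` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical client-side merge of per-shard result lists.
class ShardMergeSketch {
    static final class Hit {
        final String docId;
        final float score;

        Hit(String docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    // Merges the hits returned by each shard into the global top n,
    // highest score first. Assumes scores are comparable across shards.
    static List<Hit> mergeTopN(List<List<Hit>> perShard, int n) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> hits : perShard) {
            all.addAll(hits);
        }
        all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<>(all.subList(0, Math.min(n, all.size())));
    }
}
```

To serve ranks k through k+n, every shard has to return its own top k+n hits before the merge, which is where the bandwidth cost of going deep into the rankings comes from.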


How
dynamic?



Can nodes come and go?


Unplanned: yes.  HA is personally key for me.
Planned (adding capacity gracefully): it would be nice.  I actually
hadn't planned it for Solr.


Are you going to assume homogeneity of
nodes?


Hardware homogeneity?  That might be out of scope... I'd start off
without worrying about it in any case.


What about add/modify/delete to search visibility latency? Close to
batch/once-a-day or real-time?


Anywhere in between I'd think.  "Realtime" latencies of minutes or
longer are normally fine.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server


RE: [PROPOSAL] index server project

2006-10-19 Thread Steven Parkes
I like the idea. I'm trying to figure out, in broad strokes, the
overarching goals. Forgive me if this is obvious, I just want to be
clear.

The major goal is scale, right? A distributed server provides more oomph
than a single-node server can.

There are a number of dimensions in scale.

You mention replication of indexes, so scalability of search volume is
in scope, right?

You mention partitioning of indexes, though mostly around delete. What
about scalability of corpus size? Would partitioning be effective for
that, too?

What about scalability of ingest rate?

What are you thinking, in terms of size? Is this a 10 node thing? A 1000
node thing? More? Bigger is cool, but raises a lot of issues. How
dynamic? Can nodes come and go? Are you going to assume homogeneity of
nodes?

What about add/modify/delete to search visibility latency? Close to
batch/once-a-day or real-time?

I think it's definitely something people want. Actually, I think we
could answer these questions in different ways and for every answer,
we'd find people that would want it. But they would probably be
different people.


Re: [Fwd: [PROPOSAL] index server project]

2006-10-19 Thread Alexandru Popescu

I am not sure this is (somehow) related, but I think I have noticed
some project on a Sun contest (it was the big prize winner). I cannot
retrieve it now, but hopefully somebody else will.

./alex
--
.w( the_mindstorm )p.


On 10/19/06, Stefan Groschupf <[EMAIL PROTECTED]> wrote:

Hi Doug,

we discussed the need of such a tool several times internally and
developed some workarounds for nutch, so I would be definitely
interested to contribute to such a project.
Having a separated project that depends on hadoop would be the best
case for our usecases.

Best,
Stefan



Am 18.10.2006 um 23:35 schrieb Doug Cutting:

> FYI, I just pitched a new project you might be interested in on
> [EMAIL PROTECTED]  Dunno if you subscribe to that list, so I'm
> spamming you.  If it sounds interesting, please reply there.  My
> management at Y! is interested in this, so I'm 'in'.
>
> Doug
>
> ---- Original Message ----
> Subject: [PROPOSAL] index server project
> Date: Wed, 18 Oct 2006 14:17:30 -0700
> From: Doug Cutting <[EMAIL PROTECTED]>
> Reply-To: general@lucene.apache.org
> To: general@lucene.apache.org
>
> It seems that Nutch and Solr would benefit from a shared index serving
> infrastructure.  Other Lucene-based projects might also benefit from
> this.  So perhaps we should start a new project to build such a thing.
> This could start either in java/contrib, or as a separate sub-project,
> depending on interest.
>
> Here are some quick ideas about how this might work.
>
> An RPC mechanism would be used to communicate between nodes (probably
> Hadoop's).  The system would be configured with a single master node
> that keeps track of where indexes are located, and a number of slave
> nodes that would maintain, search and replicate indexes.  Clients
> would
> talk to the master to find out which indexes to search or update, then
> they'll talk directly to slaves to perform searches and updates.
>
> Following is an outline of how this might look.
>
> We assume that, within an index, a file with a given name is written
> only once.  Index versions are sets of files, and a new version of an
> index is likely to share most files with the prior version.  Versions
> are numbered.  An index server should keep old versions of each index
> for a while, not immediately removing old files.
>
> public class IndexVersion {
>   String Id;   // unique name of the index
>   int version; // the version of the index
> }
>
> public class IndexLocation {
>   IndexVersion indexVersion;
>   InetSocketAddress location;
> }
>
> public interface ClientToMasterProtocol {
>   IndexLocation[] getSearchableIndexes();
>   IndexLocation getUpdateableIndex(String id);
> }
>
> public interface ClientToSlaveProtocol {
>   // normal update
>   void addDocument(String index, Document doc);
>   int[] removeDocuments(String index, Term term);
>   void commitVersion(String index);
>
>   // batch update
>   void addIndex(String index, IndexLocation indexToAdd);
>
>   // search
>   SearchResults search(IndexVersion i, Query query, Sort sort, int n);
> }
>
> public interface SlaveToMasterProtocol {
>   // sends currently searchable indexes
>   // receives updated indexes that we should replicate/update
>   public IndexLocation[] heartbeat(IndexVersion[] searchableIndexes);
> }
>
> public interface SlaveToSlaveProtocol {
>   String[] getFileSet(IndexVersion indexVersion);
>   byte[] getFileContent(IndexVersion indexVersion, String file);
>   // based on experience in Hadoop, we probably wouldn't really use
>   // RPC to send file content, but rather HTTP.
> }
>
> The master thus maintains the set of indexes that are available for
> search, keeps track of which slave should handle changes to an index and
> initiates index synchronization between slaves.  The master can be
> configured to replicate indexes a specified number of times.
>
> The client library can cache the current set of searchable indexes and
> periodically refresh it.  Searches are broadcast to one index with each
> id and return merged results.  The client will load-balance both
> searches and updates.
>
> Deletions could be broadcast to all slaves.  That would probably be fast
> enough.  Alternately, indexes could be partitioned by a hash of each
> document's unique id, permitting deletions to be routed to the
> appropriate slave.
>
> Does this make sense?  Does it sound like it would be useful to Solr?
> To Nutch?  To others?  Who would be interested and able to work on it?
>
> Doug
>

~~~
101tec Inc.
search tech for web 2.1
Menlo Park, California
http://www.101tec.com







Re: [Fwd: [PROPOSAL] index server project]

2006-10-19 Thread Stefan Groschupf

Hi Doug,

we have discussed the need for such a tool internally several times and
developed some workarounds for Nutch, so I would definitely be
interested in contributing to such a project.
Having a separate project that depends on Hadoop would best fit our
use cases.


Best,
Stefan



On 18.10.2006 at 23:35, Doug Cutting wrote:

FYI, I just pitched a new project you might be interested in on  
[EMAIL PROTECTED]  Dunno if you subscribe to that list, so I'm  
spamming you.  If it sounds interesting, please reply there.  My  
management at Y! is interested in this, so I'm 'in'.


Doug

-------- Original Message --------
Subject: [PROPOSAL] index server project
Date: Wed, 18 Oct 2006 14:17:30 -0700
From: Doug Cutting <[EMAIL PROTECTED]>
Reply-To: general@lucene.apache.org
To: general@lucene.apache.org

It seems that Nutch and Solr would benefit from a shared index serving
infrastructure.  Other Lucene-based projects might also benefit from
this.  So perhaps we should start a new project to build such a thing.
This could start either in java/contrib, or as a separate sub-project,
depending on interest.

Here are some quick ideas about how this might work.

An RPC mechanism would be used to communicate between nodes (probably
Hadoop's).  The system would be configured with a single master node
that keeps track of where indexes are located, and a number of slave
nodes that would maintain, search and replicate indexes.  Clients would
talk to the master to find out which indexes to search or update, then
they'll talk directly to slaves to perform searches and updates.

Following is an outline of how this might look.

We assume that, within an index, a file with a given name is written
only once.  Index versions are sets of files, and a new version of an
index is likely to share most files with the prior version.  Versions
are numbered.  An index server should keep old versions of each index
for a while, not immediately removing old files.

public class IndexVersion {
  String Id;   // unique name of the index
  int version; // the version of the index
}

public class IndexLocation {
  IndexVersion indexVersion;
  InetSocketAddress location;
}

public interface ClientToMasterProtocol {
  IndexLocation[] getSearchableIndexes();
  IndexLocation getUpdateableIndex(String id);
}

public interface ClientToSlaveProtocol {
  // normal update
  void addDocument(String index, Document doc);
  int[] removeDocuments(String index, Term term);
  void commitVersion(String index);

  // batch update
  void addIndex(String index, IndexLocation indexToAdd);

  // search
  SearchResults search(IndexVersion i, Query query, Sort sort, int n);
}

public interface SlaveToMasterProtocol {
  // sends currently searchable indexes
  // receives updated indexes that we should replicate/update
  public IndexLocation[] heartbeat(IndexVersion[] searchableIndexes);
}

public interface SlaveToSlaveProtocol {
  String[] getFileSet(IndexVersion indexVersion);
  byte[] getFileContent(IndexVersion indexVersion, String file);
  // based on experience in Hadoop, we probably wouldn't really use
  // RPC to send file content, but rather HTTP.
}

The master thus maintains the set of indexes that are available for
search, keeps track of which slave should handle changes to an index and
initiates index synchronization between slaves.  The master can be
configured to replicate indexes a specified number of times.

The client library can cache the current set of searchable indexes and
periodically refresh it.  Searches are broadcast to one index with each
id and return merged results.  The client will load-balance both
searches and updates.

Deletions could be broadcast to all slaves.  That would probably be fast
enough.  Alternately, indexes could be partitioned by a hash of each
document's unique id, permitting deletions to be routed to the
appropriate slave.

Does this make sense?  Does it sound like it would be useful to Solr?
To Nutch?  To others?  Who would be interested and able to work on it?

Doug



~~~
101tec Inc.
search tech for web 2.1
Menlo Park, California
http://www.101tec.com





Re: [PROPOSAL] index server project

2006-10-18 Thread Yonik Seeley

On 10/18/06, Doug Cutting <[EMAIL PROTECTED]> wrote:

Does this make sense?  Does it sound like it would be useful to Solr?
To Nutch?  To others?  Who would be interested and able to work on it?


Rather than holding my tongue until I wrap my head around all the
issues, I'll say that I'm definitely interested!

-Yonik


[PROPOSAL] index server project

2006-10-18 Thread Doug Cutting
It seems that Nutch and Solr would benefit from a shared index serving 
infrastructure.  Other Lucene-based projects might also benefit from 
this.  So perhaps we should start a new project to build such a thing. 
This could start either in java/contrib, or as a separate sub-project, 
depending on interest.


Here are some quick ideas about how this might work.

An RPC mechanism would be used to communicate between nodes (probably 
Hadoop's).  The system would be configured with a single master node 
that keeps track of where indexes are located, and a number of slave 
nodes that would maintain, search and replicate indexes.  Clients would 
talk to the master to find out which indexes to search or update, then 
they'll talk directly to slaves to perform searches and updates.


Following is an outline of how this might look.

We assume that, within an index, a file with a given name is written 
only once.  Index versions are sets of files, and a new version of an 
index is likely to share most files with the prior version.  Versions 
are numbered.  An index server should keep old versions of each index 
for a while, not immediately removing old files.
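
As an illustration only (not part of the proposed API): if a file with a
given name is written only once, a slave that already holds the prior
version could work out what to copy with a plain set difference on file
names.  The class name, method name and file names below are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

public class VersionDiff {
  // Returns the file names present in newVersion but absent from oldVersion.
  // Assumption: a file name is written only once, so an equal name is taken
  // to mean equal content and the file need not be copied again.
  public static Set<String> filesToFetch(Set<String> oldVersion,
                                         Set<String> newVersion) {
    Set<String> missing = new LinkedHashSet<>(newVersion);
    missing.removeAll(oldVersion);
    return missing;
  }

  public static void main(String[] args) {
    // Hypothetical file sets for two consecutive versions of one index.
    Set<String> v1 = new LinkedHashSet<>(
        Arrays.asList("part-a.dat", "part-b.dat", "manifest-1"));
    Set<String> v2 = new LinkedHashSet<>(
        Arrays.asList("part-a.dat", "part-b.dat", "part-c.dat", "manifest-2"));
    System.out.println(filesToFetch(v1, v2)); // only the two new files
  }
}
```

The same comparison, run the other way round, would tell a server which
old files are no longer referenced once it decides to drop a version.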


public class IndexVersion {
  String Id;   // unique name of the index
  int version; // the version of the index
}

public class IndexLocation {
  IndexVersion indexVersion;
  InetSocketAddress location;
}

public interface ClientToMasterProtocol {
  IndexLocation[] getSearchableIndexes();
  IndexLocation getUpdateableIndex(String id);
}

public interface ClientToSlaveProtocol {
  // normal update
  void addDocument(String index, Document doc);
  int[] removeDocuments(String index, Term term);
  void commitVersion(String index);

  // batch update
  void addIndex(String index, IndexLocation indexToAdd);

  // search
  SearchResults search(IndexVersion i, Query query, Sort sort, int n);
}

public interface SlaveToMasterProtocol {
  // sends currently searchable indexes
  // receives updated indexes that we should replicate/update
  public IndexLocation[] heartbeat(IndexVersion[] searchableIndexes);
}

public interface SlaveToSlaveProtocol {
  String[] getFileSet(IndexVersion indexVersion);
  byte[] getFileContent(IndexVersion indexVersion, String file);
  // based on experience in Hadoop, we probably wouldn't really use
  // RPC to send file content, but rather HTTP.
}

The master thus maintains the set of indexes that are available for 
search, keeps track of which slave should handle changes to an index and 
initiates index synchronization between slaves.  The master can be 
configured to replicate indexes a specified number of times.
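
A rough sketch of the replication bookkeeping, assuming the master keeps
a map from index id to the set of slaves that reported that index in
their last heartbeat.  The class, the method and the slave names are
hypothetical, and a real master would also have to choose target slaves
and handle versions.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class ReplicationPlanner {
  // For every index held by fewer than `target` slaves, returns how many
  // additional copies would be needed to reach the configured count.
  public static Map<String, Integer> missingReplicas(
      Map<String, Set<String>> holders, int target) {
    Map<String, Integer> deficit = new TreeMap<>();
    for (Map.Entry<String, Set<String>> e : holders.entrySet()) {
      int have = e.getValue().size();
      if (have < target) {
        deficit.put(e.getKey(), target - have);
      }
    }
    return deficit;
  }
}
```

With a target of 2, an index reported by only one slave would come back
with a deficit of 1, and the master could then ask another slave to
replicate it in its next heartbeat response.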


The client library can cache the current set of searchable indexes and 
periodically refresh it.  Searches are broadcast to one index with each 
id and return merged results.  The client will load-balance both 
searches and updates.
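
A minimal sketch of the client side, under two assumptions that are not
in the outline above: replicas are picked uniformly at random, and
scores from different indexes are comparable enough to merge by simple
sorting.  The Hit class and both method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class ClientSearchSketch {
  // Hypothetical stand-in for one search result.
  public static final class Hit {
    public final String docId;
    public final float score;
    public Hit(String docId, float score) {
      this.docId = docId;
      this.score = score;
    }
  }

  // Load balancing by random choice among the replicas of one index id.
  public static <T> T pickReplica(List<T> replicas, Random random) {
    return replicas.get(random.nextInt(replicas.size()));
  }

  // Merges the per-index result lists and keeps the n highest-scoring hits.
  public static List<Hit> mergeTopN(List<List<Hit>> perIndexHits, int n) {
    List<Hit> all = new ArrayList<>();
    for (List<Hit> hits : perIndexHits) {
      all.addAll(hits);
    }
    all.sort((a, b) -> Float.compare(b.score, a.score)); // descending score
    return new ArrayList<>(all.subList(0, Math.min(n, all.size())));
  }
}
```

The client would call pickReplica once per index id, send the query to
each chosen slave, and pass the returned lists to mergeTopN.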


Deletions could be broadcast to all slaves.  That would probably be fast 
enough.  Alternately, indexes could be partitioned by a hash of each 
document's unique id, permitting deletions to be routed to the 
appropriate slave.
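
The hash-partitioning alternative could be sketched as follows, assuming
a fixed number of partitions and using String.hashCode() purely for
illustration.  The class and method names are hypothetical.

```java
public class DeletionRouter {
  // Maps a document's unique id to a partition in [0, numPartitions).
  // Math.floorMod keeps the result non-negative even when hashCode()
  // is negative.
  public static int partitionFor(String uniqueId, int numPartitions) {
    return Math.floorMod(uniqueId.hashCode(), numPartitions);
  }
}
```

Additions and deletions for the same id then land on the same partition,
so a deletion only needs to go to the slaves serving that partition.
Changing the number of partitions would remap most ids, which this
sketch does not address.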


Does this make sense?  Does it sound like it would be useful to Solr? 
To Nutch?  To others?  Who would be interested and able to work on it?


Doug