Re: Solr Cloud wiki and branch notes

2010-01-17 Thread Ted Dunning
Control is easily retained if you make pluggable the selection of shards to
which you want to do the horizontal broadcast.  The shard management layer
shouldn't know or care what query you are doing and in most cases it should
just use the trivial all-shards selection policy.

On Sun, Jan 17, 2010 at 7:34 AM, Yonik Seeley wrote:

> > I would argue that the current model has been adopted out of necessity,
> and
> > not because of the users' preference.
>
> I think it's both - I've seen quite a few people that really wanted to
> partition by time for example (and they made some compelling cases for
> doing so).  Seems like a good goal would be to support the customer
> having various levels of control.




-- 
Ted Dunning, CTO
DeepDyve


Re: Solr Cloud wiki and branch notes

2010-01-17 Thread Ted Dunning
+1

Hadoop still calls it a copy of a block if you have replication factor of
1.  Why not?

(for that matter, I still call it an integer if it has a value of 1)

On Sun, Jan 17, 2010 at 6:06 AM, Andrzej Bialecki  wrote:

> I originally started off with "replica" too... but there may only be
>> one copy of a physical shard, it seemed strange to call it a replica.
>>
>
> Yeah .. it's a replica with a replication factor of 1 :)




-- 
Ted Dunning, CTO
DeepDyve


Re: Solr Cloud wiki and branch notes

2010-01-17 Thread Ted Dunning
Jason V and Jason R have done just that.

Great idea.  Cool work.  But a unified management interface would *really*
be nice.

On Sun, Jan 17, 2010 at 6:06 AM, Andrzej Bialecki  wrote:

> Well, then if we don't intend to support updates in this iteration then
> perhaps there is no need to change anything in Solr, just extend Katta to
> run Solr searchers ... :P
>



-- 
Ted Dunning, CTO
DeepDyve


Re: Solr Cloud wiki and branch notes

2010-01-17 Thread Yonik Seeley
On Sun, Jan 17, 2010 at 9:06 AM, Andrzej Bialecki  wrote:
> On 2010-01-16 21:11, Yonik Seeley wrote:
>> If we were building from scratch perhaps - but it seems like if we can
>> just model what people do today with Solr (but just make it a lot
>> easier), that's a good start.  The opaque model is what we have today,
>> and it's conceptually simple... the complete collection consists of
>> all the unique shard ids (or slices) you know about.
>
> I would argue that the current model has been adopted out of necessity, and
> not because of the users' preference.

I think it's both - I've seen quite a few people that really wanted to
partition by time for example (and they made some compelling cases for
doing so).  Seems like a good goal would be to support the customer
having various levels of control.

> Unless you want an expert-level total
> control over what node runs what part of the index, isn't it much more
> convenient to delegate all the partitioning and deployment to your "search
> cluster" instead of managing the partitioning and deployment yourself?

Certainly - we do want to get to the "just handle everything for me"
phase.  It just feels like there is a lot more development work to do
before we can make that happen.  Reliably supporting near realtime
updates in a replicated environment is hard and will take some time.

-Yonik
http://www.lucidimagination.com


Re: Solr Cloud wiki and branch notes

2010-01-17 Thread Andrzej Bialecki

On 2010-01-16 21:11, Yonik Seeley wrote:


Agreed - but it could be as simple as qualifying this with "from shardX on
node2".


Right - it's pretty clear there are both physical and logical
shards... but it's less clear to me at this point if distinguishing
them in the vocabulary helps or hurts.


You _are_ distinguishing them, you just use "physical" and "logical" :) 
I'm in favor of using "shard" for the logical entity, and "copy" or 
"replica" for the physical one. Whichever term we choose, we need to be 
clear about this distinction because multiple physical copies (replicas) 
may be deployed to multiple nodes, even though they contribute only one 
logical shard.





The opaque model means it's more difficult to support updates.
IMHO it makes
sense to start with a set of stricter assumptions


If we were building from scratch perhaps - but it seems like if we can
just model what people do today with Solr (but just make it a lot
easier), that's a good start.  The opaque model is what we have today,
and it's conceptually simple... the complete collection consists of
all the unique shard ids (or slices) you know about.


I would argue that the current model has been adopted out of necessity, 
and not because of the users' preference. Unless you want an 
expert-level total control over what node runs what part of the index, 
isn't it much more convenient to delegate all the partitioning and 
deployment to your "search cluster" instead of managing the partitioning 
and deployment yourself? Users have to do it now because Solr has no 
mechanism for this.




And we don't need to support everything in this model - I think we
should and will also support shards where Solr does all the
partitioning and mapping of the ID space (pluggable of course) and
then we can offer more services based on that knowledge.


Well, then if we don't intend to support updates in this iteration then 
perhaps there is no need to change anything in Solr, just extend Katta 
to run Solr searchers ... :P





You've also used some slightly new terminology... "shard ID" as
opposed to just shard, which reinforces the need for different
terminology for the physical vs the logical.


You got me ;) yes, when I say "shard" I mean the logical entity, as defined
by a set of documents - physical shard I would call a replica.


I originally started off with "replica" too... but there may only be
one copy of a physical shard, it seemed strange to call it a replica.


Yeah .. it's a replica with a replication factor of 1 :)

--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Solr Cloud wiki and branch notes

2010-01-16 Thread Ted Dunning
My experience with Katta is that my developers very quickly adopted "index"
as the aggregate of all the shards, which is exactly what Andrzej is proposing.
Confusion with the "index contains shards", "nodes host shards" terminology
has been minimal.

On Sat, Jan 16, 2010 at 11:40 AM, Andrzej Bialecki  wrote:

> We're currently using "collection".  Notice how you had to add
>> (global) to clarify what you meant.  I fear that a sentence like "what
>> index are you querying" would need constant clarification.
>>
>
> I avoided the word "collection", because Solr deploys various cores under
> "collectionX" names, leading users to assume that core == collection.
> "Global index" is two words but it's unambiguous. I'm fine with the
> "collection" if we clarify the definition and avoid using this term for
> other stuff.




-- 
Ted Dunning, CTO
DeepDyve


Re: Solr Cloud wiki and branch notes

2010-01-16 Thread Yonik Seeley
On Sat, Jan 16, 2010 at 2:40 PM, Andrzej Bialecki  wrote:
> I avoided the word "collection", because Solr deploys various cores under
> "collectionX" names, leading users to assume that core == collection.

For distributed search, it's already common to name the cores the same
thing for shards of the same collection on different boxes.  In fact,
we're currently using the core name as a default for the collection
name when bootstrapping.

>> Even the statement "what shard did that response come from" becomes
>> ambiguous since we could be talking a part of the index (ShardX) or we
>> could be talking about the specific physical shard/server (it came
>> from node2).
>
> Agreed - but it could be as simple as qualifying this with "from shardX on
> node2".

Right - it's pretty clear there are both physical and logical
shards... but it's less clear to me at this point if distinguishing
them in the vocabulary helps or hurts.

> The opaque model means it's more difficult to support updates.
> IMHO it makes
> sense to start with a set of stricter assumptions

If we were building from scratch perhaps - but it seems like if we can
just model what people do today with Solr (but just make it a lot
easier), that's a good start.  The opaque model is what we have today,
and it's conceptually simple... the complete collection consists of
all the unique shard ids (or slices) you know about.

And we don't need to support everything in this model - I think we
should and will also support shards where Solr does all the
partitioning and mapping of the ID space (pluggable of course) and
then we can offer more services based on that knowledge.

>> You've also used some slightly new terminology... "shard ID" as
>> opposed to just shard, which reinforces the need for different
>> terminology for the physical vs the logical.
>
> You got me ;) yes, when I say "shard" I mean the logical entity, as defined
> by a set of documents - physical shard I would call a replica.

I originally started off with "replica" too... but there may only be
one copy of a physical shard, it seemed strange to call it a replica.

-Yonik
http://www.lucidimagination.com


Re: Solr Cloud wiki and branch notes

2010-01-16 Thread Mark Miller
Andrzej Bialecki wrote:
>
> I avoided the word "collection", because Solr deploys various cores
> under "collectionX" names, leading users to assume that core ==
> collection. "Global index" is two words but it's unambiguous. I'm fine
> with the "collection" if we clarify the definition and avoid using
> this term for other stuff.
No, currently it does not. There is a proposal for this to make certain
bootstrapping "stuff" with cloud easier, but Solr uses no such
collectionX convention at the moment (that I have ever seen).

-- 
- Mark

http://www.lucidimagination.com





Re: Solr Cloud wiki and branch notes

2010-01-16 Thread Andrzej Bialecki

On 2010-01-16 18:18, Yonik Seeley wrote:

On Fri, Jan 15, 2010 at 7:36 PM, Andrzej Bialecki  wrote:

Hi,

My 0.02 PLN on the subject ...

Terminology
---
First the terminology: reading your emails I have a feeling that my head is
about to explode. We have to agree on the vocabulary, otherwise we have no
hope of reaching any consensus.


We not only need more standardized terminology for email, but for
exact strings to put in zookeeper.


Indeed.


I propose the following vocabulary that has
been in use and is generally understood:

* (global) search index: a complete collection of all indexed documents.
 From a conceptual point of view, this is our complete search space.


We're currently using "collection".  Notice how you had to add
(global) to clarify what you meant.  I fear that a sentence like "what
index are you querying" would need constant clarification.


I avoided the word "collection", because Solr deploys various cores 
under "collectionX" names, leading users to assume that core == 
collection. "Global index" is two words but it's unambiguous. I'm fine 
with the "collection" if we clarify the definition and avoid using this 
term for other stuff.





* index shard: a non-overlapping part of the search index.


When you get down to modeling it, this gets a little squishy and is
hard to avoid using two words.
Say the complete collection is covered by ShardX and ShardY.

A way to model this is like so:

/collection
   /ShardX
 /node1 [url=..., version=...]
 /node2 [url=..., version=...]
 /node3 [url=..., version=...]

It becomes clearer that there are logical shards and physical shards.
If shards are updateable, they may have different versions at
different times.


Yes, but they are supposed to be ultimately consistent - that's where 
the replication comes in.



It may also be that all the physical shards go down,
but the logical "ShardX" remains.


Yes, as a missing piece of the global index not served currently by any 
node, thus leading to incomplete results.




Even the statement "what shard did that response come from" becomes
ambiguous since we could be talking a part of the index (ShardX) or we
could be talking about the specific physical shard/server (it came
from node2).


Agreed - but it could be as simple as qualifying this with "from shardX 
on node2".


This would be quite natural if you consider that even the same query 
submitted again could be answered by a different set of nodes that 
manage the same set of shards. E.g. with two nodes {n1, n2} and 2 shards 
{s1,s2}, and the replication factor of 2, the selection of what shard on 
what node contributes to the list of results could look like this (time 
in the Y axis):


q1 {n1:s1,n2:s2}
q2 {n1:s2,n2:s1}
...
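
That alternation could be sketched as a simple rotation over each shard's
replica list, keyed on a query counter (illustrative only; the class name
and the rotation policy are invented for this sketch, not part of any
proposal in the thread):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Rotate through each logical shard's replica nodes with the query
// counter, so successive queries hit different node/shard assignments,
// as in the q1/q2 example above.
class ReplicaRotation {
    static Map<String, String> pick(Map<String, List<String>> shardNodes,
                                    int queryNum) {
        Map<String, String> choice = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : shardNodes.entrySet()) {
            List<String> nodes = e.getValue();
            // same shard, different replica on consecutive queries
            choice.put(e.getKey(), nodes.get(queryNum % nodes.size()));
        }
        return choice;
    }
}
```

With shards s1 -> [n1, n2] and s2 -> [n2, n1], query 0 yields
{s1:n1, s2:n2} and query 1 yields {s1:n2, s2:n1}, matching the example.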





All shards in the
system form together the complete search space of the search index. E.g.
having initially one big index I could divide it into multiple shards using
MultiPassIndexSplitter, and if I combined all the shards again, using
IndexMerger, I should obtain the original complete search index (modulo
changed Lucene docids .. doesn't matter). I strongly believe in
micro-sharding, because they are much easier to handle and replicate. Also,
since we control the shards we don't have to deal with overlapping shards,
which is the curse of P2P search.


Prohibiting overlapping shards effectively prohibits ever merging or
splitting shards online (it could only be an offline or blocking
operation).  Anyway, in the opaque shard model (where clients create
shards, and we don't know how they partitioned them), shards would
have to be non-overlapping.


The opaque model means it's more difficult to support updates. IMHO it 
makes sense to start with a set of stricter assumptions in order to 
build something workable, and then relax them as we gain experience.




As far as the future (allocation and rebalancing), I'm happy with a
small-shard approach that avoids merging and splitting.  It carries
some other nice little side benefits as well.


* partitioning: a method whereby we can determine the target shard ID based
on a doc ID.


I think we're all using partitioning the same way, but that's a
narrower definition than needed.
A user may partition the index, and Solr may not have the mapping of
docid to shard.


See above - of course this would be cool and extra convenient to users, 
but much more difficult to implement so that it supports updates.




You've also used some slightly new terminology... "shard ID" as
opposed to just shard, which reinforces the need for different
terminology for the physical vs the logical.


You got me ;) yes, when I say "shard" I mean the logical entity, as 
defined by a set of documents - physical shard I would call a replica.




Now, to translate this into Solr-speak: depending on the details of the
design, and the evolution of Solr, one search node could be one Solr
instance that manages one shard per core.


A solr core is a bit too heavyweight for a microshard though.

Re: Solr Cloud wiki and branch notes

2010-01-16 Thread Yonik Seeley
On Fri, Jan 15, 2010 at 7:36 PM, Andrzej Bialecki  wrote:
> Hi,
>
> My 0.02 PLN on the subject ...
>
> Terminology
> ---
> First the terminology: reading your emails I have a feeling that my head is
> about to explode. We have to agree on the vocabulary, otherwise we have no
> hope of reaching any consensus.

We not only need more standardized terminology for email, but for
exact strings to put in zookeeper.

> I propose the following vocabulary that has
> been in use and is generally understood:
>
> * (global) search index: a complete collection of all indexed documents.
> From a conceptual point of view, this is our complete search space.

We're currently using "collection".  Notice how you had to add
(global) to clarify what you meant.  I fear that a sentence like "what
index are you querying" would need constant clarification.

> * index shard: a non-overlapping part of the search index.

When you get down to modeling it, this gets a little squishy and is
hard to avoid using two words.
Say the complete collection is covered by ShardX and ShardY.

A way to model this is like so:

/collection
  /ShardX
/node1 [url=..., version=...]
/node2 [url=..., version=...]
/node3 [url=..., version=...]

It becomes clearer that there are logical shards and physical shards.
If shards are updateable, they may have different versions at
different times.  It may also be that all the physical shards go down,
but the logical "ShardX" remains.

Even the statement "what shard did that response come from" becomes
ambiguous since we could be talking a part of the index (ShardX) or we
could be talking about the specific physical shard/server (it came
from node2).
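
One way to see the logical/physical split in that layout is as a tiny
in-memory model (a sketch only; the class and method names are made up
for illustration and are not ZooKeeper or Solr APIs):

```java
import java.util.HashMap;
import java.util.Map;

// In-memory version of the /collection/ShardX/nodeN layout: each
// logical shard id maps to the physical copies (node -> index version)
// currently serving it. A logical shard can exist with no live copies.
class CollectionModel {
    final Map<String, Map<String, Long>> shards = new HashMap<>();

    void register(String shardId, String node, long version) {
        shards.computeIfAbsent(shardId, k -> new HashMap<>())
              .put(node, version);
    }

    void nodeDown(String node) {
        for (Map<String, Long> copies : shards.values()) {
            copies.remove(node);   // the logical shard entry itself remains
        }
    }

    boolean isServed(String shardId) {
        Map<String, Long> copies = shards.get(shardId);
        return copies != null && !copies.isEmpty();
    }
}
```

If node1..node3 all go down, isServed("ShardX") becomes false but the
"ShardX" key survives: the logical shard remains even when every
physical shard is gone.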

> All shards in the
> system form together the complete search space of the search index. E.g.
> having initially one big index I could divide it into multiple shards using
> MultiPassIndexSplitter, and if I combined all the shards again, using
> IndexMerger, I should obtain the original complete search index (modulo
> changed Lucene docids .. doesn't matter). I strongly believe in
> micro-sharding, because they are much easier to handle and replicate. Also,
> since we control the shards we don't have to deal with overlapping shards,
> which is the curse of P2P search.

Prohibiting overlapping shards effectively prohibits ever merging or
splitting shards online (it could only be an offline or blocking
operation).  Anyway, in the opaque shard model (where clients create
shards, and we don't know how they partitioned them), shards would
have to be non-overlapping.

As far as the future (allocation and rebalancing), I'm happy with a
small-shard approach that avoids merging and splitting.  It carries
some other nice little side benefits as well.

> * partitioning: a method whereby we can determine the target shard ID based
> on a doc ID.

I think we're all using partitioning the same way, but that's a
narrower definition than needed.
A user may partition the index, and Solr may not have the mapping of
docid to shard.

You've also used some slightly new terminology... "shard ID" as
opposed to just shard, which reinforces the need for different
terminology for the physical vs the logical.

> * search node: an application that provides search and update to one or more
> shards.
>
> * search host: a machine that may run 1 or more search nodes.
>
> * Shard Manager: a component that keeps track of allocation of shards to
> nodes (plus more, see below).
>
> Now, to translate this into Solr-speak: depending on the details of the
> design, and the evolution of Solr, one search node could be one Solr
> instance that manages one shard per core.

A solr core is a bit too heavyweight for a microshard though.
I think a single solr core really needs to be able to handle multiple
shards for this to become practical.

> Let's forget here about the
> current distributed search component, and the current replication

Heh.  I think this is what is causing some of the mismatches...
different starting points and different assumptions.

> - they
> could be useful in this design as a raw transport mechanism, but someone
> else would be calling the shots (see below).

Seems like we need to be flexible in allowing customers to call the
shots to varying degrees.

-Yonik
http://www.lucidimagination.com

> Architecture
> 
> The replication and load balancing is a problem with many existing
> solutions, and this one in particular reminds me strongly of the Hadoop
> HDFS. In fact, early on during the development of Hadoop [1] I wondered
> whether we could reuse HDFS to manage Lucene indexes instead of opaque
> blocks of fixed size. It turned out to be infeasible, but the model of
> Namenode/Datanode still looks useful in our case, too.
>
> I believe there are many useful lessons lurking in Hadoop/HBase/Zookeeper
> that we could reuse in our design. The following is just a straightforward
> port of the Namenode/Datanode concept.
>
> Let's imagine a component called ShardManager

Re: Solr Cloud wiki and branch notes

2010-01-15 Thread Ted Dunning
On Fri, Jan 15, 2010 at 4:36 PM, Andrzej Bialecki  wrote:

> My 0.02 PLN on the subject ...
>

Polish currency seems pretty strong lately.  There are a lot of good ideas
for this small sum.


>
> Terminology
>
> * (global) search index
> * index shard:
> * partitioning:
> * search node:
> * search host:
> * Shard Manager:
>

I think that these terms are excellent.


> The replication and load balancing is a problem with many existing
> solutions, and this one in particular reminds me strongly of the Hadoop
> HDFS. In fact, early on during the development of Hadoop [1] I wondered
> whether we could reuse HDFS to manage Lucene indexes instead of opaque
> blocks of fixed size. It turned out to be infeasible, but the model of
> Namenode/Datanode still looks useful in our case, too.
>

I have seen the analogy with hadoop in managing a Katta cluster.  The
randomized assignment provides very many of the same robustness benefits as
a map-reduce architecture provides for parallel computing.


> I believe there are many useful lessons lurking in Hadoop/HBase/Zookeeper
> that we could reuse in our design. The following is just a straightforward
> port of the Namenode/Datanode concept.
>
> Let's imagine a component called ShardManager that is responsible for
> managing the following data:
>
> * list of shard ID-s that together form the complete search index,
> * for each shard ID, list of search nodes that serve this shard.
> * issuing replication requests
> * maintaining the partitioning function (see below), so that updates are
> directed to correct shards
> * maintaining heartbeat to check for dead nodes
> * providing search clients with a list of nodes to query in order to obtain
> all results from the search index.
>

I think that this is close.

I think that the list of search nodes that serve each shard should be
maintained by the nodes themselves.  Moreover, ZK provides the ability to
have this list magically update if the node dies.

This means that the need for heartbeats virtually disappears.

In addition, I think that a substrate like ZK should be used to provide
search clients with the information about which nodes have which shards and
the clients should themselves decide how to cover the set of shards with a
list of nodes.  This means that the ShardManager is *completely* out of the
real-time pathway.
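
A minimal sketch of that client-side computation, assuming the client
has already read a node-to-shards map out of ZK (all names are
hypothetical; the greedy policy is just one obvious choice):

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Greedy cover: given the node -> shards map published via ZK, pick a
// set of nodes that together serve every logical shard, with no
// ShardManager in the real-time pathway.
class ShardCover {
    static Map<String, Set<String>> cover(Map<String, Set<String>> nodeShards,
                                          Set<String> allShards) {
        Map<String, Set<String>> plan = new LinkedHashMap<>();
        Set<String> remaining = new HashSet<>(allShards);
        for (Map.Entry<String, Set<String>> e : nodeShards.entrySet()) {
            Set<String> take = new HashSet<>(e.getValue());
            take.retainAll(remaining);          // shards still uncovered
            if (!take.isEmpty()) {
                plan.put(e.getKey(), take);
                remaining.removeAll(take);
            }
        }
        // anything left in "remaining" has no live node -> partial results
        return plan;
    }
}
```

Different clients could shuffle the node iteration order to spread load,
without any coordination beyond the shared ZK state.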



> ... I believe most of the above functionality could be facilitated by
> Zookeeper, including the election of the node that runs the ShardManager.
>

Absolutely.


> Updates
> ---
> We need a partitioning schema that splits documents more or less evenly
> among shards, and at the same time allows us to split or merge unbalanced
> shards. The simplest function that we could imagine is the following:
>
>hash(docId) % numShards
>
> though this has the disadvantage that any larger update will affect
> multiple shards, thus creating an avalanche of replication requests ... so a
> sequential model would be probably better, where ranges of docIds are
> assigned to shards.
>

A hybrid is quite possible:

hash(floor(docId / sequence-size)) % numShards

this gives sequential assignment of sequence-size documents at a time.
Sequence-size should be small to distribute query results and update loads
across all nodes.  Sequence size should be large to avoid replication of all
shards after a focussed update.  Balance is necessary.
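
The two functions can be sketched directly (a sketch only; the class
name, `sequenceSize` parameter, and use of `Long.hashCode` are
illustrative choices, not part of the proposal):

```java
// Sketch of the two partitioning functions discussed in this thread.
class ShardPartitioner {

    // Basic scheme: hash(docId) % numShards.
    static int basic(long docId, int numShards) {
        return Math.floorMod(Long.hashCode(docId), numShards);
    }

    // Hybrid scheme: hash(floor(docId / sequenceSize)) % numShards.
    // Runs of sequenceSize consecutive ids land in the same shard, so a
    // focused update touches few shards, while distinct runs still
    // spread across the cluster.
    static int hybrid(long docId, int numShards, long sequenceSize) {
        return Math.floorMod(Long.hashCode(docId / sequenceSize), numShards);
    }
}
```

With sequenceSize = 1000, docs 0..999 always share a shard and replicate
as one unit, while the next run of 1000 ids may land elsewhere.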


> Now, if any particular shard is too unbalanced, e.g. too large, it could be
> further split in two halves, and the ShardManager would have to record this
> exception. This is a very similar process to a region split in HBase, or a
> page split in btree DBs. Conversely, shards that are too small could be
> joined. This is the icing on the cake, so we can leave it for later.
>

Leaving for later is a great idea.  With relatively small shards, I am able
to parallelize indexing to the point that a terabyte or so of documents
index in a few hours.  Combined with a small sequence-size in the shard
distribution function so that all shards grow together, it is easy to plan
for 3x growth or more without the need for shard splitting.  With a complete
index being so cheap, I can afford to simply reindex from scratch with a
different shard count if I feel like it.



> Search
> --
> There should be a component sometimes referred to as query integrator (or
> search front-end) that is the entry and exit point for user search requests.
> On receiving a search request this component gets a list of randomly
> selected nodes from SearchManager to contact (the list containing all shards
> that form the global index), sends the query and integrates partial results
> (under a configurable policy for timeouts/early termination), and sends back
> the assembled results to the user.
>

Yes in outline.

A few details:

I think that the shard cover computation should, in fact, be done on the
client side.  One reason is that the node/shard state is relatively static
and if all clients retrieve the f

Re: Solr Cloud wiki and branch notes

2010-01-15 Thread Andrzej Bialecki

Hi,

My 0.02 PLN on the subject ...

Terminology
---
First the terminology: reading your emails I have a feeling that my head 
is about to explode. We have to agree on the vocabulary, otherwise we 
have no hope of reaching any consensus. I propose the following 
vocabulary that has been in use and is generally understood:


* (global) search index: a complete collection of all indexed documents. 
From a conceptual point of view, this is our complete search space.


* index shard: a non-overlapping part of the search index. All shards in 
the system form together the complete search space of the search index. 
E.g. having initially one big index I could divide it into multiple 
shards using MultiPassIndexSplitter, and if I combined all the shards 
again, using IndexMerger, I should obtain the original complete search 
index (modulo changed Lucene docids .. doesn't matter). I strongly 
believe in micro-sharding, because they are much easier to handle and 
replicate. Also, since we control the shards we don't have to deal with 
overlapping shards, which is the curse of P2P search.


* partitioning: a method whereby we can determine the target shard ID 
based on a doc ID.


* search node: an application that provides search and update to one or 
more shards.


* search host: a machine that may run 1 or more search nodes.

* Shard Manager: a component that keeps track of allocation of shards to 
nodes (plus more, see below).


Now, to translate this into Solr-speak: depending on the details of the 
design, and the evolution of Solr, one search node could be one Solr 
instance that manages one shard per core. Let's forget here about the 
current distributed search component, and the current replication - they 
could be useful in this design as a raw transport mechanism, but someone 
else would be calling the shots (see below).


Architecture

The replication and load balancing is a problem with many existing 
solutions, and this one in particular reminds me strongly of the Hadoop 
HDFS. In fact, early on during the development of Hadoop [1] I wondered 
whether we could reuse HDFS to manage Lucene indexes instead of opaque 
blocks of fixed size. It turned out to be infeasible, but the model of 
Namenode/Datanode still looks useful in our case, too.


I believe there are many useful lessons lurking in 
Hadoop/HBase/Zookeeper that we could reuse in our design. The following 
is just a straightforward port of the Namenode/Datanode concept.


Let's imagine a component called ShardManager that is responsible for 
managing the following data:


* list of shard ID-s that together form the complete search index,
* for each shard ID, list of search nodes that serve this shard.
* issuing replication requests
* maintaining the partitioning function (see below), so that updates are 
directed to correct shards

* maintaining heartbeat to check for dead nodes
* providing search clients with a list of nodes to query in order to 
obtain all results from the search index.


Whenever a new search node comes up, it reports its local shard ID-s 
(versioned) to the ShardManager. Based on these reports from the 
currently active nodes, the ShardManager builds this mapping of shards 
to nodes, and requests replication if some shards are too old, or if the 
replication count is too low, allocating these shards to selected nodes 
(based on a policy of some kind).


I believe most of the above functionality could be facilitated by 
Zookeeper, including the election of the node that runs the ShardManager.


Updates
---
We need a partitioning schema that splits documents more or less evenly 
among shards, and at the same time allows us to split or merge 
unbalanced shards. The simplest function that we could imagine is the 
following:


hash(docId) % numShards

though this has the disadvantage that any larger update will affect 
multiple shards, thus creating an avalanche of replication requests ... 
so a sequential model would be probably better, where ranges of docIds 
are assigned to shards.


Now, if any particular shard is too unbalanced, e.g. too large, it could 
be further split in two halves, and the ShardManager would have to 
record this exception. This is a very similar process to a region split 
in HBase, or a page split in btree DBs. Conversely, shards that are too 
small could be joined. This is the icing on the cake, so we can leave it 
for later.


After commit, a node contacts the ShardManager to report a new version 
of the shard. ShardManager issues replication requests to other nodes 
that hold a replica of this shard.


Search
--
There should be a component sometimes referred to as query integrator 
(or search front-end) that is the entry and exit point for user search 
requests. On receiving a search request this component gets a list of 
randomly selected nodes from SearchManager to contact (the list 
containing all shards that form the global index), sends the query and 
integrates partial results (under a configurable policy for 
timeouts/early termination), and sends back the assembled results to 
the user.

Re: Solr Cloud wiki and branch notes

2010-01-15 Thread Jason Rutherglen
> This is really about doing not-so-much in the very near term,
> while thinking ahead to the longer term.

Let's have a page dedicated to release 1.0 of cloud? I feel
uncomfortable editing the existing wiki because I don't know
what the plans are for the first release.

I need to revisit Katta as my short term plans include using
Zookeeper (not for failover) but simply for deploying
shards/cores to servers, and nothing else. I can use the core
admin interface to bring them online, update them etc. Or I'll
just implement something and make a patch to Solr... Thinking
out loud:

/anyname/shardlist-v1.txt /anyname/shardlist-v2.txt

where shardlist-v1.txt contains:
corename,coredownloadpath,instanceDir

Where coredownloadpath can be any URL including hftp, hdfs, ftp, http, https.

Where the system automagically uninstalls cores that should no
longer exist on a given server. Cores with the same name
deployed to the same server would use the reload command,
otherwise the create command.

Where there's a ZK listener on the /anyname directory for new
files that are greater than the last known installed
shardlist.txt.
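
Parsing one line of that hypothetical shardlist format could look like
the following (the `ShardEntry` class is invented for illustration and
is not an existing Solr class):

```java
// One line of the proposed shardlist file:
// "corename,coredownloadpath,instanceDir"
class ShardList {
    static final class ShardEntry {
        final String core, downloadUrl, instanceDir;
        ShardEntry(String core, String downloadUrl, String instanceDir) {
            this.core = core;
            this.downloadUrl = downloadUrl;
            this.instanceDir = instanceDir;
        }
    }

    static ShardEntry parse(String line) {
        // limit=3 keeps any commas inside instanceDir intact
        String[] parts = line.split(",", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException(
                "expected corename,coredownloadpath,instanceDir: " + line);
        }
        return new ShardEntry(parts[0].trim(), parts[1].trim(), parts[2].trim());
    }
}
```

The download path is an opaque URL here, so hftp/hdfs/ftp/http(s)
handling stays in whatever fetcher consumes the entry.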

Alternatively, an even simpler design would be uploading a
solr.xml file per server, something like:
/anyname/solr-prod01.solr.xml

Which a directory listener on each server parses and makes the
necessary changes (without restarting Tomcat).

On the search side in this system, I'd need to wait for the
cores to complete their install, then swap in a new core on the
search proxy that represents the new version of the corelist,
then the old cores could go away. This isn't very different than
the segmentinfos system used in Lucene IMO.

On Fri, Jan 15, 2010 at 1:53 PM, Yonik Seeley  wrote:
> On Fri, Jan 15, 2010 at 4:12 PM, Jason Rutherglen
>  wrote:
>> The page is huge, which signals to me maybe we're trying to do
>> too much
>
> This is really about doing not-so-much in the very near term, while
> thinking ahead to the longer term.
>
>> Revamping distributed search could be in a different branch
>> (this includes partial results)
>
> That could just be a separate patch - its scope is not that broad (I
> think there may already be a JIRA issue open for it).
>
>> Having a single solrconfig and schema for each core/shard in a
>> collection won't work for me. I need to define each core
>> externally, and I don't want Solr-Cloud to manage this, how will
>> this scenario work?
>
> We do plan on each core being able to have its own schema (so one
> could try out a version of a schema and gradually migrate the
> cluster).
>
> It could also be possible to define a schema as "local" (i.e. use the
> one on the local file system)
>
>> A host is about the same as a node; I don't see the difference, or
>> enough of one
>
> A host is the hardware. It will have limited disk, limited CPU, etc.
> At some point we will want to model this... multiple nodes could be
> launched on one box.  We're not doing anything with it now, and won't
> in the near future.
>
>> Cluster resizing and rebalancing can and should be built
>> externally and hopefully after an initial release that does the
>> basics well
>
> The initial release will certainly not be doing any resizing or rebalancing.
> We should allow this to be done externally.  In the future, we
> shouldn't require that this be done externally though (i.e. we should
> somehow allow the cluster to grow w/o people having to write code).
>
>> Collection is a group of cores?
>
> A collection of documents - the complete search index.  It has a
> single schema, etc.
>
> -Yonik
> http://www.lucidimagination.com
>


Re: Solr Cloud wiki and branch notes

2010-01-15 Thread Yonik Seeley
On Fri, Jan 15, 2010 at 4:12 PM, Jason Rutherglen
 wrote:
> The page is huge, which signals to me maybe we're trying to do
> too much

This is really about doing not-so-much in the very near term, while
thinking ahead to the longer term.

> Revamping distributed search could be in a different branch
> (this includes partial results)

That could just be a separate patch - its scope is not that broad (I
think there may already be a JIRA issue open for it).

> Having a single solrconfig and schema for each core/shard in a
> collection won't work for me. I need to define each core
> externally, and I don't want Solr-Cloud to manage this, how will
> this scenario work?

We do plan on each core being able to have its own schema (so one
could try out a version of a schema and gradually migrate the
cluster).

It could also be possible to define a schema as "local" (i.e. use the
one on the local file system)

> A host is about the same as a node; I don't see the difference, or
> enough of one

A host is the hardware. It will have limited disk, limited CPU, etc.
At some point we will want to model this... multiple nodes could be
launched on one box.  We're not doing anything with it now, and won't
in the near future.

> Cluster resizing and rebalancing can and should be built
> externally and hopefully after an initial release that does the
> basics well

The initial release will certainly not be doing any resizing or rebalancing.
We should allow this to be done externally.  In the future, we
shouldn't require that this be done externally though (i.e. we should
somehow allow the cluster to grow w/o people having to write code).

> Collection is a group of cores?

A collection of documents - the complete search index.  It has a
single schema, etc.

-Yonik
http://www.lucidimagination.com


Solr Cloud wiki and branch notes

2010-01-15 Thread Jason Rutherglen
Here's some rough notes after running the unit tests, reviewing
some of the code (though not understanding it), and reviewing
the wiki page http://wiki.apache.org/solr/SolrCloud


We need a protocol in the URL, otherwise it's inflexible

I'm overwhelmed with all the ?? question areas of the document.

The page is huge, which signals to me maybe we're trying to do
too much

Revamping distributed search could be in a different branch
(this includes partial results)

Having a single solrconfig and schema for each core/shard in a
collection won't work for me. I need to define each core
externally, and I don't want Solr-Cloud to manage this, how will
this scenario work?

A host is about the same as a node; I don't see the difference, or
enough of one

Cluster resizing and rebalancing can and should be built
externally and hopefully after an initial release that does the
basics well

Collection is a group of cores?

I like the model -> reality system. However, how does the
versioning work? We need to know what the conversion progress
is. How will the queuing of in-progress alterations work? (This
seems hard; I'd rather focus on this and make it work well than
mess with other things like load balancing in the first release.
I.e., if this doesn't work well, Solr-Cloud isn't production
ready for me.)

Shard identification: this falls under "too ambitious" right now,
IMO

I think we need a wiki page of just the basics of core/shard
management, implement that, then build all the rest of the features on top...
Otherwise this thing feels like it's going to be a nightmare to
test and deploy in production.