Re: Solr Cloud wiki and branch notes
Control is easily retained if you make the selection of shards for the horizontal broadcast pluggable. The shard management layer shouldn't know or care what query you are running, and in most cases it should just use the trivial all-shards selection policy.

On Sun, Jan 17, 2010 at 7:34 AM, Yonik Seeley wrote:

> > I would argue that the current model has been adopted out of necessity,
> > and not because of the users' preference.
>
> I think it's both - I've seen quite a few people that really wanted to
> partition by time for example (and they made some compelling cases for
> doing so). Seems like a good goal would be to support the customer
> having various levels of control.

-- Ted Dunning, CTO DeepDyve
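The pluggable selection described above can be sketched roughly as follows. This is only an illustration in Python: the function names, the shard naming scheme ("logs-YYYY-MM") and the "months" query field are all invented for the example and are not part of Solr or Katta.

```python
# Hypothetical sketch: the broadcast layer takes a selection policy as a
# plain function and never inspects the query itself.

def all_shards(shards, query):
    """Trivial default policy: broadcast to every known shard."""
    return list(shards)

def select_by_month(shards, query):
    """Example custom policy for time-partitioned shards.

    Assumes shard ids look like 'logs-2010-01' and that the caller put a
    set of wanted months under query['months'] (both are assumptions).
    """
    wanted = query.get("months")
    if not wanted:
        return list(shards)
    return [s for s in shards if s.split("-", 1)[1] in wanted]

def broadcast_targets(shards, query, select=all_shards):
    """The management layer only applies whatever policy was plugged in."""
    return select(shards, query)

SHARDS = ["logs-2009-12", "logs-2010-01", "logs-2010-02"]
```

With the default policy every shard is queried; a user who partitions by time plugs in `select_by_month` and keeps control without the shard layer knowing anything about the query.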
Re: Solr Cloud wiki and branch notes
+1

Hadoop still calls it a copy of a block if you have a replication factor of 1. Why not? (For that matter, I still call it an integer if it has a value of 1.)

On Sun, Jan 17, 2010 at 6:06 AM, Andrzej Bialecki wrote:

>> I originally started off with "replica" too... but there may only be
>> one copy of a physical shard, it seemed strange to call it a replica.
>
> Yeah .. it's a replica with a replication factor of 1 :)

-- Ted Dunning, CTO DeepDyve
Re: Solr Cloud wiki and branch notes
Jason V and Jason R have done just that. Great idea. Cool work.

But a unified management interface would *really* be nice.

On Sun, Jan 17, 2010 at 6:06 AM, Andrzej Bialecki wrote:

> Well, then if we don't intend to support updates in this iteration then
> perhaps there is no need to change anything in Solr, just extend Katta to
> run Solr searchers ... :P

-- Ted Dunning, CTO DeepDyve
Re: Solr Cloud wiki and branch notes
On Sun, Jan 17, 2010 at 9:06 AM, Andrzej Bialecki wrote:
> On 2010-01-16 21:11, Yonik Seeley wrote:
>> If we were building from scratch perhaps - but it seems like if we can
>> just model what people do today with Solr (but just make it a lot
>> easier), that's a good start. The opaque model is what we have today,
>> and it's conceptually simple... the complete collection consists of
>> all the unique shard ids (or slices) you know about.
>
> I would argue that the current model has been adopted out of necessity, and
> not because of the users' preference.

I think it's both - I've seen quite a few people that really wanted to partition by time for example (and they made some compelling cases for doing so). Seems like a good goal would be to support the customer having various levels of control.

> Unless you want an expert-level total
> control over what node runs what part of the index, isn't it much more
> convenient to delegate all the partitioning and deployment to your "search
> cluster" instead of managing the partitioning and deployment yourself?

Certainly - we do want to get to the "just handle everything for me" phase. It just feels like there is a lot more development work to do before we can make that happen. Reliably supporting near realtime updates in a replicated environment is hard and will take some time.

-Yonik
http://www.lucidimagination.com
Re: Solr Cloud wiki and branch notes
On 2010-01-16 21:11, Yonik Seeley wrote:

>> Agreed - but it could be as simple as qualifying this with "from shardX on
>> node2".
>
> Right - it's pretty clear there are both physical and logical
> shards... but it's less clear to me at this point if distinguishing
> them in the vocabulary helps or hurts.

You _are_ distinguishing them, you just use "physical" and "logical" :) I'm in favor of using "shard" for the logical entity, and "copy" or "replica" for the physical one. Whichever term we choose, we need to be clear about this distinction because multiple physical copies (replicas) may be deployed to multiple nodes, even though they contribute only one logical shard.

>> The opaque model means it's more difficult to support updates.
>> IMHO it makes sense to start with a set of stricter assumptions
>
> If we were building from scratch perhaps - but it seems like if we can
> just model what people do today with Solr (but just make it a lot
> easier), that's a good start. The opaque model is what we have today,
> and it's conceptually simple... the complete collection consists of
> all the unique shard ids (or slices) you know about.

I would argue that the current model has been adopted out of necessity, and not because of the users' preference. Unless you want an expert-level total control over what node runs what part of the index, isn't it much more convenient to delegate all the partitioning and deployment to your "search cluster" instead of managing the partitioning and deployment yourself? Users have to do it now because Solr has no mechanism for this.

> And we don't need to support everything in this model - I think we
> should and will also support shards where Solr does all the
> partitioning and mapping of the ID space (pluggable of course) and
> then we can offer more services based on that knowledge.

Well, then if we don't intend to support updates in this iteration then perhaps there is no need to change anything in Solr, just extend Katta to run Solr searchers ... :P

>>> You've also used some slightly new terminology... "shard ID" as
>>> opposed to just shard, which reinforces the need for different
>>> terminology for the physical vs the logical.
>>
>> You got me ;) yes, when I say "shard" I mean the logical entity, as defined
>> by a set of documents - physical shard I would call a replica.
>
> I originally started off with "replica" too... but there may only be
> one copy of a physical shard, it seemed strange to call it a replica.

Yeah .. it's a replica with a replication factor of 1 :)

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
Re: Solr Cloud wiki and branch notes
My experience with Katta is that my developers very quickly adopted "index" as the name for the aggregate of all the shards, which is exactly what Andrzej is proposing. Confusion with the "index contains shards", "nodes host shards" terminology has been minimal.

On Sat, Jan 16, 2010 at 11:40 AM, Andrzej Bialecki wrote:

>> We're currently using "collection". Notice how you had to add
>> (global) to clarify what you meant. I fear that a sentence like "what
>> index are you querying" would need constant clarification.
>
> I avoided the word "collection", because Solr deploys various cores under
> "collectionX" names, leading users to assume that core == collection.
> "Global index" is two words but it's unambiguous. I'm fine with the
> "collection" if we clarify the definition and avoid using this term for
> other stuff.

-- Ted Dunning, CTO DeepDyve
Re: Solr Cloud wiki and branch notes
On Sat, Jan 16, 2010 at 2:40 PM, Andrzej Bialecki wrote:
> I avoided the word "collection", because Solr deploys various cores under
> "collectionX" names, leading users to assume that core == collection.

For distributed search, it's already common to name the cores the same thing for shards of the same collection on different boxes. In fact, we're currently using the core name as a default for the collection name when bootstrapping.

>> Even the statement "what shard did that response come from" becomes
>> ambiguous since we could be talking a part of the index (ShardX) or we
>> could be talking about the specific physical shard/server (it came
>> from node2).
>
> Agreed - but it could be as simple as qualifying this with "from shardX on
> node2".

Right - it's pretty clear there are both physical and logical shards... but it's less clear to me at this point if distinguishing them in the vocabulary helps or hurts.

> The opaque model means it's more difficult to support updates.
> IMHO it makes
> sense to start with a set of stricter assumptions

If we were building from scratch perhaps - but it seems like if we can just model what people do today with Solr (but just make it a lot easier), that's a good start. The opaque model is what we have today, and it's conceptually simple... the complete collection consists of all the unique shard ids (or slices) you know about.

And we don't need to support everything in this model - I think we should and will also support shards where Solr does all the partitioning and mapping of the ID space (pluggable of course) and then we can offer more services based on that knowledge.

>> You've also used some slightly new terminology... "shard ID" as
>> opposed to just shard, which reinforces the need for different
>> terminology for the physical vs the logical.
>
> You got me ;) yes, when I say "shard" I mean the logical entity, as defined
> by a set of documents - physical shard I would call a replica.

I originally started off with "replica" too... but there may only be one copy of a physical shard, it seemed strange to call it a replica.

-Yonik
http://www.lucidimagination.com
Re: Solr Cloud wiki and branch notes
Andrzej Bialecki wrote:
>
> I avoided the word "collection", because Solr deploys various cores
> under "collectionX" names, leading users to assume that core ==
> collection. "Global index" is two words but it's unambiguous. I'm fine
> with the "collection" if we clarify the definition and avoid using
> this term for other stuff.

No, currently it does not. There is a proposal for this to make certain bootstrapping "stuff" with cloud easier, but Solr uses no such collectionX convention at the moment (that I have ever seen).

--
- Mark
http://www.lucidimagination.com
Re: Solr Cloud wiki and branch notes
On 2010-01-16 18:18, Yonik Seeley wrote: On Fri, Jan 15, 2010 at 7:36 PM, Andrzej Bialecki wrote: Hi, My 0.02 PLN on the subject ... Terminology --- First the terminology: reading your emails I have a feeling that my head is about to explode. We have to agree on the vocabulary, otherwise we have no hope of reaching any consensus. We not only need more standardized terminology for email, but for exact strings to put in zookeeper. Indeed. I propose the following vocabulary that has been in use and is generally understood: * (global) search index: a complete collection of all indexed documents. From a conceptual point of view, this is our complete search space. We're currently using "collection". Notice how you had to add (global) to clarify what you meant. I fear that a sentence like "what index are you querying" would need constant clarification. I avoided the word "collection", because Solr deploys various cores under "collectionX" names, leading users to assume that core == collection. "Global index" is two words but it's unambiguous. I'm fine with the "collection" if we clarify the definition and avoid using this term for other stuff. * index shard: a non-overlapping part of the search index. When you get down to modeling it, this gets a little squishy and is hard to avoid using two words. Say the complete collection is covered by ShardX and ShardY. A way to model this is like so: /collection /ShardX /node1 [url=..., version=...] /node2 [url=..., version=...] /node3 [url=..., version=...] It becomes clearer that there are logical shards and physical shards. If shards are updateable, they may have different versions at different times. Yes, but they are supposed to be ultimately consistent - that's where the replication comes in. It may also be that all the physical shards go down, but the logical "ShardX" remains. Yes, as a missing piece of the global index not served currently by any node, thus leading to incomplete results. 
Even the statement "what shard did that response come from" becomes ambiguous since we could be talking a part of the index (ShardX) or we could be talking about the specific physical shard/server (it came from node2). Agreed - but it could be as simple as qualifying this with "from shardX on node2". This would be quite natural if you consider that even the same query submitted again could be answered by a different set of nodes that manage the same set of shards. E.g. with two nodes {n1, n2} and 2 shards {s1,s2}, and the replication factor of 2, the selection of what shard on what node contributes to the list of results could look like this (time in the Y axis): q1 {n1:s1,n2:s2} q2 {n1:s2,n2:s1} ... All shards in the system form together the complete search space of the search index. E.g. having initially one big index I could divide it into multiple shards using MultiPassIndexSplitter, and if I combined all the shards again, using IndexMerger, I should obtain the original complete search index (modulo changed Lucene docids .. doesn't matter). I strongly believe in micro-sharding, because they are much easier to handle and replicate. Also, since we control the shards we don't have to deal with overlapping shards, which is the curse of P2P search. Prohibiting overlapping shards effectively prohibits ever merging or splitting shards online (it could only be an offline or blocking operation). Anyway, in the opaque shard model (where clients create shards, and we don't know how they partitioned them), shards would have to be non-overlapping. The opaque model means it's more difficult to support updates. IMHO it makes sense to start with a set of stricter assumptions in order to build something workable, and then relax them as we gain experience. As far as the future (allocation and rebalancing), I'm happy with a small-shard approach that avoids merging and splitting. It carries some other nice little side benefits as well. 
* partitioning: a method whereby we can determine the target shard ID based on a doc ID. I think we're all using partitioning the same way, but that's a narrower definition than needed. A user may partition the index, and Solr may not have the mapping of docid to shard. See above - of course this would be cool and extra convenient to users, but much more difficult to implement so that it supports updates. You've also used some slightly new terminology... "shard ID" as opposed to just shard, which reinforces the need for different terminology for the physical vs the logical. You got me ;) yes, when I say "shard" I mean the logical entity, as defined by a set of documents - physical shard I would call a replica. Now, to translate this into Solr-speak: depending on the details of the design, and the evolution of Solr, one search node could be one Solr instance that manages one shard per core. A solr core is a bit too heavyweight for a microshard thou
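The q1/q2 example in this message (the same shards answered by different nodes on successive queries) can be sketched as a random choice of one serving node per logical shard. This is a minimal Python illustration; `pick_cover` and `group_by_node` are hypothetical helpers, not existing Solr code, and the shard-to-node map is assumed to be available from somewhere such as ZooKeeper.

```python
import random

def pick_cover(shard_nodes, rng=random):
    """Choose one serving node per logical shard.

    shard_nodes maps shard id -> set of nodes that hold a replica.
    Returns (cover, missing): cover maps shard -> chosen node, and missing
    lists shards that no node serves (results would be incomplete).
    """
    cover, missing = {}, []
    for shard in sorted(shard_nodes):
        nodes = sorted(shard_nodes[shard])
        if nodes:
            cover[shard] = rng.choice(nodes)
        else:
            missing.append(shard)
    return cover, missing

def group_by_node(cover):
    """Turn shard -> node into node -> [shards], i.e. one request per node."""
    by_node = {}
    for shard, node in sorted(cover.items()):
        by_node.setdefault(node, []).append(shard)
    return by_node

# Two nodes, two shards, replication factor 2, plus one unserved shard.
shard_nodes = {"s1": {"n1", "n2"}, "s2": {"n1", "n2"}, "s3": set()}
cover, missing = pick_cover(shard_nodes, random.Random(0))
```

Each response can then be labelled "from s1 on n2", which is the qualification suggested in the thread.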
Re: Solr Cloud wiki and branch notes
On Fri, Jan 15, 2010 at 7:36 PM, Andrzej Bialecki wrote: > Hi, > > My 0.02 PLN on the subject ... > > Terminology > --- > First the terminology: reading your emails I have a feeling that my head is > about to explode. We have to agree on the vocabulary, otherwise we have no > hope of reaching any consensus. We not only need more standardized terminology for email, but for exact strings to put in zookeeper. > I propose the following vocabulary that has > been in use and is generally understood: > > * (global) search index: a complete collection of all indexed documents. > From a conceptual point of view, this is our complete search space. We're currently using "collection". Notice how you had to add (global) to clarify what you meant. I fear that a sentence like "what index are you querying" would need constant clarification. > * index shard: a non-overlapping part of the search index. When you get down to modeling it, this gets a little squishy and is hard to avoid using two words. Say the complete collection is covered by ShardX and ShardY. A way to model this is like so: /collection /ShardX /node1 [url=..., version=...] /node2 [url=..., version=...] /node3 [url=..., version=...] It becomes clearer that there are logical shards and physical shards. If shards are updateable, they may have different versions at different times. It may also be that all the physical shards go down, but the logical "ShardX" remains. Even the statement "what shard did that response come from" becomes ambiguous since we could be talking a part of the index (ShardX) or we could be talking about the specific physical shard/server (it came from node2). > All shards in the > system form together the complete search space of the search index. E.g. 
> having initially one big index I could divide it into multiple shards using > MultiPassIndexSplitter, and if I combined all the shards again, using > IndexMerger, I should obtain the original complete search index (modulo > changed Lucene docids .. doesn't matter). I strongly believe in > micro-sharding, because they are much easier to handle and replicate. Also, > since we control the shards we don't have to deal with overlapping shards, > which is the curse of P2P search. Prohibiting overlapping shards effectively prohibits ever merging or splitting shards online (it could only be an offline or blocking operation). Anyway, in the opaque shard model (where clients create shards, and we don't know how they partitioned them), shards would have to be non-overlapping. As far as the future (allocation and rebalancing), I'm happy with a small-shard approach that avoids merging and splitting. It carries some other nice little side benefits as well. > * partitioning: a method whereby we can determine the target shard ID based > on a doc ID. I think we're all using partitioning the same way, but that's a narrower definition than needed. A user may partition the index, and Solr may not have the mapping of docid to shard. You've also used some slightly new terminology... "shard ID" as opposed to just shard, which reinforces the need for different terminology for the physical vs the logical. > * search node: an application that provides search and update to one or more > shards. > > * search host: a machine that may run 1 or more search nodes. > > * Shard Manager: a component that keeps track of allocation of shards to > nodes (plus more, see below). > > Now, to translate this into Solr-speak: depending on the details of the > design, and the evolution of Solr, one search node could be one Solr > instance that manages one shard per core. A solr core is a bit too heavyweight for a microshard though. 
I think a single solr core really needs to be able to handle multiple shards for this to become practical. > Let's forget here about the > current distributed search component, and the current replication Heh. I think this is what is causing some of the mismatches... different starting points and different assumptions. > - they > could be useful in this design as a raw transport mechanism, but someone > else would be calling the shots (see below). Seems like we need to be flexible in allowing customers to call the shots to varying degrees. -Yonik http://www.lucidimagination.com > Architecture > > The replication and load balancing is a problem with many existing > solutions, and this one in particular reminds me strongly of the Hadoop > HDFS. In fact, early on during the development of Hadoop [1] I wondered > whether we could reuse HDFS to manage Lucene indexes instead of opaque > blocks of fixed size. It turned out to be infeasible, but the model of > Namenode/Datanode still looks useful in our case, too. > > I believe there are many useful lessons lurking in Hadoop/HBase/Zookeeper > that we could reuse in our design. The following is just a straightforward > port of the Namenode/Datanode concept. > > Let's imagine a component called ShardM
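The /collection/ShardX/nodeN layout from this message can be modeled as a nested mapping, which makes the logical versus physical distinction concrete. The Python below is only a sketch: the URLs, version numbers and helper names are invented for illustration and are not ZooKeeper or Solr APIs.

```python
# Logical shard -> node -> properties of the physical copy on that node.
# All values are made-up examples.
layout = {
    "ShardX": {
        "node1": {"url": "http://node1.example/solr", "version": 7},
        "node2": {"url": "http://node2.example/solr", "version": 7},
        "node3": {"url": "http://node3.example/solr", "version": 6},
    },
    # A logical shard can remain in the model even if no node serves it.
    "ShardY": {},
}

def znode_paths(layout, root="/collection"):
    """List the paths this layout would occupy, shards before their nodes."""
    paths = []
    for shard in sorted(layout):
        paths.append(f"{root}/{shard}")
        for node in sorted(layout[shard]):
            paths.append(f"{root}/{shard}/{node}")
    return paths

def unserved_shards(layout):
    """Logical shards with no physical copy currently registered."""
    return sorted(shard for shard, copies in layout.items() if not copies)
```

Here "ShardX" is the logical shard, while "ShardX on node3" names one physical copy, which in this example lags one version behind the others.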
Re: Solr Cloud wiki and branch notes
On Fri, Jan 15, 2010 at 4:36 PM, Andrzej Bialecki wrote:

> My 0.02 PLN on the subject ...

Polish currency seems pretty strong lately. There are a lot of good ideas for this small sum.

> Terminology
>
> * (global) search index
> * index shard:
> * partitioning:
> * search node:
> * search host:
> * Shard Manager:

I think that these terms are excellent.

> The replication and load balancing is a problem with many existing
> solutions, and this one in particular reminds me strongly of the Hadoop
> HDFS. In fact, early on during the development of Hadoop [1] I wondered
> whether we could reuse HDFS to manage Lucene indexes instead of opaque
> blocks of fixed size. It turned out to be infeasible, but the model of
> Namenode/Datanode still looks useful in our case, too.

I have seen the analogy with Hadoop in managing a Katta cluster. The randomized assignment provides very many of the same robustness benefits as a map-reduce architecture provides for parallel computing.

> I believe there are many useful lessons lurking in Hadoop/HBase/Zookeeper
> that we could reuse in our design. The following is just a straightforward
> port of the Namenode/Datanode concept.
>
> Let's imagine a component called ShardManager that is responsible for
> managing the following data:
>
> * list of shard ID-s that together form the complete search index,
> * for each shard ID, list of search nodes that serve this shard.
> * issuing replication requests
> * maintaining the partitioning function (see below), so that updates are
> directed to correct shards
> * maintaining heartbeat to check for dead nodes
> * providing search clients with a list of nodes to query in order to obtain
> all results from the search index.

I think that this is close.

I think that the list of search nodes that serve each shard should be maintained by the nodes themselves. Moreover, ZK provides the ability to have this list magically update if the node dies. This means that the need for heartbeats virtually disappears.

In addition, I think that a substrate like ZK should be used to provide search clients with the information about which nodes have which shards and the clients should themselves decide how to cover the set of shards with a list of nodes. This means that the ShardManager is *completely* out of the real-time pathway.

> ... I believe most of the above functionality could be facilitated by
> Zookeeper, including the election of the node that runs the ShardManager.

Absolutely.

> Updates
> ---
> We need a partitioning schema that splits documents more or less evenly
> among shards, and at the same time allows us to split or merge unbalanced
> shards. The simplest function that we could imagine is the following:
>
>    hash(docId) % numShards
>
> though this has the disadvantage that any larger update will affect
> multiple shards, thus creating an avalanche of replication requests ... so a
> sequential model would be probably better, where ranges of docIds are
> assigned to shards.

A hybrid is quite possible:

   hash(floor(docId / sequence-size)) % numShards

This gives sequential assignment of sequence-size documents at a time. Sequence-size should be small to distribute query results and update loads across all nodes. Sequence-size should be large to avoid replication of all shards after a focussed update. Balance is necessary.

> Now, if any particular shard is too unbalanced, e.g. too large, it could be
> further split in two halves, and the ShardManager would have to record this
> exception. This is a very similar process to a region split in HBase, or a
> page split in btree DBs. Conversely, shards that are too small could be
> joined. This is the icing on the cake, so we can leave it for later.

Leaving for later is a great idea. With relatively small shards, I am able to parallelize indexing to the point that a terabyte or so of documents index in a few hours. Combined with a small sequence-size in the shard distribution function so that all shards grow together, it is easy to plan for 3x growth or more without the need for shard splitting. With a complete index being so cheap, I can afford to simply reindex from scratch with a different shard count if I feel like it.

> Search
> --
> There should be a component sometimes referred to as query integrator (or
> search front-end) that is the entry and exit point for user search requests.
> On receiving a search request this component gets a list of randomly
> selected nodes from SearchManager to contact (the list containing all shards
> that form the global index), sends the query and integrates partial results
> (under a configurable policy for timeouts/early termination), and sends back
> the assembled results to the user.

Yes in outline. A few details:

I think that the shard cover computation should, in fact, be done on the client side. One reason is that the node/shard state is relatively static and if all clients retrieve the f
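The two partitioning functions discussed in this message, hash(docId) % numShards and the hybrid hash(floor(docId / sequence-size)) % numShards, can be sketched in Python as below. The sketch assumes integer docIds and uses MD5 purely as a stable hash (Python's built-in hash() of a small integer is the integer itself, which would reduce the hybrid to plain round-robin blocks); the function names are illustrative only.

```python
import hashlib

def stable_hash(value):
    """Deterministic hash that does not change between processes."""
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def shard_by_hash(doc_id, num_shards):
    """hash(docId) % numShards: neighbouring docIds scatter across shards."""
    return stable_hash(doc_id) % num_shards

def shard_by_block(doc_id, sequence_size, num_shards):
    """hash(floor(docId / sequence-size)) % numShards.

    Runs of sequence_size consecutive docIds land on the same shard, so a
    focused update touches few shards, while blocks as a whole are still
    spread over all shards.
    """
    return stable_hash(doc_id // sequence_size) % num_shards
```

With sequence_size = 100, docIds 0 through 99 all map to one shard and 100 through 199 to another (possibly the same) shard; shrinking sequence_size toward 1 recovers the plain hash behaviour.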
Re: Solr Cloud wiki and branch notes
Hi,

My 0.02 PLN on the subject ...

Terminology
-----------
First the terminology: reading your emails I have a feeling that my head is about to explode. We have to agree on the vocabulary, otherwise we have no hope of reaching any consensus. I propose the following vocabulary that has been in use and is generally understood:

* (global) search index: a complete collection of all indexed documents. From a conceptual point of view, this is our complete search space.

* index shard: a non-overlapping part of the search index. All shards in the system form together the complete search space of the search index. E.g. having initially one big index I could divide it into multiple shards using MultiPassIndexSplitter, and if I combined all the shards again, using IndexMerger, I should obtain the original complete search index (modulo changed Lucene docids .. doesn't matter). I strongly believe in micro-sharding, because micro-shards are much easier to handle and replicate. Also, since we control the shards we don't have to deal with overlapping shards, which is the curse of P2P search.

* partitioning: a method whereby we can determine the target shard ID based on a doc ID.

* search node: an application that provides search and update to one or more shards.

* search host: a machine that may run 1 or more search nodes.

* Shard Manager: a component that keeps track of allocation of shards to nodes (plus more, see below).

Now, to translate this into Solr-speak: depending on the details of the design, and the evolution of Solr, one search node could be one Solr instance that manages one shard per core. Let's forget here about the current distributed search component, and the current replication - they could be useful in this design as a raw transport mechanism, but someone else would be calling the shots (see below).

Architecture
------------
The replication and load balancing is a problem with many existing solutions, and this one in particular reminds me strongly of the Hadoop HDFS. In fact, early on during the development of Hadoop [1] I wondered whether we could reuse HDFS to manage Lucene indexes instead of opaque blocks of fixed size. It turned out to be infeasible, but the model of Namenode/Datanode still looks useful in our case, too.

I believe there are many useful lessons lurking in Hadoop/HBase/Zookeeper that we could reuse in our design. The following is just a straightforward port of the Namenode/Datanode concept.

Let's imagine a component called ShardManager that is responsible for managing the following data:

* list of shard ID-s that together form the complete search index,
* for each shard ID, list of search nodes that serve this shard.
* issuing replication requests
* maintaining the partitioning function (see below), so that updates are directed to correct shards
* maintaining heartbeat to check for dead nodes
* providing search clients with a list of nodes to query in order to obtain all results from the search index.

Whenever a new search node comes up, it reports its local shard ID-s (versioned) to the ShardManager. Based on these reports from the currently active nodes, the ShardManager builds this mapping of shards to nodes, and requests replication if some shards are too old, or if the replication count is too low, allocating these shards to selected nodes (based on a policy of some kind).

I believe most of the above functionality could be facilitated by Zookeeper, including the election of the node that runs the ShardManager.

Updates
-------
We need a partitioning schema that splits documents more or less evenly among shards, and at the same time allows us to split or merge unbalanced shards. The simplest function that we could imagine is the following:

   hash(docId) % numShards

though this has the disadvantage that any larger update will affect multiple shards, thus creating an avalanche of replication requests ... so a sequential model would be probably better, where ranges of docIds are assigned to shards.

Now, if any particular shard is too unbalanced, e.g. too large, it could be further split in two halves, and the ShardManager would have to record this exception. This is a very similar process to a region split in HBase, or a page split in btree DBs. Conversely, shards that are too small could be joined. This is the icing on the cake, so we can leave it for later.

After commit, a node contacts the ShardManager to report a new version of the shard. ShardManager issues replication requests to other nodes that hold a replica of this shard.

Search
------
There should be a component sometimes referred to as query integrator (or search front-end) that is the entry and exit point for user search requests. On receiving a search request this component gets a list of randomly selected nodes from ShardManager to contact (the list containing all shards that form the global index), sends the query and integrate
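The ShardManager bookkeeping described in the Architecture and Updates sections (nodes report versioned shard IDs, the manager builds the shard-to-node map and asks for replication when a copy is too old or the replica count is too low) might look roughly like this. It is a Python sketch under simplifying assumptions: versions are plain integers, "too old" means older than the newest reported copy, and every name here is hypothetical.

```python
from collections import defaultdict

def build_shard_map(reports):
    """Invert node reports {node: {shard_id: version}} into
    {shard_id: {node: version}}."""
    shard_map = defaultdict(dict)
    for node, shards in reports.items():
        for shard_id, version in shards.items():
            shard_map[shard_id][node] = version
    return dict(shard_map)

def replication_plan(shard_map, replication_factor):
    """Return (stale, under_replicated).

    stale: shard -> nodes holding a copy older than the newest one.
    under_replicated: shard -> number of additional copies needed.
    Which nodes receive new copies is left to an allocation policy.
    """
    stale, under_replicated = {}, {}
    for shard_id, holders in shard_map.items():
        newest = max(holders.values())
        old = sorted(n for n, v in holders.items() if v < newest)
        if old:
            stale[shard_id] = old
        shortfall = replication_factor - len(holders)
        if shortfall > 0:
            under_replicated[shard_id] = shortfall
    return stale, under_replicated

# node1 has just committed version 3 of s1; node2 still holds version 2.
reports = {"node1": {"s1": 3, "s2": 1}, "node2": {"s1": 2}, "node3": {"s2": 1}}
shard_map = build_shard_map(reports)
```

In this example node2 would be asked to pull the newer copy of s1; raising the replication factor to 3 would additionally flag both shards as one copy short.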
Re: Solr Cloud wiki and branch notes
> This is really about doing not-so-much in the very near term,
> while thinking ahead to the longer term.

Let's have a page dedicated to release 1.0 of cloud? I feel uncomfortable editing the existing wiki because I don't know what the plans are for the first release.

I need to revisit Katta as my short term plans include using Zookeeper (not for failover) but simply for deploying shards/cores to servers, and nothing else. I can use the core admin interface to bring them online, update them etc. Or I'll just implement something and make a patch to Solr...

Thinking out loud:

/anyname/shardlist-v1.txt
/anyname/shardlist-v2.txt

where shardlist-v1.txt contains:

corename,coredownloadpath,instanceDir

Where coredownloadpath can be any URL including hftp, hdfs, ftp, http, https. Where the system automagically uninstalls cores that should no longer exist on a given server. Cores with the same name deployed to the same server would use the reload command, otherwise the create command.

Where there's a ZK listener on the /anyname directory for new files that are greater than the last known installed shardlist.txt.

Alternatively, an even simpler design would be uploading a solr.xml file per server, something like:

/anyname/solr-prod01.solr.xml

Which a directory listener on each server parses and makes the necessary changes (without restarting Tomcat).

On the search side in this system, I'd need to wait for the cores to complete their install, then swap in a new core on the search proxy that represents the new version of the corelist, then the old cores could go away. This isn't very different than the segmentinfos system used in Lucene IMO.

On Fri, Jan 15, 2010 at 1:53 PM, Yonik Seeley wrote:
> On Fri, Jan 15, 2010 at 4:12 PM, Jason Rutherglen wrote:
>> The page is huge, which signals to me maybe we're trying to do
>> too much
>
> This is really about doing not-so-much in the very near term, while
> thinking ahead to the longer term.
>
>> Revamping distributed search could be in a different branch
>> (this includes partial results)
>
> That could just be a separate patch - its scope is not that broad (I
> think there may already be a JIRA issue open for it).
>
>> Having a single solrconfig and schema for each core/shard in a
>> collection won't work for me. I need to define each core
>> externally, and I don't want Solr-Cloud to manage this, how will
>> this scenario work?
>
> We do plan on each core being able to have its own schema (so one
> could try out a version of a schema and gradually migrate the
> cluster).
>
> It could also be possible to define a schema as "local" (i.e. use the
> one on the local file system)
>
>> A host is about the same as node, I don't see the difference, or
>> enough of one
>
> A host is the hardware. It will have limited disk, limited CPU, etc.
> At some point we will want to model this... multiple nodes could be
> launched on one box. We're not doing anything with it now, and won't
> in the near future.
>
>> Cluster resizing and rebalancing can and should be built
>> externally and hopefully after an initial release that does the
>> basics well
>
> The initial release will certainly not be doing any resizing or rebalancing.
> We should allow this to be done externally. In the future, we
> shouldn't require that this be done externally though (i.e. we should
> somehow allow the cluster to grow w/o people having to write code).
>
>> Collection is a group of cores?
>
> A collection of documents - the complete search index. It has a
> single schema, etc.
>
> -Yonik
> http://www.lucidimagination.com
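The shardlist idea from this message (versioned shardlist-vN.txt files, one "corename,coredownloadpath,instanceDir" entry per line, reload for cores already present, create for new ones, uninstall for the rest) can be sketched in Python as below. Everything here is an assumption made for illustration: the function names, the action labels, and the example URLs and paths; no ZooKeeper or Solr calls are made.

```python
import re

def newest_shardlist(names, last_installed):
    """Pick the highest shardlist-vN.txt with N greater than last_installed."""
    best = None
    for name in names:
        m = re.fullmatch(r"shardlist-v(\d+)\.txt", name)
        if m and int(m.group(1)) > last_installed:
            candidate = (int(m.group(1)), name)
            if best is None or candidate > best:
                best = candidate
    return best

def parse_shardlist(text):
    """Parse 'corename,coredownloadpath,instanceDir' lines into a dict."""
    cores = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, download_path, instance_dir = [f.strip() for f in line.split(",")]
        cores[name] = (download_path, instance_dir)
    return cores

def plan_actions(installed, wanted):
    """Reload cores already on this server, create new ones, and
    uninstall cores that are no longer listed."""
    actions = []
    for name in sorted(wanted):
        actions.append(("reload" if name in installed else "create", name))
    for name in sorted(set(installed) - set(wanted)):
        actions.append(("uninstall", name))
    return actions

# Hypothetical shardlist content; hosts and paths are placeholders.
example = """core0,hdfs://namenode.example/indexes/core0,/solr/core0
core2,http://files.example/core2.zip,/solr/core2
"""
wanted = parse_shardlist(example)
```

A listener on /anyname would call `newest_shardlist` with the directory listing and the last installed version, then apply the planned actions through whatever core administration mechanism is in use.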
Re: Solr Cloud wiki and branch notes
On Fri, Jan 15, 2010 at 4:12 PM, Jason Rutherglen wrote:
> The page is huge, which signals to me maybe we're trying to do
> too much

This is really about doing not-so-much in the very near term, while thinking ahead to the longer term.

> Revamping distributed search could be in a different branch
> (this includes partial results)

That could just be a separate patch - its scope is not that broad (I think there may already be a JIRA issue open for it).

> Having a single solrconfig and schema for each core/shard in a
> collection won't work for me. I need to define each core
> externally, and I don't want Solr-Cloud to manage this, how will
> this scenario work?

We do plan on each core being able to have its own schema (so one could try out a version of a schema and gradually migrate the cluster).

It could also be possible to define a schema as "local" (i.e. use the one on the local file system).

> A host is about the same as node, I don't see the difference, or
> enough of one

A host is the hardware. It will have limited disk, limited CPU, etc. At some point we will want to model this... multiple nodes could be launched on one box. We're not doing anything with it now, and won't in the near future.

> Cluster resizing and rebalancing can and should be built
> externally and hopefully after an initial release that does the
> basics well

The initial release will certainly not be doing any resizing or rebalancing. We should allow this to be done externally. In the future, we shouldn't require that this be done externally though (i.e. we should somehow allow the cluster to grow w/o people having to write code).

> Collection is a group of cores?

A collection of documents - the complete search index. It has a single schema, etc.

-Yonik
http://www.lucidimagination.com
Solr Cloud wiki and branch notes
Here are some rough notes after running the unit tests, reviewing some of the code (though not understanding it), and reviewing the wiki page http://wiki.apache.org/solr/SolrCloud

We need a protocol in the URL, otherwise it's inflexible.

I'm overwhelmed with all the ?? question areas of the document.

The page is huge, which signals to me maybe we're trying to do too much.

Revamping distributed search could be in a different branch (this includes partial results).

Having a single solrconfig and schema for each core/shard in a collection won't work for me. I need to define each core externally, and I don't want Solr-Cloud to manage this. How will this scenario work?

A host is about the same as a node; I don't see the difference, or enough of one.

Cluster resizing and rebalancing can and should be built externally, and hopefully after an initial release that does the basics well.

Collection is a group of cores?

I like the model -> reality system. However, how does the versioning work? We need to know what the conversion progress is. How will the queuing of in-progress alterations work? (This seems hard. I'd rather focus on this and make it work well than mess with other things like load balancing in the first release, i.e. if this doesn't work well, Solr-Cloud isn't production ready for me.)

Shard identification: this falls under "too ambitious" right now, IMO.

I think we need a wiki page of just the basics of core/shard management, implement that, then build all the rest of the features on top... Otherwise this thing feels like it's going to be a nightmare to test and deploy in production.