Re: SolrCloud logical shards
On Thu, Jan 14, 2010 at 2:43 PM, Yonik Seeley wrote:
> On Thu, Jan 14, 2010 at 1:58 PM, Chris Hostetter wrote:
>> : parameter we use for this. Suggestions? logicalshards=shard1,shard2?
>> : lshards=shard1,shard2? slice=shard1,shard2? It doesn't seem like it
>> : would be easy to reuse the "shards" parameter for this since it refers
>> : to physical shard addresses.
>>
>> I haven't been following the SolrCloud stuff much, but from a client
>> perspective is there really any difference between asking for a physical
>> shard, vs asking for a logical shard (or slice name)? ... shouldn't the
>> latter case just result in a resolution from logical->physical w/o
>> requiring the client code to know/care whether the String they have is a
>> physical shard URL, or a slice name.
>
> That might be doable... but we would need to be able to tell the difference.
> Perhaps we could always require a slash in a physical address
> (localhost/context) and prohibit it in slice names?
>
> But... I think there's still a potentially bigger difference: today,
> if shards is set, it means it's a distributed search (and shards is
> removed for sub-requests). But the slice of the index being requested
> may not have a one-to-one mapping with a full request on a solr core.
> And shards may be able to move around, and so it seems important to be
> able to declare what part of the index you're looking for when you're
> querying a shard.

If we want to go this route for parameters (allowing use of either physical or logical shards in the shards param), I've updated the wiki with one way to do it:

"""
The presence of "shards" is what currently signals that a request is distributed, and distrib search removes this param for sub-requests. But with future micro-sharding or having a single core support multiple shards, the request will need to contain what shards are being requested.

Reusing "shards" for this (per Hoss' suggestion) by allowing either physical urls or logical shards (slices) would require that either

* a) The search component detect when it has all of the shards requested, and turn it into a non-distributed request (any error here could easily result in an infinite request loop until deadlock). It seems better to return a specific error if this node no longer contains the shard being queried in a non-distrib search.
* b) Use a different distrib=true flag to indicate if this is a distributed search. This isn't back compatible though? Unless we also consider any request where shards contains a url to be distributed.

http://localhost:8983/solr/collection1/select?shards=shard_200911,shard_200912,shard_201001&distrib=true

If we adopt "distrib=true" then it should replace "shards=auto" in the other example URLs.
"""

So the top-level distributed request shown above would resolve to potentially multiple sub-requests of the form

http://localhost:1234/solr/collection1/select?shards=shard_200911 (note, distrib=true has been removed)
http://localhost:1235/solr/collection1/select?shards=shard_200912
http://localhost:1236/solr/collection1/select?shards=shard_201001

-Yonik
http://www.lucidimagination.com
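For illustration only, here is a minimal Python sketch of option (b) as described in the wiki text: a top-level request carrying distrib=true is expanded into one sub-request per slice, and distrib is dropped from each sub-request. The slice-to-node table SLICE_LOCATIONS and the function build_sub_requests are hypothetical names invented for this sketch; they are not part of Solr, and a real cluster would presumably look the mapping up from cluster state.

```python
from urllib.parse import urlencode

# Hypothetical slice -> node table (an assumption for this sketch, not Solr API).
SLICE_LOCATIONS = {
    "shard_200911": "localhost:1234",
    "shard_200912": "localhost:1235",
    "shard_201001": "localhost:1236",
}


def build_sub_requests(params, collection="collection1"):
    """Expand a top-level distributed request into per-slice sub-request URLs."""
    if params.get("distrib") != "true":
        raise ValueError("not a distributed request")
    urls = []
    for name in params["shards"].split(","):
        node = SLICE_LOCATIONS.get(name)
        if node is None:
            # Fail with a specific error rather than re-distributing the request.
            raise LookupError(f"no node hosts slice {name!r}")
        # Each sub-request names exactly one slice and no longer carries distrib.
        sub = {k: v for k, v in params.items() if k != "distrib"}
        sub["shards"] = name
        urls.append(f"http://{node}/solr/{collection}/select?{urlencode(sub)}")
    return urls


top_level = {
    "shards": "shard_200911,shard_200912,shard_201001",
    "distrib": "true",
}
for url in build_sub_requests(top_level):
    print(url)
```

Raising an error for an unknown slice follows the wiki's preference for a specific error over silently re-distributing, which is where the infinite request loop could otherwise come from.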
Re: SolrCloud logical shards
On Fri, Jan 15, 2010 at 1:24 AM, Uri Boness wrote:
> The point I was trying to make is that I believe that if you start changing
> terminologies now people will be very confused

So shard -> remote core... Slice -> core group. Though semantically they're synonyms.

In any case, I need to spend some time looking at the cloud branch, and less time jibber-jabberin' about it.
Re: SolrCloud logical shards
On Thu, Jan 14, 2010 at 1:38 PM, Ted Dunning wrote:
> I think that most of these complications go away to a remarkable degree if
> you combine katta style random assignment of small shards.
>
> The major simplifications there include:
>
> - no need to move individual documents, nor to split or merge shards, no
> need for search-server to search-server communications

Yeah, keeping shards smaller allows cluster growth (to some degree) w/o getting into shard splitting. Until a single core can handle multiple shards though, this isn't too practical.

While I think we should eventually support this model, I don't think we want to limit ourselves to it. The idea is to also support the type of cluster architectures that people have today. And yes, I think that does cause complications :-)

-Yonik
http://www.lucidimagination.com
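A rough Python sketch of the small-shard idea discussed in this message, under stated assumptions: shards are opaque ids, placement is random, and growing the cluster moves whole shards rather than splitting them. The helpers assign_shards and add_node are hypothetical and do not reflect how Katta or Solr actually implement placement.

```python
import random


def assign_shards(shard_ids, nodes, seed=0):
    """Randomly place each small shard on one node (seeded for repeatability)."""
    rng = random.Random(seed)
    assignment = {node: [] for node in nodes}
    for shard in shard_ids:
        assignment[rng.choice(nodes)].append(shard)
    return assignment


def add_node(assignment, new_node):
    """Grow the cluster by moving whole shards; nothing is split or merged."""
    total = sum(len(shards) for shards in assignment.values())
    assignment[new_node] = []
    target = total // len(assignment)  # fair share for the new node
    while len(assignment[new_node]) < target:
        # Take one shard from whichever existing node currently holds the most.
        donor = max(
            (n for n in assignment if n != new_node),
            key=lambda n: len(assignment[n]),
        )
        assignment[new_node].append(assignment[donor].pop())
    return assignment


cluster = assign_shards(list(range(12)), ["node1", "node2", "node3"])
cluster = add_node(cluster, "node4")
print({node: len(shards) for node, shards in cluster.items()})
```

The sketch only works when there are many more shards than nodes, which matches the caveat above that this model is impractical until a single core can handle multiple shards.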
Re: SolrCloud logical shards
Jason Rutherglen wrote:
> Can you elaborate on what you mean, isn't a core a single index too? It seems like shard was used to represent a remote index (perhaps?).

Yes, a core is a single index and a shard is a conceptual idea which at the moment concretely refers to a remote core (but not a specific one, as the same shard can be represented by multiple core replicas). The point I was trying to make is that I believe that if you start changing terminologies now people will be very confused. And I thought of sticking to Yonik's suggestion of a "slice" just to prevent this confusion.

On the other hand, one can argue that the terminology as it is today is already confusing... and if you really want to get it right and be aligned with the "rest of the world" (if there is such a thing... from what I've seen so far sharding is used differently in different contexts), then perhaps a good time for making such terminology changes is with a major release (Solr 2.0?), as with such a release people tend to be more open to new/changed concepts.

Cheers,
Uri
Re: SolrCloud logical shards
My definition of right is simple and modularized with minimal conceptual upheaval. As such, simply giving SOLR a good shard manager that broadcasts queries without respect to content is preferable to something very fancy.

On Thu, Jan 14, 2010 at 4:31 PM, Lance Norskog wrote:
> Logical-to-physical mapping should not assume that the logical has an
> integral number of the physical. Overlapping and partial physical
> shards should be addressable as a logical shard. If you're going to do
> something this major, do it right.

--
Ted Dunning, CTO
DeepDyve
Re: SolrCloud logical shards
Yonik spake-
> I'm actually starting to lean toward "slice" instead of "logical shard". In the future we'll want to enable overlapping shards I think (due to an Amazon Dynamo type of replication, or due to merging shards, etc.), and a separate word for a logical slice of the index seems desirable. For instance, one could specify slice=1000-1999 (defined by the ids or hashcodes of the ids) and that could end up querying multiple servers. For this first iteration, slices would just be opaque identifiers though (and that functionality would always remain, allowing for user partitioning by time or by geo region).

+1

Logical-to-physical mapping should not assume that the logical has an integral number of the physical. Overlapping and partial physical shards should be addressable as a logical shard. If you're going to do something this major, do it right.

On Thu, Jan 14, 2010 at 3:29 PM, Ted Dunning wrote:
> Shard has the interesting additional implication that it is part of a
> composite index made up of many sub-indexes.
>
> A lucene index could be a complete index or a shard. I would presume the
> same of what might be called a core.
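To make the slice=1000-1999 idea from this message concrete, here is a small Python sketch, assuming each physical shard covers an inclusive range of id hash codes and that ranges may overlap or only partially cover the slice. The PHYSICAL_SHARDS table, host names, and both function names are invented for the example.

```python
# Hypothetical physical shards: (address, lowest hash, highest hash), inclusive.
PHYSICAL_SHARDS = [
    ("host1:8983/solr", 0, 1499),
    ("host2:8983/solr", 1500, 2999),
    ("host3:8983/solr", 1200, 1799),  # overlaps both of the shards above
]


def parse_slice(spec):
    """Parse a 'lo-hi' slice spec such as '1000-1999' into two ints."""
    lo_text, hi_text = spec.split("-")
    lo, hi = int(lo_text), int(hi_text)
    if lo > hi:
        raise ValueError(f"empty slice range: {spec!r}")
    return lo, hi


def shards_for_slice(spec):
    """Return every physical shard whose range intersects the logical slice."""
    lo, hi = parse_slice(spec)
    return [
        address
        for address, shard_lo, shard_hi in PHYSICAL_SHARDS
        if shard_lo <= hi and lo <= shard_hi
    ]


print(shards_for_slice("1000-1999"))
```

This only finds the candidate servers. Because the shards overlap and none of them matches the slice exactly, a real implementation would also have to restrict each sub-query to the requested range and de-duplicate documents returned by more than one shard.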
Re: SolrCloud logical shards
Shard has the interesting additional implication that it is part of a composite index made up of many sub-indexes.

A lucene index could be a complete index or a shard. I would presume the same of what might be called a core.

On Thu, Jan 14, 2010 at 3:21 PM, Jason Rutherglen <jason.rutherg...@gmail.com> wrote:
> Uri,
>
> > "core" to represent a single index and "shard" to be
> > represented by a single core
>
> Can you elaborate on what you mean, isn't a core a single index
> too? It seems like shard was used to represent a remote index
> (perhaps?). Though here I'd prefer "remote core", because to the
> uninitiated Solr outsider it's immediately obvious (i.e. they
> need only know what a core is, in the Solr glossary or term
> dictionary).
Re: SolrCloud logical shards
Uri,

On Thu, Jan 14, 2010 at 2:58 PM, Uri Boness wrote:
> "core" to represent a single index and "shard" to be
> represented by a single core

Can you elaborate on what you mean, isn't a core a single index too? It seems like shard was used to represent a remote index (perhaps?). Though here I'd prefer "remote core", because to the uninitiated Solr outsider it's immediately obvious (i.e. they need only know what a core is, in the Solr glossary or term dictionary).

In Google vernacular, which is where the name shard came from, a "shard" is basically a local sub-index http://research.google.com/archive/googlecluster.html where there would be many "shards" per server. However that's a digression at this point.

I personally prefer relatively straightforward names that are self-evident, rather than inventing new language for fairly simple concepts. Slice, even though it comes from our buddy Yonik, probably doesn't make any immediate sense to external users when compared with the word shard. Of course software projects have a tendency to create their own words to somewhat mystify users into believing in some sort of magic occurring underneath. If that's what we're after, it's cool, I mean that makes sense. And I don't mean to be derogatory here; however, this is an open source project created in part to educate users on search and be made as easily accessible as possible, to the greatest number of users possible. I think Doug did a great job of this when Lucene started with amazingly succinct code for fairly complex concepts (e.g., anti-mystification of search).

Jason
Re: SolrCloud logical shards
Although Jason has some valid points here, I'm with Yonik here. I do believe that we've gotten used to the terms "core" to represent a single index and "shard" to be represented by a single core. A "node" seems to indicate a machine or a JVM. Changing any of these (informal perhaps) definitions will only cause confusion. That's why I think a "slice" is a good solution now... first, it's a new term for a new view of the index (logical shards AFAIK don't really exist yet) so people won't need to get used to it, but it's also descriptive and intuitive. I do like Jason's idea about having a protocol attached to the URLs.

Cheers,
Uri
Re: SolrCloud logical shards
On Thu, Jan 14, 2010 at 12:56 PM, Yonik Seeley wrote:
> But I've kind of gotten used to thinking of shards as the
> actual physical queryable things...

I think a mistake was made referring to Solr cores as shards. It's the same thing with 2 different names. Slices adds yet another name which seems to imply the same thing yet again. I'd rather see disambiguation here, and call them cores (partially because that's what's in the code and on the wiki), and cores only. It's a Solr specific term, it's going to be confused with microprocessor cores, but at least there's only one name, which as search people, we know creates fewer posting lists :).

Logical groupings of cores can occur, which can be aptly named core groups. This way I can submit a query to a core group, and it's reasonable to assume I'm hitting N cores. Further, cores could point to a logical or physical entity via a URL. (As a side note, I've always found it odd that the shards param to RequestHandler lacks the protocol, what if I want to use HTTPS for example?).

So there could be http://host/solr/core1 (physical), core://megacorename (logical), coregroup://supergreatcoregroupname (a group of cores) in the "shards" parameter (whose name can perhaps be changed for clarity in a future release). Then people can mix and match and we won't have many different XML elements floating around. We'd have a simple list of URLs that are transposed into a real physical network request.
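As a hedged illustration of the URL-scheme proposal in this message, the Python sketch below resolves each entry of a mixed list into physical URLs. The core:// and coregroup:// schemes exist only as a suggestion in this thread, and the CORES and CORE_GROUPS lookup tables (including the second core and host2) are made up for the example.

```python
from urllib.parse import urlsplit

# Hypothetical registries; names and addresses are assumptions for the sketch.
CORES = {
    "megacorename": "http://host/solr/core1",
    "othercore": "https://host2/solr/core2",
}
CORE_GROUPS = {
    "supergreatcoregroupname": ["megacorename", "othercore"],
}


def resolve(entry):
    """Turn one entry of the proposed list into physical request URLs."""
    parts = urlsplit(entry)
    if parts.scheme in ("http", "https"):
        return [entry]  # already a physical address, protocol included
    if parts.scheme == "core":
        return [CORES[parts.netloc]]  # one logical core -> one physical URL
    if parts.scheme == "coregroup":
        return [CORES[name] for name in CORE_GROUPS[parts.netloc]]
    raise ValueError(f"unsupported scheme in {entry!r}")


def resolve_all(entries):
    """Resolve a mixed list, keeping the first occurrence of each URL."""
    seen = []
    for entry in entries:
        for url in resolve(entry):
            if url not in seen:
                seen.append(url)
    return seen


print(resolve_all(["core://megacorename", "coregroup://supergreatcoregroupname"]))
```

Putting the scheme in every entry also covers the side note about HTTPS, since a physical entry carries its own protocol.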
Re: SolrCloud logical shards
I have found that users of the system like to use "index" to mean the composite of all nodes/shards/slices that is searched in response to a query. It is the ultimate logical entity.

Really, this is the same abstraction that users of Lucene have. They really don't want to care that a Lucene index is made up of several files and even possibly several indexes in various states of merging. The same should really be true of a parallel system, but more so.

On Thu, Jan 14, 2010 at 12:56 PM, Yonik Seeley wrote:
> On Thu, Jan 14, 2010 at 1:38 PM, Yonik Seeley wrote:
> > On Thu, Jan 14, 2010 at 12:46 PM, Yonik Seeley wrote:
> >> I'm actually starting to lean toward "slice" instead of "logical shard".
>
> Alternate terminology could be "index" for the actual physical Lucene
> index (and also enough of the URL that unambiguously identifies it),
> and then "shard" could be the logical entity.
>
> But I've kind of gotten used to thinking of shards as the actual
> physical queryable things...
>
> -Yonik
> http://www.lucidimagination.com

-- Ted Dunning, CTO DeepDyve
Re: SolrCloud logical shards
On Thu, Jan 14, 2010 at 1:38 PM, Yonik Seeley wrote:
> On Thu, Jan 14, 2010 at 12:46 PM, Yonik Seeley wrote:
>> I'm actually starting to lean toward "slice" instead of "logical shard".

Alternate terminology could be "index" for the actual physical Lucene index (and also enough of the URL that unambiguously identifies it), and then "shard" could be the logical entity.

But I've kind of gotten used to thinking of shards as the actual physical queryable things...

-Yonik
http://www.lucidimagination.com
Re: SolrCloud logical shards
On Thu, Jan 14, 2010 at 1:58 PM, Chris Hostetter wrote:
> : parameter we use for this. Suggestions? logicalshards=shard1,shard2?
> : lshards=shard1,shard2? slice=shard1,shard2? It doesn't seem like it
> : would be easy to reuse the "shards" parameter for this since it refers
> : to physical shard addresses.
>
> I haven't been following the SolrCloud stuff much, but from a client
> perspective is there really any difference between asking for a physical
> shard, vs asking for a logical shard (or slice name)? ... shouldn't the
> latter case just result in a resolution from logical->physical w/o
> requiring the client code to know/care whether the String they have is a
> physical shard URL, or a slice name.

That might be doable... but we would need to be able to tell the difference. Perhaps we could always require a slash in a physical address (localhost/context) and prohibit it in slice names?

But... I think there's still a potentially bigger difference: today, if shards is set, it means it's a distributed search (and shards is removed for sub-requests). But the slice of the index being requested may not have a one-to-one mapping with a full request on a Solr core. And shards may be able to move around, and so it seems important to be able to declare what part of the index you're looking for when you're querying a shard.

-Yonik
http://www.lucidimagination.com
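The slash convention suggested in this message is simple enough to sketch. The following assumes exactly that rule (physical addresses always contain a slash, slice names never do); the function names and the example values are made up for illustration and do not reflect existing Solr code.

```python
def is_physical_address(entry):
    """Apply the convention floated above: a slash marks a physical address."""
    # e.g. "localhost:8983/solr" is physical, "shard_200911" is a slice name.
    return "/" in entry


def split_shards_param(value):
    """Split a comma-separated "shards" value into (physical, slices)."""
    physical, slices = [], []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate stray commas
        if is_physical_address(entry):
            physical.append(entry)
        else:
            slices.append(entry)
    return physical, slices
```

This only answers the "tell the difference" question. It does not address the second concern in the message, namely how a sub-request declares which part of the index it expects the receiving core to serve.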
Re: SolrCloud logical shards
: parameter we use for this. Suggestions? logicalshards=shard1,shard2?
: lshards=shard1,shard2? slice=shard1,shard2? It doesn't seem like it
: would be easy to reuse the "shards" parameter for this since it refers
: to physical shard addresses.

I haven't been following the SolrCloud stuff much, but from a client perspective is there really any difference between asking for a physical shard, vs asking for a logical shard (or slice name)? ... shouldn't the latter case just result in a resolution from logical->physical w/o requiring the client code to know/care whether the String they have is a physical shard URL, or a slice name.

This seems completely analogous to hostnames:

- I'm an application.
- via some means, I've got a (String) $host
- I ask my networking library to open a connection to $host
- the networking library worries about whether $host is a name or an IP
- if $host is an alias, the DNS server resolves it to a hostname
- if $host is a hostname, the DNS server resolves it to an IP (possibly round robin)

Likewise in Solr:

- I'm an application.
- via some means, I've got a (Set) $shards
- I ask Solr to search across $shards
- Solr looks at each item in $shards
- if it's the name of a slice, it picks a physical shard
- if it's a physical shard, it uses that shard

...there's got to be a mapping from slice_name=>Set(physical_shards) anyway, right? Why should the client have to know the difference?

-Hoss
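The Solr half of the analogy above can be sketched as a lookup against the slice_name=>Set(physical_shards) mapping. This is only an illustration under assumptions: `resolve_shards`, the shape of `slice_map`, and the random choice of replica are all hypothetical, and any entry that is not a known slice name is simply treated as a physical shard.

```python
import random


def resolve_shards(requested, slice_map, rng=random):
    """Map each requested entry to one physical shard address.

    slice_map is a hypothetical slice_name => set(physical_shards) table.
    Entries that are not slice names are assumed to be physical already.
    """
    resolved = []
    for entry in requested:
        replicas = slice_map.get(entry)
        if replicas:
            # Slice name: pick one of its physical shards. Random choice is
            # used here; round robin would work just as well.
            resolved.append(rng.choice(sorted(replicas)))
        else:
            # Not a known slice: use the entry as a physical shard as-is.
            resolved.append(entry)
    return resolved
```

As with DNS, the client hands over a list of strings and never needs to know which of them were names and which were addresses.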
Re: SolrCloud logical shards
I think that most of these complications go away to a remarkable degree if you combine Katta-style random assignment of small shards. The major simplifications there include:

- no need to move individual documents, nor to split or merge shards, no need for search-server to search-server communications
- search servers do search and nothing else
- placement, balance, replication and query balancing policy is factored out of all real-time paths
- real-time updates can be accommodated in the same framework with minimal changes to the shard management layer
- the shard management is completely agnostic to the actual search semantics.

On Thu, Jan 14, 2010 at 9:46 AM, Yonik Seeley wrote:
> I'm actually starting to lean toward "slice" instead of "logical shard".
> In the future we'll want to enable overlapping shards I think (due to
> an Amazon Dynamo type of replication, or due to merging shards, etc),
> and a separate word for a logical slice of the index seems desirable.

-- Ted Dunning, CTO DeepDyve
Re: SolrCloud logical shards
On Thu, Jan 14, 2010 at 12:46 PM, Yonik Seeley wrote: > I'm actually starting to lean toward "slice" instead of "logical shard". I've gone with this for now and updated http://wiki.apache.org/solr/SolrCloud but it's certainly not written in stone if people want to try and come up with better naming... -Yonik http://www.lucidimagination.com
Re: SolrCloud logical shards
On Thu, Jan 14, 2010 at 12:30 PM, Ted Dunning wrote:
> Another concept from Katta that is AFAIK missing from the Solr lexicon is
> the distinction between node and shard. In Katta, a node is a server worker
> instance that contains and queries physical shards.

I think it's sort of missing because a single Solr core can only support a single Lucene index at this point, and we're starting with low-hanging fruit. So it's still a bit up in the air whether we're modeling a "node" as a single JVM webapp, or as a single Solr core. I'd really like to not model the core at all and go with node and shards... but I'm not sure how well that abstraction will hold up with the reality of Solr cores that's here today.

The first iteration won't have automatic shard assignment at all, I think. It will just be centralized configuration and automatic load balancing. Just a start, but it will still make people's lives easier. Baby steps...

-Yonik
http://www.lucidimagination.com
Re: SolrCloud logical shards
I'm actually starting to lean toward "slice" instead of "logical shard". In the future we'll want to enable overlapping shards I think (due to an Amazon Dynamo type of replication, or due to merging shards, etc), and a separate word for a logical slice of the index seems desirable. For instance, one could specify slice=1000-1999 (defined by the ids or hashcodes of the ids) and that could end up querying multiple servers. For this first iteration, slices would just be opaque identifiers though (and that functionality would always remain, allowing for user partitioning by time or by geo region). So "slice" would be logical, "shard" would be physical. To get a full result, one needs to query all of the slices of an index, but not necessarily all of the shards. -Yonik http://www.lucidimagination.com On Thu, Jan 14, 2010 at 12:08 PM, Yonik Seeley wrote: > The shards parameter currently references physical shards. > There's also a concept of a logical shard (i.e. all physical shards > with identical index content share the same logical shards... > sometimes what I've also called a shard replica). > Should we use logical shard for this, or does anyone have any better ideas? > > Related: it seems like we would want to enable querying of specific > logical shards (say if a user partitioned their shards by time or by > geographic region), so the terminology above could affect the > parameter we use for this. Suggestions? logicalshards=shard1,shard2? > lshards=shard1,shard2? slice=shard1,shard2? It doesn't seem like it > would be easy to reuse the "shards" parameter for this since it refers > to physical shard addresses. > > -Yonik > http://www.lucidimagination.com >
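As a rough illustration of the slice/shard split described in this message, the sketch below gives each slice a hash range and a list of physical shards that hold it. Everything concrete here is an assumption: the slice names, hosts, range boundaries, and the use of CRC32 as a stand-in for "hashcodes of the ids" are invented for the example.

```python
import zlib

# Hypothetical layout: each slice owns a half-open hash range [lo, hi) and is
# served by one or more physical shards (replicas). Names and hosts are made up.
SLICES = {
    "slice_a": {"range": (0, 1000), "shards": ["host1/solr", "host2/solr"]},
    "slice_b": {"range": (1000, 2000), "shards": ["host3/solr"]},
}
HASH_SPACE = 2000


def slice_for_id(doc_id):
    """Return the name of the slice whose hash range covers doc_id."""
    # CRC32 stands in for whatever id hash a real system would use.
    h = zlib.crc32(doc_id.encode("utf-8")) % HASH_SPACE
    for name, info in SLICES.items():
        lo, hi = info["range"]
        if lo <= h < hi:
            return name
    raise LookupError(f"no slice covers hash {h}")


def shards_for_full_query():
    """Pick one physical shard per slice, which is enough for a full result."""
    return [info["shards"][0] for info in SLICES.values()]
```

With this layout a full query touches two shards (one per slice) even though three physical shards exist, which is the "all of the slices, but not necessarily all of the shards" point.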
Re: SolrCloud logical shards
"Logical shard" sounds good as "the collection of all identical physical shards".

Another concept from Katta that is AFAIK missing from the Solr lexicon is the distinction between node and shard. In Katta, a node is a server worker instance that contains and queries physical shards. There is usually one node per physical server, but not always.

In Katta an important performance and reliability optimization is that nodes do not contain identical shard sets. That is, shards are assigned randomly even when replicated. This improves robustness, code simplicity and load balancing.

On Thu, Jan 14, 2010 at 9:08 AM, Yonik Seeley wrote:
> Should we use logical shard for this, or does anyone have any better ideas?

-- Ted Dunning, CTO DeepDyve
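The random placement described in this message can be sketched in a few lines. This is not Katta's actual code; `assign_shards` and its parameters are hypothetical, and the only property it aims to show is that each shard's replicas land on distinct, randomly chosen nodes, so nodes generally end up with different shard sets.

```python
import random


def assign_shards(shards, nodes, replicas, seed=None):
    """Randomly place `replicas` copies of every shard on distinct nodes.

    `nodes` must be a list. Returns a dict mapping node -> set of shards.
    """
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    rng = random.Random(seed)
    placement = {node: set() for node in nodes}
    for shard in shards:
        # sample() picks distinct nodes, so no node holds a shard twice.
        for node in rng.sample(nodes, replicas):
            placement[node].add(shard)
    return placement
```

Because no two nodes are deliberate mirrors of each other, losing one node spreads its query load over several survivors instead of doubling the load on a single twin.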