Re: SolrCloud - Using collections, slices and shards in the wild

Jon Gifford Wed, 10 Feb 2010 09:24:10 -0800

By the way, and in case it's not obvious, I don't mean to suggest that
we should remove the ability to specify a set of shards in the search
interface. What I'm saying is that using something like "group"
instead is simpler in most cases.
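
To make the "group" idea concrete, here is a rough sketch of how a request
could be resolved to shards. Everything in it is an assumption for
illustration: the "group" request parameter, the GROUPS registry, and the
shard names are hypothetical and not part of any existing Solr API. An
explicit "shards" list still takes precedence, so nothing is taken away.

```python
from urllib.parse import urlencode

# Hypothetical registry kept on the server side: group name -> shard names.
# The group and shard names here are made up for illustration.
GROUPS = {
    "customer_1": ["shard_a", "shard_b"],
    "customer_2": ["shard_c"],
}


def resolve_shards(params, groups):
    """Return the list of shard names a request should search.

    An explicit "shards" list wins, then a hypothetical "group" name,
    and otherwise every known shard is searched.
    """
    if "shards" in params:
        return params["shards"].split(",")
    if "group" in params:
        return list(groups[params["group"]])
    # No restriction given: collect every shard once, in registry order.
    all_shards = []
    for shard_names in groups.values():
        for name in shard_names:
            if name not in all_shards:
                all_shards.append(name)
    return all_shards


def select_url(base, params):
    """Build the select URL a client would send."""
    return base + "/select?" + urlencode(params)
```

The client only ever needs the group name, e.g.
select_url("http://localhost:8983/solr/collection1",
{"distrib": "true", "group": "customer_1"}), and never has to build the
shard list itself.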


Jon

On Wed, Feb 10, 2010 at 9:14 AM, Jon Gifford <jon.giff...@gmail.com> wrote:
> On Wed, Feb 10, 2010 at 8:04 AM, Yonik Seeley
> <yo...@lucidimagination.com> wrote:
>> On Wed, Feb 10, 2010 at 12:02 AM, Jon Gifford <jon.giff...@gmail.com> wrote:
>>> Alternatively, I could create a collection per customer, which removes
>>> the need for slices, but means duplicating the schema many times.
>>
>> Multiple collections should be able to share a single config (schema
>> and related config files).
>
> OK, this solves the top-level problem (how do I manage a single
> customer's index, while guaranteeing that all indices have the same
> schema), which is good.
>
>> Note: I've backed off of the use of "slice" in the public APIs since
>> it was contentious (although I still think it's a useful concept and
>> it does remain in some of the code).  "shard" is kind of ambiguous,
>> but people are pretty good at dealing with ambiguity (and removing
>> that ambiguity by introducing another term seemed to add more
>> perceived complexity).
>
> I agree that it's a very useful concept, and wonder how much of the
> contention is just a terminology issue? If we used subcollection
> instead, the intent becomes clearer for some use-cases. If we used
> tag, or taggroup, then a slightly different (more powerful?) intent is
> suggested.
>
>>> The second part of what I need is to be able to search a single
>>> customer's index, which I'm assuming will be a slice. Something like:
>>>
>>>    http://localhost:8983/solr/collection1/select?distrib=true&slice=customer_1
>>
>> The URLs on the SolrCloud page have been updated - this would now be
>> http://localhost:8983/solr/collection1/select?distrib=true&shards=customer_1
>>
>> This will work as long as no customer becomes bigger than a shard.  If
>> that's not the case, you could query the entire collection and filter
>> on customer_1, or create a collection per customer (or do both, if you
>> have many small customers that you want to pack in a single shard).
>
> Right. I'd most likely default to using a collection per customer
> (assuming that collections can share a single config) because a single
> customer's index will be larger than a single shard.
>
>>
>> http://localhost:8983/solr/collection1/select?distrib=true&collection=customer_1
>>
>>> Reading over some of the previous discussions, slices seem to be
>>> somewhat contentious, and I wanted to chime in on them a bit here. It
>>> seems to me that slices are loosely defined, and I think that's a good
>>> thing. If you think of slices as being similar to tags, then it's easy
>>> to imagine that any given shard can belong to many different slices.
>>
>> I wouldn't call it a "slice" but I've also been thinking about how to
>> select groups of nodes.
>> Extending that to shards would also make sense.
>
> I think the important points here are that if there is the concept of
> a group (or slice or subcollection or tag - whatever terminology we
> end up using), then
>
> 1)  the client (typically some front end code) can use a simpler
> interface, which I think is a good thing. Solr doesn't need to expose
> how many shards there really are, or what they're named, and the FE
> doesn't have to try to generate a list of shard IDs just to do a
> search.
>
> 2) Some piece of code has to decide what shards to actually search,
> and that piece of code has to know exactly what shards actually exist.
> If that decision is made in the client, then it has to be made in
> every client (your customer-facing search interface, any and all
> background tasks you have running, any ad-hoc searches you do for
> analysis or spot checking or...). For the sake of simplicity and
> sanity, you don't want to have to replicate that decision making code
> across multiple apps or languages.
>
> 3) the collection and shard entities are at opposite ends of a fairly
> wide divide, and there are cases where you need something
> "in-between".
>
> In most cases, a simple collection search will suffice, but in those
> cases where you want to limit the search to particular shards, it
> makes more sense to me to manage that set of shards within Solr, and
> expose only the fact that the "groups" are available.
>
> Here's another example:
>
> Let's say you're generating hourly shards, to limit the maximum size of
> the shard that is taking updates, for performance reasons. Let's also
> assume that you want to roll those hourlies up into daily or weekly or
> maximum size shards once they become less active, so Solr isn't trying
> to search 24 shards to get a single day's worth of results. If the
> "group" concept exists, then you can hide all of the mechanics of how
> and when that happens from the client, while still allowing it to have
> some control over how far back it can search, by exposing "groups"
> that limit it to the last day or week or whatever makes sense for your
> app.
>
> cheers
>
> Jon
>
>
>>
>> -Yonik
>> http://www.lucidimagination.com
>>
>
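
For the hourly/daily rollup example quoted above, a time-window "group"
could be computed from whatever shards currently exist. This is only a
sketch under assumptions: the (name, start, end) shard records and the
shard naming are hypothetical, and nothing here reflects how Solr tracks
shards internally.

```python
from datetime import datetime, timedelta


def shards_for_window(shards, now, window):
    """Return names of shards whose time range overlaps [now - window, now].

    `shards` is a list of (name, start, end) tuples, where each shard
    covers the half-open interval [start, end).
    """
    cutoff = now - window
    return [
        name
        for name, start, end in shards
        if end > cutoff and start <= now
    ]


# Hypothetical shard layout: older data rolled up into daily shards,
# the current day still held in hourly shards.
SHARDS = [
    ("daily_2010-02-08", datetime(2010, 2, 8), datetime(2010, 2, 9)),
    ("daily_2010-02-09", datetime(2010, 2, 9), datetime(2010, 2, 10)),
    ("hourly_2010-02-10T08", datetime(2010, 2, 10, 8), datetime(2010, 2, 10, 9)),
    ("hourly_2010-02-10T09", datetime(2010, 2, 10, 9), datetime(2010, 2, 10, 10)),
]

now = datetime(2010, 2, 10, 9, 30)
last_day = shards_for_window(SHARDS, now, timedelta(days=1))
last_two_hours = shards_for_window(SHARDS, now, timedelta(hours=2))
```

A client asking for a "last day" group gets the same answer whether the
data behind it is still in hourly shards or has already been rolled up,
which is the point of hiding the mechanics.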
