Re: Solr Cloud sharding strategy

2016-03-07 Thread Erick Erickson
What do you mean "the rest of the cluster"? The routing is based on
the key provided. All of the "enu" prefixes will go to one of your
shards. All the "deu" docs will appear on one shard. All the "esp"
will be on one shard. All the "chs" docs will be on one shard.
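To make that concrete, here's a minimal sketch of prefix-based routing. It is not Solr's actual implementation (compositeId routing uses MurmurHash3 over the prefix to pick a slice of the 32-bit hash range); crc32 just stands in to show the mechanics: the prefix alone decides the shard, and nothing balances per-shard document counts.

```python
# Illustrative sketch of compositeId-style routing. Assumption: crc32 is a
# stand-in for Solr's MurmurHash3; the real router maps the prefix hash into
# the upper bits of a 32-bit hash range.
import zlib

NUM_SHARDS = 2

def shard_for(composite_id: str) -> int:
    # Only the part before '!' decides the shard, so every document
    # sharing a prefix lands on the same shard.
    prefix = composite_id.split("!", 1)[0]
    return zlib.crc32(prefix.encode("utf-8")) % NUM_SHARDS

docs = ["enu!doc1", "enu!doc2", "deu!doc3", "deu!doc4", "esp!doc5", "chs!doc6"]
placement = {doc: shard_for(doc) for doc in docs}

# Same-prefix docs are guaranteed to co-locate; even distribution is not.
assert placement["enu!doc1"] == placement["enu!doc2"]
assert placement["deu!doc3"] == placement["deu!doc4"]
```

With only four distinct prefixes and two shards, nothing stops all four prefixes from hashing to the same shard; that is exactly the skew risk described above.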

Which shard will each go to? Good question. Especially when you have a
small number of keys, and/or one of the keys holds a majority of your
corpus, you can end up with very uneven distributions. If you require
individual control, what I'd do is create separate _collections_ for
each language, then use collection aliasing to have a single URL to
query. Of course, that requires that you index to the correct
collection... You could also create a collection for the language
with the most docs and one for "everything else". Or...

The advantage here is that the collection can be tailored to the
number of docs. That is, the Spanish collection may be a single shard
whereas the English one may be 4 shards.

But really, with a corpus this size I wouldn't worry about it. I
suspect you're over-thinking the problem.

And one addendum to Walter's comment... I often turn caching off (or
way down) when doing perf testing if I can't mine logs for, say,
100K queries, in an attempt to negate the effects of caching. That
doesn't force swapping, though, which is its weakness.

I worked with one client that was thrilled at getting < 5ms response
times for their stress tests, with many threads simultaneously
executing queries... except they were firing the exact same query
over and over and over.

Best,
Erick




Re: Solr Cloud sharding strategy

2016-03-07 Thread shamik
Thanks Erick and Walter, this is extremely insightful. One last follow-up
question on composite routing. I'm trying to get a better understanding of
index distribution. If I use language as a prefix, SolrCloud guarantees that
content in the same language will be routed to the same shard. What I'm
curious to know is how the rest of the data is distributed across the
remaining shards. For example, suppose I have the following composite keys:

enu!doc1
enu!doc2
deu!doc3
deu!doc4
esp!doc5
chs!doc6

If I have 2 shards in the cluster, will SolrCloud try to distribute the above
data evenly? Is it possible that enu will be routed to shard1 while deu goes
to shard2, and esp and chs get indexed in either of them? Or could all of
them potentially end up getting indexed in the same shard, either 1 or 2,
leaving one shard under-utilized?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Cloud-sharding-strategy-tp4262274p4262336.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Cloud sharding strategy

2016-03-07 Thread Walter Underwood
Excellent advice, and I’d like to reinforce a few things.

* Solr indexing is CPU intensive and generates lots of disk IO. Faster CPUs and 
faster disks matter a lot.
* Realistic user query logs are super important. We measure 95th percentile 
latency and that is dominated by rare and malformed queries.
* 5000 queries is not nearly enough. That totally fits in cache. I usually 
start with 100K, though I’d like more. Benchmarking a cached system is one of 
the hardest things in devops.
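A toy sketch of why a small query set undermines a benchmark. This models a generic LRU cache, not Solr's queryResultCache specifically: replaying 100K requests drawn from only 5K distinct queries means everything after the first pass is a hit, so latency numbers mostly measure the cache.

```python
# Toy LRU cache to illustrate hit rates under a too-small query set.
# Not Solr's cache implementation; sizes here are illustrative.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, query: str) -> None:
        if query in self.data:
            self.hits += 1
            self.data.move_to_end(query)   # mark as most recently used
        else:
            self.misses += 1
            if len(self.data) >= self.capacity:
                self.data.popitem(last=False)  # evict least recently used
            self.data[query] = True

cache = LRUCache(capacity=10_000)
for i in range(100_000):
    cache.get(f"query-{i % 5_000}")        # only 5K distinct queries

# 5K misses on the first pass, hits ever after: 95% hit rate overall.
hit_rate = cache.hits / (cache.hits + cache.misses)
```

With 100K distinct real-user queries instead, the same harness would keep generating misses and exercise the index rather than the cache.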

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)




Re: Solr Cloud sharding strategy

2016-03-07 Thread Erick Erickson
Still, 50M is not excessive for a single shard, although it's getting
into the range where I'd like proof that my hardware etc. is adequate
before committing to it. I've seen up to 300M docs on a single
machine, though admittedly they were tweets. YMMV based on hardware
and index complexity, of course. Here's a long blog about sizing:
https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

In this case I'd be pretty comfortable creating a test harness
(using jMeter or the like) and faking the extra 30M documents by
re-indexing the current corpus but assigning new IDs (<uniqueKey>).
Keep doing this until your target machine breaks (i.e. either blows up
by exhausting memory or the response slows unacceptably), and that'll
give you a good upper bound. Note that you should plan on a couple of
rounds of tuning/testing when you start to have problems.
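A minimal sketch of that corpus-padding idea. The field names are hypothetical, and in practice you'd feed the clones back through your normal indexing pipeline rather than build them all in memory:

```python
# Sketch: grow a test corpus by cloning existing docs under fresh uniqueKey
# values, one "round" at a time. Field names ('id', 'title') are assumptions.
import copy

def clone_with_new_ids(docs: list[dict], round_num: int) -> list[dict]:
    """Return copies of `docs` whose 'id' fields are unique to this round."""
    clones = []
    for doc in docs:
        c = copy.deepcopy(doc)
        c["id"] = f'{doc["id"]}-round{round_num}'  # new uniqueKey, same content
        clones.append(c)
    return clones

corpus = [{"id": "doc1", "title": "hello"}, {"id": "doc2", "title": "world"}]
expanded = corpus + clone_with_new_ids(corpus, 1) + clone_with_new_ids(corpus, 2)

# 2 originals + 2 rounds of clones = 6 docs, all with distinct ids,
# so none of them overwrite each other at index time.
assert len({d["id"] for d in expanded}) == 6
```

Each extra round adds another full copy of the corpus, so going from 20M to 50M docs takes a round and a half of cloning.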

I'll warn you up front, though, that unless you have an existing app
to mine for _real_ user queries, generating say 5,000 "typical"
queries is more of a challenge than you might expect ;)...

Now, all that said, all is not lost if you do go with a single shard.
Let's say that 6 months down the road your requirements change. Or the
initial estimate was off. Or...

There are a couple of options:
1> create a new collection with more shards and re-index from scratch
2> use the SPLITSHARD Collections API call to, well, split the shard.


In this latter case, a shard is split into two pieces of roughly equal
size, which does mean that you can only grow your shard count by
powers of 2.
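A sketch of why growth comes in powers of 2: each SPLITSHARD divides a shard's slice of the 32-bit hash range into two contiguous halves, so splitting every shard doubles the count. This assumes the default even split, and models only the hash-range bookkeeping, not the index-copying Solr does:

```python
# Sketch of SPLITSHARD's hash-range arithmetic (default even split assumed).
def split(shard_range: tuple[int, int]) -> list[tuple[int, int]]:
    """Split one shard's hash range into two contiguous halves."""
    lo, hi = shard_range
    mid = (lo + hi) // 2
    return [(lo, mid), (mid + 1, hi)]

shards = [(0, 2**32 - 1)]      # one shard covering the whole hash range
for _ in range(2):             # two rounds of splitting every shard
    shards = [half for s in shards for half in split(s)]

assert len(shards) == 4        # 1 -> 2 -> 4: counts grow by powers of 2

# The sub-ranges stay contiguous and still cover the full range.
for (_, hi_a), (lo_b, _) in zip(shards, shards[1:]):
    assert lo_b == hi_a + 1
```

Splitting only *some* shards is also possible, which is how you end up with unevenly sized shards if you split selectively rather than across the board.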

And even if you do have a single shard, using SolrCloud is still a
good thing as the failover is automagically handled assuming you have
more than one replica...

Best,
Erick



Re: Solr Cloud sharding strategy

2016-03-07 Thread shamik
Thanks a lot, Erick. You are right, it's a tad small with around 20 million
documents, but the growth projection is around 50 million in the next 6-8
months. It'll continue to grow, but maybe not at the same rate. From the
index size point of view, it can grow up to half a TB from its current
state. Honestly, my perception of a "big" index is still vague :-) . All I'm
trying to make sure is that the decision I take is scalable in the long term
and will be able to sustain the growth without compromising performance.





Re: Solr Cloud sharding strategy

2016-03-07 Thread Erick Erickson
20M docs is actually a very small collection by the "usual" Solr
standards unless they're _really_ large documents, i.e.
large books.

Actually, I wouldn't even shard to begin with; it's unlikely that it's
necessary, and it adds inevitable overhead. If you _must_ shard,
just go with <1>, but again I would be surprised if it were even
necessary.

Best,
Erick



Solr Cloud sharding strategy

2016-03-07 Thread Shamik Bandopadhyay
Hi,

  I'm trying to figure out the best way to design/allocate shards for our Solr
Cloud environment. Our current index has around 20 million documents, in 10
languages. Around 25-30% of the content is in English. The rest is almost
equally distributed among the remaining 13 languages. Till now, we had to
deal with query-time deduplication using the collapsing parser, for which we
used multi-level composite routing. But due to that, documents were
disproportionately distributed across 3 shards. The shard containing the
duplicate data ended up hosting 80% of the index. For example, Shard1 had a
30gb index while Shard2 and Shard3 had 10gb each. The composite key is
currently made of "language!dedup_id!url". At query time, we are using
shard.keys=language/8! for three-level routing.

Due to performance overhead, we decided to move the deduplication logic to
index time, which made the composite routing redundant. We are not
discarding the duplicate content, so there's no change in index size. Before
I update the routing key, I just wanted to check what would be the best
approach to the sharding architecture so that we get optimal performance.
We currently have 3 shards with 2 replicas each. The entire index resides
in one single collection. What I'm trying to understand is whether:

1. We let Solr use simple document routing based on id and route the
documents to any of the 3 shards.
2. We create a composite id using language, e.g. language!unique_id, and
make sure that content in the same language will always be in the same
shard. What I'm not sure about is whether the index will be equally
distributed across the three shards.
3. We index English-only content to a dedicated shard, with the rest equally
distributed to the remaining two. I'm not sure if that's possible.
4. We create a dedicated collection for English and one for the rest of the
languages.

Any pointers on this will be highly appreciated.

Regards,
Shamik