Re: optimal shard assignment with low shard key cardinality using compositeId to enable shard splitting

2015-05-30 Thread Matteo Grolla
Wow,
thanks both for the suggestions

Erick: good point about the uneven shard load
I'm not worried about the growth of a particular shard; in that case I'd use 
shard splitting and, if necessary, add a server to the cluster.
But even if I manage to spread the docs of type A producers evenly across the 
cluster I could still have an uneven query distribution (the two problems are very 
similar): at time t one shard could be queried by 11 type A producers 
while another shard is being queried by a single type A producer, which is not ideal.
So I could use a few bits (0 or 1) of the composite id for type A 
producers' docs to avoid those kinds of problems.
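
A minimal sketch of what I mean, assuming Solr's compositeId syntax where the
number after the "/" is how many bits of the hash the shard-key part contributes
(the names and the 1-bit choice are only illustrative):

def make_doc_id(producer_id, doc_id, producer_type):
    # type A producers contribute only 1 bit of the 32-bit hash, so their docs
    # can spread over half the hash range; B/C keep the default 16 bits
    if producer_type == 'A':
        return "%s/1!%s" % (producer_id, doc_id)
    return "%s!%s" % (producer_id, doc_id)

print(make_doc_id("producerA1", "doc42", "A"))   # producerA1/1!doc42
print(make_doc_id("producerB7", "doc99", "B"))   # producerB7!doc99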

For typeB and typeC producers the problems discussed above seem unlikely, so 
I'd like to weigh the pros and cons of sharding on userid
pros
I'm reducing the size of the problem, instead of searching across the 
whole repository I'm searching only a part of it
cons
I could have uneven distribution of documents and queries across the 
cluster (unlikely, there are lots of users of typeB, typeC)
docs for one user aren't searched in parallel using more shards
this could be useful if one user produces so many docs that searching them 
benefits from sharding (should happen only for type A)

I think the pro is appealing: under these hypotheses, if users of type B and C 
increase I can scale the system without many concerns.
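
To make that pro concrete, here is a rough sketch (host, collection and field
names are made up) of routing a per-user query to just the shard(s) holding
that user's docs via the _route_ parameter, versus letting it fan out everywhere:

base = "http://localhost:8983/solr/docs/select"

# only the shard(s) whose hash range covers "userB123!" are consulted
routed_query = base + "?q=text:invoice&fq=owner:userB123&_route_=userB123!"

# without _route_ the same query is broadcast to every shard and merged
distributed_query = base + "?q=text:invoice&fq=owner:userB123"

print(routed_query)
print(distributed_query)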

Do you agree? 


On 29 May 2015, at 20:18, Reitzel, Charles wrote:

 Thanks, Erick.   I appreciate the sanity check.
 
 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com] 
 Sent: Thursday, May 28, 2015 5:50 PM
 To: solr-user@lucene.apache.org
 Subject: Re: optimal shard assignment with low shard key cardinality using 
 compositeId to enable shard splitting
 
 Charles:
 
 You raise good points, and I didn't mean to say that co-locating docs due to 
 some criteria was never a good idea. That said, it does add administrative 
 complexity that I'd prefer to avoid unless necessary.
 
 I suppose it largely depends on what the load and response SLAs are.
 If there's 1 query/second peak load, the sharding overhead for queries is 
 probably not noticeable. If there are 1,000 QPS, then it might be worth it.
 
 Measure, measure, measure..
 
 I think your composite ID understanding is fine.
 
 Best,
 Erick
 
 On Thu, May 28, 2015 at 1:40 PM, Reitzel, Charles 
 charles.reit...@tiaa-cref.org wrote:
 We have used a similar sharding strategy for exactly the reasons you say.   
 But we are fairly certain that the # of documents per user ID is < 5000 and, 
 typically, < 500.   Thus, we think the overhead of distributed searches 
 clearly outweighs the benefits.   Would you agree?   We have done some load 
 testing (with 100's of simultaneous users) and performance has been good 
 with data and queries distributed evenly across shards.
 
 In Matteo's case, this model appears to apply well to user types B and C.
 Not sure about user type A, though.   At > 100,000 docs per user per year, 
 on average, that load seems ok for one node.   But, is it enough to benefit 
 significantly from a parallel search?
 
 With a 2 part composite ID, each part will contribute 16 bits to a 32 bit 
 hash value, which is then compared to the set of hash ranges for each active 
 shard.   Since the user ID will contribute the high-order bytes, it will 
 dominate in matching the target shard(s).   But dominance doesn't mean the 
 lower order 16 bits will always be ignored, does it?   I.e. if the original 
 shard has been split, perhaps multiple times, isn't it possible that one 
 user ID's documents will be spread over multiple shards?
 
 In Matteo's case, it might make sense to specify fewer bits to the user ID 
 for user category A.   I.e. what I described above is the default for 
 userId!docId.   But if you use userId/8!docId/24 (8 bits for userId and 24 
 bits for the document ID), then couldn't one user's docs be split over 
 multiple shards, even without splitting?
 
 I'm just making sure I understand how composite ID sharding works correctly. 
   Have I got it right?  Has any of this logic changed in 5.x?
 
 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: Thursday, May 21, 2015 11:30 AM
 To: solr-user@lucene.apache.org
 Subject: Re: optimal shard assignment with low shard key cardinality 
 using compositeId to enable shard splitting
 
 I question your base assumption:
 
 bq: So shard by document producer seems a good choice
 
 Because what this _also_ does is force all of the work for a query onto one 
 node and all indexing for a particular producer ditto. And will cause you to 
 manually monitor your shards to see if some of them grow out of proportion 
 to others. And
 
 I think it would be much less hassle to just let Solr distribute the docs as 
 it may based on the uniqueKey and forget about it. Unless you want, say, to 
 do joins etc

RE: optimal shard assignment with low shard key cardinality using compositeId to enable shard splitting

2015-05-29 Thread Reitzel, Charles
Thanks, Erick.   I appreciate the sanity check.

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Thursday, May 28, 2015 5:50 PM
To: solr-user@lucene.apache.org
Subject: Re: optimal shard assignment with low shard key cardinality using 
compositeId to enable shard splitting

Charles:

You raise good points, and I didn't mean to say that co-locating docs due to 
some criteria was never a good idea. That said, it does add administrative 
complexity that I'd prefer to avoid unless necessary.

I suppose it largely depends on what the load and response SLAs are.
If there's 1 query/second peak load, the sharding overhead for queries is 
probably not noticeable. If there are 1,000 QPS, then it might be worth it.

Measure, measure, measure..

I think your composite ID understanding is fine.

Best,
Erick

On Thu, May 28, 2015 at 1:40 PM, Reitzel, Charles 
charles.reit...@tiaa-cref.org wrote:
 We have used a similar sharding strategy for exactly the reasons you say.   
 But we are fairly certain that the # of documents per user ID is < 5000 and, 
 typically, < 500.   Thus, we think the overhead of distributed searches 
 clearly outweighs the benefits.   Would you agree?   We have done some load 
 testing (with 100's of simultaneous users) and performance has been good with 
 data and queries distributed evenly across shards.

 In Matteo's case, this model appears to apply well to user types B and C.
 Not sure about user type A, though.   At > 100,000 docs per user per year, 
 on average, that load seems ok for one node.   But, is it enough to benefit 
 significantly from a parallel search?

 With a 2 part composite ID, each part will contribute 16 bits to a 32 bit 
 hash value, which is then compared to the set of hash ranges for each active 
 shard.   Since the user ID will contribute the high-order bytes, it will 
 dominate in matching the target shard(s).   But dominance doesn't mean the 
 lower order 16 bits will always be ignored, does it?   I.e. if the original 
 shard has been split, perhaps multiple times, isn't it possible that one user 
 ID's documents will be spread over multiple shards?

 In Matteo's case, it might make sense to specify fewer bits to the user ID 
 for user category A.   I.e. what I described above is the default for 
 userId!docId.   But if you use userId/8!docId/24 (8 bits for userId and 24 
 bits for the document ID), then couldn't one user's docs be split over 
 multiple shards, even without splitting?

 I'm just making sure I understand how composite ID sharding works correctly.  
  Have I got it right?  Has any of this logic changed in 5.x?

 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: Thursday, May 21, 2015 11:30 AM
 To: solr-user@lucene.apache.org
 Subject: Re: optimal shard assignment with low shard key cardinality 
 using compositeId to enable shard splitting

 I question your base assumption:

 bq: So shard by document producer seems a good choice

  Because what this _also_ does is force all of the work for a query onto one 
 node and all indexing for a particular producer ditto. And will cause you to 
 manually monitor your shards to see if some of them grow out of proportion to 
 others. And

 I think it would be much less hassle to just let Solr distribute the docs as 
 it may based on the uniqueKey and forget about it. Unless you want, say, to 
 do joins etc. There will, of course, be some overhead that you pay here, 
 but unless you can measure it and it's a pain I wouldn't add the complexity 
 you're talking about, especially at the volumes you're talking.

 Best,
 Erick

 On Thu, May 21, 2015 at 3:20 AM, Matteo Grolla matteo.gro...@gmail.com 
 wrote:
 Hi
 I'd like some feedback on how I'd like to solve the following 
 sharding problem


 I have a collection that will eventually become big

 Average document size is 1.5kb
 Every year 30 Million documents will be indexed

 Data come from different document producers (a person, owner of his
 documents) and queries are almost always performed by a document 
 producer who can only query his own documents. So shard by document 
 producer seems a good choice

 there are 3 types of doc producer
 type A,
 cardinality 105 (there are 105 producers of this type)
 produce 17M docs/year (the aggregated production of all type A producers)
 type B
 cardinality ~10k
 produce 4M docs/year
 type C
 cardinality ~10M
 produce 9M docs/year

 I'm thinking about
 use compositeId ( solrDocId = producerId!docId ) to send all docs of the 
 same producer to the same shards. When a shard becomes too large I can use 
 shard splitting.

 problems
 -documents from type A producers could be oddly distributed among 
 shards, because hashing doesn't work well on small numbers (105) see 
 Appendix

 As a solution I could do this when a new typeA producer (producerA1) arrives:

 1) client app: generate a producer code
 2) client app: simulate murmurhashing

RE: optimal shard assignment with low shard key cardinality using compositeId to enable shard splitting

2015-05-28 Thread Reitzel, Charles
We have used a similar sharding strategy for exactly the reasons you say.   But 
we are fairly certain that the # of documents per user ID is < 5000 and, 
typically, < 500.   Thus, we think the overhead of distributed searches clearly 
outweighs the benefits.   Would you agree?   We have done some load testing 
(with 100's of simultaneous users) and performance has been good with data and 
queries distributed evenly across shards.

In Matteo's case, this model appears to apply well to user types B and C.
Not sure about user type A, though.   At > 100,000 docs per user per year, on 
average, that load seems ok for one node.   But, is it enough to benefit 
significantly from a parallel search?

With a 2 part composite ID, each part will contribute 16 bits to a 32 bit hash 
value, which is then compared to the set of hash ranges for each active shard.  
 Since the user ID will contribute the high-order bytes, it will dominate in 
matching the target shard(s).   But dominance doesn't mean the lower order 16 
bits will always be ignored, does it?   I.e. if the original shard has been 
split, perhaps multiple times, isn't it possible that one user ID's documents 
will be spread over multiple shards?

In Matteo's case, it might make sense to specify fewer bits to the user ID for 
user category A.   I.e. what I described above is the default for userId!docId. 
  But if you use userId/8!docId/24 (8 bits for userId and 24 bits for the 
document ID), then couldn't one user's docs be split over multiple 
shards, even without splitting?
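
Here is a rough sketch of my mental model of the hash composition (a
simplification, not Solr's actual CompositeIdRouter code; mmh3 as in Matteo's
appendix, and key_bits=8 stands in for the userId/8!docId case):

import mmh3

def composite_hash(shard_key, doc_id, key_bits=16):
    # the shard key supplies the top key_bits of the 32-bit hash,
    # the doc id supplies the remaining low bits
    key_mask = ((1 << key_bits) - 1) << (32 - key_bits)
    h_key = mmh3.hash(shard_key) & 0xFFFFFFFF
    h_doc = mmh3.hash(doc_id) & 0xFFFFFFFF
    return (h_key & key_mask) | (h_doc & ~key_mask & 0xFFFFFFFF)

# default userId!docId: the high 16 bits (from the user id) dominate shard selection
print(hex(composite_hash("user42", "doc1")))
print(hex(composite_hash("user42", "doc2")))   # same high 16 bits, low 16 differ

# userId/8!docId: only 8 bits come from the user id, so with enough shards
# (or after enough splits) one user's docs can fall into different hash ranges
print(hex(composite_hash("user42", "doc1", key_bits=8)))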

I'm just making sure I understand how composite ID sharding works correctly.   
Have I got it right?  Has any of this logic changed in 5.x?

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Thursday, May 21, 2015 11:30 AM
To: solr-user@lucene.apache.org
Subject: Re: optimal shard assignment with low shard key cardinality using 
compositeId to enable shard splitting

I question your base assumption:

bq: So shard by document producer seems a good choice

 Because what this _also_ does is force all of the work for a query onto one 
node and all indexing for a particular producer ditto. And will cause you to 
manually monitor your shards to see if some of them grow out of proportion to 
others. And

I think it would be much less hassle to just let Solr distribute the docs as it 
may based on the uniqueKey and forget about it. Unless you want, say, to do 
joins etc. There will, of course, be some overhead that you pay here, but 
unless you can measure it and it's a pain I wouldn't add the complexity you're 
talking about, especially at the volumes you're talking.

Best,
Erick

On Thu, May 21, 2015 at 3:20 AM, Matteo Grolla matteo.gro...@gmail.com wrote:
 Hi
 I'd like some feedback on how I'd like to solve the following sharding 
 problem


 I have a collection that will eventually become big

 Average document size is 1.5kb
 Every year 30 Million documents will be indexed

 Data come from different document producers (a person, owner of his 
 documents) and queries are almost always performed by a document 
 producer who can only query his own document. So shard by document 
 producer seems a good choice

 there are 3 types of doc producer
 type A,
 cardinality 105 (there are 105 producers of this type)
 produce 17M docs/year (the aggregated production of all type A producers)
 type B
 cardinality ~10k
 produce 4M docs/year
 type C
 cardinality ~10M
 produce 9M docs/year

 I'm thinking about
 use compositeId ( solrDocId = producerId!docId ) to send all docs of the same 
 producer to the same shards. When a shard becomes too large I can use shard 
 splitting.

 problems
 -documents from type A producers could be oddly distributed among 
 shards, because hashing doesn't work well on small numbers (105) see 
 Appendix

 As a solution I could do this when a new typeA producer (producerA1) arrives:

 1) client app: generate a producer code
 2) client app: simulate murmurhashing and shard assignment
 3) client app: check shard assignment is optimal (producer code is 
 assigned to the shard with the least type A producers) otherwise goto 
 1) and try with another code

 when I add documents or perform searches for producerA1 I use its 
 producer code respectively in the compositeId or in the _route_ parameter. What 
 do you think?


 ---Appendix: murmurhash shard assignment 
 simulation---

 import mmh3

 hashes = [mmh3.hash(str(i)) >> 16 for i in xrange(105)]
 
 num_shards = 16
 shards = [0]*num_shards
 
 for hash in hashes:
     idx = hash % num_shards
     shards[idx] += 1

 print shards
 print sum(shards)

 -

 result: [4, 10, 6, 7, 8, 6, 7, 8, 11, 1, 8, 5, 6, 5, 5, 8]

 so with 16 shards and 105 shard keys I can have shards with as few as 1 key 
 and shards with as many as 11 keys



Re: optimal shard assignment with low shard key cardinality using compositeId to enable shard splitting

2015-05-28 Thread Erick Erickson
Charles:

You raise good points, and I didn't mean to say that co-locating docs
due to some criteria was never a good idea. That said, it does add
administrative complexity that I'd prefer to avoid unless necessary.

I suppose it largely depends on what the load and response SLAs are.
If there's 1 query/second peak load, the sharding overhead for queries
is probably not noticeable. If there are 1,000 QPS, then it might be
worth it.

Measure, measure, measure..

I think your composite ID understanding is fine.

Best,
Erick

On Thu, May 28, 2015 at 1:40 PM, Reitzel, Charles
charles.reit...@tiaa-cref.org wrote:
 We have used a similar sharding strategy for exactly the reasons you say.   
 But we are fairly certain that the # of documents per user ID is < 5000 and, 
 typically, < 500.   Thus, we think the overhead of distributed searches 
 clearly outweighs the benefits.   Would you agree?   We have done some load 
 testing (with 100's of simultaneous users) and performance has been good with 
 data and queries distributed evenly across shards.

 In Matteo's case, this model appears to apply well to user types B and C.
 Not sure about user type A, though.   At > 100,000 docs per user per year, 
 on average, that load seems ok for one node.   But, is it enough to benefit 
 significantly from a parallel search?

 With a 2 part composite ID, each part will contribute 16 bits to a 32 bit 
 hash value, which is then compared to the set of hash ranges for each active 
 shard.   Since the user ID will contribute the high-order bytes, it will 
 dominate in matching the target shard(s).   But dominance doesn't mean the 
 lower order 16 bits will always be ignored, does it?   I.e. if the original 
 shard has been split, perhaps multiple times, isn't it possible that one user 
 ID's documents will be spread over multiple shards?

 In Matteo's case, it might make sense to specify fewer bits to the user ID 
 for user category A.   I.e. what I described above is the default for 
 userId!docId.   But if you use userId/8!docId/24 (8 bits for userId and 24 
 bits for the document ID), then couldn't one user's docs be split over 
 multiple shards, even without splitting?

 I'm just making sure I understand how composite ID sharding works correctly.  
  Have I got it right?  Has any of this logic changed in 5.x?

 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: Thursday, May 21, 2015 11:30 AM
 To: solr-user@lucene.apache.org
 Subject: Re: optimal shard assignment with low shard key cardinality using 
 compositeId to enable shard splitting

 I question your base assumption:

 bq: So shard by document producer seems a good choice

  Because what this _also_ does is force all of the work for a query onto one 
 node and all indexing for a particular producer ditto. And will cause you to 
 manually monitor your shards to see if some of them grow out of proportion to 
 others. And

 I think it would be much less hassle to just let Solr distribute the docs as 
 it may based on the uniqueKey and forget about it. Unless you want, say, to 
 do joins etc. There will, of course, be some overhead that you pay here, 
 but unless you can measure it and it's a pain I wouldn't add the complexity 
 you're talking about, especially at the volumes you're talking.

 Best,
 Erick

 On Thu, May 21, 2015 at 3:20 AM, Matteo Grolla matteo.gro...@gmail.com 
 wrote:
 Hi
 I'd like some feedback on how I'd like to solve the following sharding
 problem


 I have a collection that will eventually become big

 Average document size is 1.5kb
 Every year 30 Million documents will be indexed

 Data come from different document producers (a person, owner of his
 documents) and queries are almost always performed by a document
 producer who can only query his own documents. So shard by document
 producer seems a good choice

 there are 3 types of doc producer
 type A,
 cardinality 105 (there are 105 producers of this type)
 produce 17M docs/year (the aggregated production of all type A producers)
 type B
 cardinality ~10k
 produce 4M docs/year
 type C
 cardinality ~10M
 produce 9M docs/year

 I'm thinking about
 use compositeId ( solrDocId = producerId!docId ) to send all docs of the 
 same producer to the same shards. When a shard becomes too large I can use 
 shard splitting.

 problems
 -documents from type A producers could be oddly distributed among
 shards, because hashing doesn't work well on small numbers (105) see
 Appendix

 As a solution I could do this when a new typeA producer (producerA1) arrives:

 1) client app: generate a producer code
 2) client app: simulate murmurhashing and shard assignment
 3) client app: check shard assignment is optimal (producer code is
 assigned to the shard with the least type A producers) otherwise goto
 1) and try with another code

 when I add documents or perform searches for producerA1 I use its
 producer code respectively in the compositeId or in the route

optimal shard assignment with low shard key cardinality using compositeId to enable shard splitting

2015-05-21 Thread Matteo Grolla
Hi
I'd like some feedback on how I plan to solve the following sharding problem


I have a collection that will eventually become big

Average document size is 1.5kb
Every year 30 Million documents will be indexed
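
To put a rough number on "big": 30 million docs/year at ~1.5 kb each is on the
order of 45 GB of raw document data per year (the index size will of course
differ, but it gives the scale).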

Data come from different document producers (a person, owner of his documents) 
and queries are almost always performed by a document producer who can only 
query his own documents. So shard by document producer seems a good choice

there are 3 types of doc producer
type A, 
cardinality 105 (there are 105 producers of this type)
produce 17M docs/year (the aggregated production of all type A producers)
type B
cardinality ~10k
produce 4M docs/year
type C
cardinality ~10M
produce 9M docs/year
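
Put in per-producer terms (simple division of the figures above): type A is
roughly 17M/105, about 160k docs per producer per year; type B roughly 4M/10k,
about 400; type C roughly 9M/10M, about 1. So only type A producers are
individually heavy.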

I'm thinking about 
using compositeId ( solrDocId = producerId!docId ) to send all docs of the same 
producer to the same shards. When a shard becomes too large I can use shard 
splitting.

problems
-documents from type A producers could be oddly distributed among shards, 
because hashing doesn't work well on small numbers (105) see Appendix

As a solution I could do this when a new typeA producer (producerA1) arrives:

1) client app: generate a producer code
2) client app: simulate murmurhashing and shard assignment
3) client app: check the shard assignment is optimal (the producer code is assigned to 
the shard with the fewest type A producers), otherwise goto 1) and try with 
another code (see the sketch below)
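
A rough sketch of steps 1-3 (the shard count, the code format and the
modulo-based assignment are simplifications along the lines of the appendix
below, not Solr's real hash-range assignment):

import random
import mmh3

NUM_SHARDS = 16
type_a_per_shard = [0] * NUM_SHARDS   # type A producer codes already placed per shard

def shard_of(producer_code):
    # high 16 bits of murmur3, then modulo, as in the appendix simulation
    return ((mmh3.hash(producer_code) & 0xFFFFFFFF) >> 16) % NUM_SHARDS

def pick_producer_code(max_tries=1000):
    # 1) generate a candidate code, 2) simulate hashing/assignment,
    # 3) keep it only if it lands on a least-loaded shard, else retry
    least_loaded = min(type_a_per_shard)
    for _ in range(max_tries):
        code = "pA-%06d" % random.randint(0, 999999)
        shard = shard_of(code)
        if type_a_per_shard[shard] == least_loaded:
            type_a_per_shard[shard] += 1
            return code, shard
    raise RuntimeError("no suitable producer code found")

code, shard = pick_producer_code()
print("new type A producer code %s -> shard %d" % (code, shard))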

when I add documents or perform searches for producerA1 I use its producer 
code respectively in the compositeId or in the _route_ parameter.
What do you think?
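
Concretely, the add and the query would look something like this (collection,
field names and the code value are only placeholders):

producer_code = "pA-000123"        # the code chosen for producerA1 as above

# indexing: the producer code becomes the routing prefix of the document id
doc = {"id": "%s!doc-20150521-001" % producer_code,
       "producer": producer_code,
       "text": "..."}

# querying: pass the same prefix (note the trailing '!') in _route_ so only
# the shard(s) owning that hash range are searched
select_url = ("http://localhost:8983/solr/docs/select"
              "?q=text:report&fq=producer:%s&_route_=%s!"
              % (producer_code, producer_code))

print(doc["id"])
print(select_url)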


---Appendix: murmurhash shard assignment 
simulation---

import mmh3

# take the high 16 bits of the murmur3 hash of each of the 105 producer keys
hashes = [mmh3.hash(str(i)) >> 16 for i in xrange(105)]

num_shards = 16
shards = [0]*num_shards

# count how many producer keys fall on each shard (simplified modulo assignment)
for hash in hashes:
    idx = hash % num_shards
    shards[idx] += 1

print shards
print sum(shards)

-

result: [4, 10, 6, 7, 8, 6, 7, 8, 11, 1, 8, 5, 6, 5, 5, 8]

so with 16 shards and 105 shard keys I can end up with
shards holding as few as 1 key and
shards holding as many as 11 keys



Re: optimal shard assignment with low shard key cardinality using compositeId to enable shard splitting

2015-05-21 Thread Erick Erickson
I question your base assumption:

bq: So shard by document producer seems a good choice

 Because what this _also_ does is force all of the work for a query
onto one node and all indexing for a particular producer ditto. And
will cause you to manually monitor your shards to see if some of them
grow out of proportion to others. And

I think it would be much less hassle to just let Solr distribute the
docs as it may based on the uniqueKey and forget about it. Unless you
want, say, to do joins etc. There will, of course, be some overhead
that you pay here, but unless you can measure it and it's a pain I
wouldn't add the complexity you're talking about, especially at the
volumes you're talking.

Best,
Erick

On Thu, May 21, 2015 at 3:20 AM, Matteo Grolla matteo.gro...@gmail.com wrote:
 Hi
 I'd like some feedback on how I'd like to solve the following sharding problem


 I have a collection that will eventually become big

 Average document size is 1.5kb
 Every year 30 Million documents will be indexed

 Data come from different document producers (a person, owner of his 
 documents) and queries are almost always performed by a document producer who 
 can only query his own documents. So shard by document producer seems a good 
 choice

 there are 3 types of doc producer
 type A,
 cardinality 105 (there are 105 producers of this type)
 produce 17M docs/year (the aggregated production of all type A producers)
 type B
 cardinality ~10k
 produce 4M docs/year
 type C
 cardinality ~10M
 produce 9M docs/year

 I'm thinking about
 use compositeId ( solrDocId = producerId!docId ) to send all docs of the same 
 producer to the same shards. When a shard becomes too large I can use shard 
 splitting.

 problems
 -documents from type A producers could be oddly distributed among shards, 
 because hashing doesn't work well on small numbers (105) see Appendix

 As a solution I could do this when a new typeA producer (producerA1) arrives:

 1) client app: generate a producer code
 2) client app: simulate murmurhashing and shard assignment
 3) client app: check shard assignment is optimal (producer code is assigned 
 to the shard with the least type A producers) otherwise goto 1) and try with 
 another code

 when I add documents or perform searches for producerA1 I use its producer 
 code respectively in the compositeId or in the _route_ parameter.
 What do you think?


 ---Appendix: murmurhash shard assignment 
 simulation---

 import mmh3

 hashes = [mmh3.hash(str(i)) >> 16 for i in xrange(105)]

 num_shards = 16
 shards = [0]*num_shards

 for hash in hashes:
     idx = hash % num_shards
     shards[idx] += 1

 print shards
 print sum(shards)

 -

 result: [4, 10, 6, 7, 8, 6, 7, 8, 11, 1, 8, 5, 6, 5, 5, 8]

 so with 16 shards and 105 shard keys I can end up with
 shards holding as few as 1 key and
 shards holding as many as 11 keys