Hi,
I'd like some feedback on my plan for solving the following sharding problem.
I have a collection that will eventually become large.
Average document size is 1.5 KB.
Every year, 30 million documents will be indexed.
Data comes from different document producers (a person who owns their documents),
and queries are almost always performed by a document producer, who can only
query their own documents. So sharding by document producer seems a good choice.
There are 3 types of document producer:

type A:
  cardinality 105 (there are 105 producers of this type)
  produces 17M docs/year (the aggregated production of all type A producers)
type B:
  cardinality ~10k
  produces 4M docs/year
type C:
  cardinality ~10M
  produces 9M docs/year
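For a sense of scale, the raw index growth implied by these figures (30M docs/year at 1.5 KB each, ignoring index overhead and replication) works out to roughly 43 GB per year:

```python
# Rough yearly growth estimate from the figures above (raw document
# data only; actual index size depends on schema and replication).
docs_per_year = 30_000_000
avg_doc_size_kb = 1.5
total_gb = docs_per_year * avg_doc_size_kb / (1024 * 1024)
print(f"~{total_gb:.1f} GB of raw document data per year")
```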
My plan:
use a compositeId ( solrDocId = producerId!docId ) to send all docs of the same
producer to the same shard. When a shard becomes too large I can use shard
splitting.
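For illustration, building the composite IDs is just string concatenation with the `!` separator Solr's compositeId router recognizes (the producer and doc IDs below are hypothetical):

```python
def composite_id(producer_id: str, doc_id: str) -> str:
    """Build a Solr compositeId routing key: all docs sharing the same
    producer_id prefix hash into the same shard range."""
    return f"{producer_id}!{doc_id}"

# hypothetical IDs, for illustration only
print(composite_id("producerB42", "doc-0001"))  # producerB42!doc-0001
```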
Problems:
- documents from type A producers could be unevenly distributed among shards,
because hashing doesn't balance well over a small key set (105 keys); see the Appendix.
As a solution, I could do this whenever a new type A producer (producerA1) arrives:
1) client app: generate a producer code
2) client app: simulate murmur hashing and shard assignment
3) client app: check that the shard assignment is optimal (the producer code is
assigned to the shard with the fewest type A producers); otherwise go to 1) and
try another code
When I add documents or perform searches for producerA1, I use its producer
code in the compositeId or in the _route_ parameter, respectively.
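The retry loop in steps 1-3 could be sketched like this. This is only a sketch: the code format is hypothetical, and `shard_of` uses CRC32 as a stand-in hash; production code would have to reproduce Solr's exact MurmurHash3-based compositeId routing (as simulated in the appendix):

```python
import uuid
import zlib

NUM_SHARDS = 16

def shard_of(producer_code: str, num_shards: int = NUM_SHARDS) -> int:
    # Stand-in hash for the sketch; a real implementation must replicate
    # Solr's MurmurHash3-based compositeId routing (see Appendix).
    return zlib.crc32(producer_code.encode()) % num_shards

def assign_producer_code(type_a_counts: list[int]) -> str:
    """Retry random codes until one hashes to the shard currently
    holding the fewest type A producers, then record the assignment."""
    target = min(range(len(type_a_counts)), key=lambda s: type_a_counts[s])
    while True:
        code = "A-" + uuid.uuid4().hex[:8]  # hypothetical code format
        if shard_of(code) == target:
            type_a_counts[target] += 1
            return code
```

Since a random code lands on any given shard with probability ~1/num_shards, each assignment needs about num_shards attempts on average, which is cheap for 105 producers.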
What do you think?
-----------Appendix: murmurhash shard assignment simulation-----------------------
import mmh3

# Simulate routing for 105 producer codes: take the top 16 bits of the
# 32-bit MurmurHash3 of each code, as the compositeId prefix contributes
# to Solr's routing hash.
hashes = [mmh3.hash(str(i)) >> 16 for i in range(105)]
num_shards = 16
shards = [0] * num_shards
for h in hashes:
    shards[h % num_shards] += 1
print(shards)
print(sum(shards))
-------------
result: [4, 10, 6, 7, 8, 6, 7, 8, 11, 1, 8, 5, 6, 5, 5, 8]
So with 16 shards and 105 shard keys I can end up with
shards holding as few as 1 key and
shards holding as many as 11 keys.