Thanks Joan, I was also looking into partitioning my postsdb by type, as the data gets queried by type. If partitioned databases are going to be dropped in v4, I'll stick with Mango indexes as well :)
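
In case it's useful to anyone else reading along, this is roughly what I mean by querying by type with a plain Mango index. It's only a minimal sketch under my own assumptions: a local CouchDB at http://localhost:5984, my "postsdb" database, a "type" field on every doc, and placeholder admin credentials.

import requests

COUCH = "http://localhost:5984"
AUTH = ("admin", "secret")  # placeholder credentials, not real ones

# Create a Mango (JSON) index on the "type" field.
requests.post(
    f"{COUCH}/postsdb/_index",
    auth=AUTH,
    json={"index": {"fields": ["type"]}, "name": "by-type", "type": "json"},
).raise_for_status()

# Fetch a page of docs of one type; the query planner picks the index above.
resp = requests.post(
    f"{COUCH}/postsdb/_find",
    auth=AUTH,
    json={"selector": {"type": "post"}, "limit": 50},
)
print(resp.json()["docs"])
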
> On Nov 27, 2020, at 2:11 AM, Olaf Krueger <[email protected]> wrote:
>
> Joan,
>
> thank you so much for this awesome explanation!
>
> It was tempting to use the "/partition" endpoints in order to be able to
> immediately fetch a group of docs "out of the box" (even if using Mango
> indexes isn't that hard).
> We already use prefixes within our _ids and try to compose/optimize the
> "_ids" for querying the "_all_docs" view whenever it makes sense, instead
> of creating an index for each use case.
> Using the built-in partition key over custom "_id" prefixes felt way
> cleaner and suggested a more structured database (from a human's
> perspective, and the desire not to lose control).
> But, besides the 4.0 hint, and thanks to your explanation, it turns out
> that partitions really don't make sense in our use case. We'll stick with
> our "_id" patterns and create indexes/views when needed; it just works.
>
> Keep up the great work!
>
> Thanks,
> Olaf
>
>
> On 2020/11/26 16:38:08, Joan Touzet <[email protected]> wrote:
>> Hi Olaf, I don't have extensive experience with partitioned databases,
>> but I'm happy to provide some recommendations.
>>
>> On 26/11/2020 01:48, Olaf Krueger wrote:
>>> Hi guys,
>>>
>>> I am in the process of preparing our docs for partitioning.
>>> I guess a lot of us are already using something like a "docType" prop
>>> within each document.
>>> So it seems obvious to partition the docs by their "docType".
>>
>> This isn't typically what partitioned databases are used for. As the
>> documentation example shows, partitioning is great when you want,
>> *consistently*, to select a small percentage of documents out of a very
>> large database.
>>
>> In the provided example, you have an IoT application, with (let's say)
>> thousands of sensors that all record a few readings a day:
>>
>> 100000 sensors * 10 readings a day = 1mil documents a day
>>
>> But most often you look at this data per-sensor, and let's say you're
>> looking at an individual sensor's data for a week. That would be:
>>
>> 10 readings * 7 days = 70 documents / 7mil documents = 0.001%
>>
>> This is why a global index doesn't make sense in this case. Indexing by
>> sensor name, then retrieving just that 0.001% of the data, requires
>> looking across all of the shards in your database and getting an answer
>> from all of them. In CouchDB 2.x, this would be q=8 shards by default --
>> but a database that is growing by 1mil documents a day might well have
>> q=16, 24, 32 or even more shards. Multiply by n=3 and that could be over
>> 100 shards consulted across the cluster to retrieve a very small number
>> of documents -- many of which will return no matches.
>>
>> Partitioning keeps all of the related documents together in a single
>> shard, so the query for 70 documents above will all come from a single
>> shard. That means not waiting for 8*3=24 Erlang sub-processes (internal
>> to CouchDB) all to respond, then having those results collated, before
>> getting a response.
>>
>> The more critical point is that secondary indexes are also scoped only
>> to that partition. As the documentation says on the last line:
>>
>>> To be clear, this means that global queries perform identically to
>>> queries on non-partitioned databases. Only partitioned queries on a
>>> partitioned database benefit from the performance improvements.
>>
>> So the real reason for jumping through the partitioned database hoops is
>> only when you know, conclusively, that you're going to want primarily to
>> ask questions of your partitions, not globally. Keep in mind that the
>> recommendation for partitioned databases is to have a very large number
>> of partitions. That means that if you ever need to ask a global
>> question, you might not just be consulting 8 or 16 shards, but something
>> like 100000 partitions for your answer. That's considerably slower (and
>> harder for CouchDB to collate) than asking just 8 shards.
>>
>> In my opinion, you only want to make this optimization if your data
>> meets this specific design pattern. (Another example would be a unified,
>> partition-per-user approach.) Maybe it makes sense in a different ratio
>> of docs-to-partitions, but I've not had exposure to that scenario (yet).
>>
>>> Depending on the database, this would lead to a certain number of
>>> partitions.
>>> Let's say 10, or maybe 100 or more over time.
>>
>> In your case, standard Mango indexes (or JavaScript queries) are the
>> right approach. Partitions were introduced for a very specific reason:
>> when the pattern of user data leads to partitioning better than
>> CouchDB's automatic sharding algorithm, and where both primary and
>> secondary index lookups are only ever going to access documents within
>> a specific partition of documents.
>>
>>> So I wonder, is there any limit on the number of partitions, so that we
>>> should think more wisely about how to partition our database?
>>
>> You also ask:
>>
>>> Imagine we have e.g. 100,000 docs "of the same type" within a single
>>> partition and we're far away from having 100,000 partitions and more
>>> across the database.
>>> Could this be a hint that our docs are too complex and should rather be
>>> split into smaller docs?
>>
>> That's a very different question... one that would require looking more
>> in depth at your documents and query patterns.
>>
>> I would personally look at Mango partial indexes first - where you build
>> an index that contains only documents of a certain type. You can then
>> more easily ask sub-queries of that document type, such as a sub-date
>> range, or a sub-type.
>>
>> One last thing: CouchDB 4.x will not (under the covers) implement
>> partitioned databases, as they provide no speedup in the data storage.
>> We're keeping the endpoints for now, just for compatibility, but they'll
>> eventually be dropped. Given this, unless there's a really compelling
>> need to bake partition-based queries throughout your app code base, I
>> would avoid them.
>>
>> -Joan "parted -l /dev/couchdb" Touzet
>>
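
P.S. For anyone who finds this thread later, here's roughly how the "_id prefix" pattern Olaf describes above can look in practice. This is only a sketch under assumptions of my own: hypothetical ids like "post:2020-11-27-abc123", the same local CouchDB at http://localhost:5984, my "postsdb" database, and placeholder credentials. Because _all_docs is always sorted by _id, one range query returns every doc sharing a prefix, with no extra index to build.

import json
import requests

COUCH = "http://localhost:5984"
AUTH = ("admin", "secret")  # placeholder credentials

prefix = "post:"
resp = requests.get(
    f"{COUCH}/postsdb/_all_docs",
    auth=AUTH,
    params={
        # startkey/endkey must be JSON-encoded strings.
        "startkey": json.dumps(prefix),
        # U+FFF0 is a high sentinel character, so the range covers every
        # _id beginning with the prefix.
        "endkey": json.dumps(prefix + "\ufff0"),
        "include_docs": "true",
    },
)
for row in resp.json()["rows"]:
    print(row["id"], row["doc"].get("type"))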
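
And here's roughly what Joan's partial-index suggestion could look like -- again just a sketch, with a made-up "published_at" field and the same assumed server, database and credentials as above. The index only contains docs whose type is "post", so date-range queries within that type stay small.

import requests

COUCH = "http://localhost:5984"
AUTH = ("admin", "secret")  # placeholder credentials

# Partial index: only docs matching the filter are indexed, keyed on a
# hypothetical "published_at" field.
requests.post(
    f"{COUCH}/postsdb/_index",
    auth=AUTH,
    json={
        "index": {
            "partial_filter_selector": {"type": "post"},
            "fields": ["published_at"],
        },
        "name": "posts-by-published-at",
        "type": "json",
    },
).raise_for_status()

# A sub-query of that document type: posts in a date range.
resp = requests.post(
    f"{COUCH}/postsdb/_find",
    auth=AUTH,
    json={
        "selector": {
            "type": "post",  # must still match the partial filter
            "published_at": {"$gte": "2020-11-01", "$lt": "2020-12-01"},
        },
        "use_index": "posts-by-published-at",
    },
)
print(len(resp.json()["docs"]), "posts in range")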
