Re: Does the number of partitions affect performance?

Joan Touzet Thu, 26 Nov 2020 08:38:23 -0800

Hi Olaf, I don't have extensive experience with partitioned databases,
but I'm happy to provide some recommendations.

On 26/11/2020 01:48, Olaf Krueger wrote:
> Hi guys,
> 
> I am in the process of preparing our docs for partitioning.
> I guess a lot of us are already using something like a "docType" prop within 
> each document.
> So it seems to be obvious to partition the docs by its "docType".

This isn't typically what partitioned databases are used for. As the
documentation example shows, partitioning is great when you want,
*consistently*, to select a small percentage of documents out of a very
large database.

In the provided example, you have an IoT application, with (let's say)
thousands of sensors that all record a few readings a day:

    100000 sensors * 10 readings a day = 1mil documents a day

But most often you look at this data per-sensor, and let's say you're
looking at an individual sensor's data for a week. That would be:

    10 readings * 7 days = 70 documents / 7mil documents = 0.001%

This is why an index doesn't make sense in this case. Indexing by sensor
name, then retrieving just that 0.001% of the data, requires looking
across all of the shards in your database and getting an answer from all
of them. In CouchDB 2.x, this would be q=8 shards by default - but a
database that is growing by 1mil documents a day might well have q=16,
24, 32 or even more shards. Multiply by n=3 and that could be up to over
100 shards consulted across the cluster to retrieve a very small amount
of documents -- many of which will return no matches.

Partitioning keeps all of the related together in a single shard, so the
query for 70 documents above will all come from a single shard. That
means not waiting for 8*3=24 Erlang sub-processes (internal to CouchDB)
all to respond, then have those results collated, before getting a response.

The more critical portion is that secondary indexes are also scoped only
to that partition. As the documentation says on the last line:

> To be clear, this means that global queries perform identically to queries on 
> non-partitioned databases. Only partitioned queries on a partitioned database 
> benefit from the performance improvements.

So the real reason for jumping through the partitioned database hoops is
only when you know, conclusively, that you're going to want primarily to
ask questions only of your partitions, not globally. Keep in mind that
the recommendation for partitioned databases is to have a very large
number of partitions. That means that if you ever need to ask a global
question, you might not just be consulting 8 or 16 shards, but something
like 100000 partitions for your answer. That's considerably slower (and
harder for CouchDB to collate) than asking just 8 shards.

In my opinion, you only want to make this optimization if your data
meets this specific design pattern. (Another example would be a unified,
partition-per-user approach.) Maybe it makes sense in a different ratio
of docs-to-partitions, but I've not had exposure to that scenario (yet).

> Depending on the database, this would lead to a certain number of partitions.
> Let's say 10, or maybe 100 or more over the time.

In your case, standard Mango indexes (or JavaScript queries) is the
right approach. Partitions were introduced for a very specific reason:
when the pattern of user data leads to partitioning better than
CouchDB's automatic sharding algorithm, and where both primary and
secondary index lookups are only ever going to access documents within a
specific partition of documents.

> So I wonder, is there any limit for the number of partitions so that should 
> we think about more wisely about how to partition our database?

You also ask:

> Imagine we have e.g. 100.000 docs "of the same type" within a single 
> partition and we're far away from having 100.000 partitions and more across 
> the database.
> Could this be a hint that our docs are too complex and should rather splittet 
> into smaller docs?

That's a very different question...one that would require looking more
in depth at your documents and query patterns.

I would personally look at Mango partial indexes first - where you build
an index that contains only documents of a certain type. You can then
more easily ask sub-queries of that document type, such as a sub-date
range, or a sub-type.

One last thing: CouchDB 4.x will not (under the covers) implement
partitioned databases, as they provide no speedup in the data storage.
We're keeping the endpoints for now, just for compatibility, but they'll
eventually be dropped. Given this, unless there's a real compelling need
to bake partition-based queries throughout your app code base, I would
avoid them.

-Joan "parted -l /dev/couchdb" Touzet

Re: Does the number of partitions affect performance?

Reply via email to