wohali commented on a change in pull request #385: Documentation for partitioned dbs
URL: https://github.com/apache/couchdb-documentation/pull/385#discussion_r253231112
##########
File path: src/partitioned-dbs/index.rst
##########
@@ -0,0 +1,384 @@
+.. Licensed under the Apache License, Version 2.0 (the "License"); you may not
+.. use this file except in compliance with the License. You may obtain a copy of
+.. the License at
+..
+..   http://www.apache.org/licenses/LICENSE-2.0
+..
+.. Unless required by applicable law or agreed to in writing, software
+.. distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+.. WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+.. License for the specific language governing permissions and limitations under
+.. the License.
+
+.. _partitioned-dbs:
+
+=====================
+Partitioned Databases
+=====================
+
+As a means of introducing partitioned databases, we'll consider a motivating
+use case to describe the benefits of this feature. For this example we'll
+consider a database that stores readings from a large network of soil
+moisture sensors.
+
+.. note::
+    Before reading this document you should be familiar with the
+    :ref:`theory <cluster/theory>` of :ref:`sharding <cluster/sharding>`
+    in CouchDB.
+
+
+Traditionally, a document in this database may have something like the
+following structure:
+
+.. code-block:: javascript
+
+    {
+        "_id": "sensor-reading-ca33c748-2d2c-4ed1-8abf-1bca4d9d03cf",
+        "_rev":"1-14e8f3262b42498dbd5c672c9d461ff0",
+        "sensor_id": "sensor-260",
+        "location": [41.6171031, -93.7705674],
+        "field_name": "Bob's Corn Field #5",
+        "readings": [
+            ["2019-01-21T00:00:00", 0.15],
+            ["2019-01-21T06:00:00", 0.14],
+            ["2019-01-21T12:00:00", 0.16],
+            ["2019-01-21T18:00:00", 0.11]
+        ]
+    }
+
+
+.. note::
+    While this example uses IoT sensors, the main thing to consider is that
+    there is a logical grouping of documents. Similar use cases might be
+    documents grouped by user or scientific data grouped by experiment.
+
+
+So we've got a bunch of sensors, all grouped by the field they monitor
+along with their readouts for a given day (or other appropriate time period).
+
+Along with our documents we might expect to have two secondary indexes
+for querying our database that might look something like:
+
+.. code-block:: javascript
+
+    function(doc) {
+        if(doc._id.indexOf("sensor-reading-") != 0) {
+            return;
+        }
+        // Iterate by index: for...in would yield array keys, not readings.
+        for(var i = 0; i < doc.readings.length; i++) {
+            var r = doc.readings[i];
+            emit([doc.sensor_id, r[0]], r[1])
+        }
+    }
+
+and:
+
+.. code-block:: javascript
+
+    function(doc) {
+        if(doc._id.indexOf("sensor-reading-") != 0) {
+            return;
+        }
+        emit(doc.field_name, doc.sensor_id)
+    }
+
+With these two indexes defined we can easily find all readings for a given
+sensor, or list all sensors in a given field.
+
+Unfortunately, in CouchDB, when we read from either of these indexes, it
+requires finding a copy of every shard and asking for any documents related
+to the particular sensor or field. This means that as our database scales
+up the number of shards, every index request must perform more work.
+Fortunately for you, dear reader, partitioned databases were created to solve
+this precise problem.
+
+
+What is a partition?
+====================
+
+In the previous section, we introduced a hypothetical database that contains
+sensor readings from an IoT field monitoring service. In this particular
+use case, it's quite logical to group all documents by their ``sensor_id``
+field. In this case, we would call the ``sensor_id`` the partition.
+
+A good partition has two basic properties. First, it should have a high
+cardinality. That is, there is a large number of values for the partition.
+A database that has a single partition would be an anti-pattern for this
+feature. Secondly, the amount of data per partition should be "small". The
+general recommendation is to limit individual partitions to less than ten
+gigabytes of data.
Which, for the example of sensor documents, equates to roughly

Review comment:
"gigabytes (10 GB)" will help this be more readable through automatic translation