willholley commented on a change in pull request #410: add rfc for map indexes
URL: 
https://github.com/apache/couchdb-documentation/pull/410#discussion_r285986527
 
 

 ##########
 File path: rfcs/008-map-indexes.md
 ##########
 @@ -0,0 +1,181 @@
+# Map indexes RFC
+
+---
+name: Formal RFC
+about: Submit a formal Request For Comments for consideration by the team.
+title: ‘Map indexes on FoundationDB’
+labels: rfc, discussion
+assignees: ''
+
+---
+
+## Introduction
+
+This document describes the data model and index management for building and 
querying map indexes.
+
+## Abstract
+
+Map indexes will have their own data model stored in FoundationDB. The model 
includes grouping map indexes via their design doc's view signature. Each index 
will have the index key/value pairs stored, along with the last sequence number 
from the changes feed used to update the index.
+
+Indexes will use the changes feed and be updated via the background tasks 
queue. If the index only needs a very small update, the update can happen in 
the request instead of via the background job queue.
+
+There will be new size limitations on keys (10KB) and values (100KB) that are 
emitted from a map function.
+
+## Requirements Language
+
+[NOTE]: # ( Do not alter the section below. Follow its instructions. )
+
+The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
+"SHOULD", "SHOULD NOT", "RECOMMENDED",  "MAY", and "OPTIONAL" in this
+document are to be interpreted as described in
+[RFC 2119](https://www.rfc-editor.org/rfc/rfc2119.txt).
+
+## Terminology
+
+`Sequence`: a 13 byte value formed by combining the current `Incarnation` of 
the database and the `Versionstamp` of the transaction. Sequences are 
monotonically increasing even when a database is relocated across FoundationDB 
clusters. See (RFC002)[LINK TBD]  for a full explanation.
+
+`View Signature`:  A md5 hash of the views, options, view language defined in 
a design document.
+
+---
+
+## Detailed Description
+
+CouchDB views are used to create secondary indexes for the documents stored in 
a CouchDB database. An index is defined by creating a map/reduce functions in a 
design document. This document describes building the map indexes on top of 
FoundationDB (FDB).
+
+### Data model
+
+A map index is created via a design document, an example is shown below:
+
+```json
+{
+  "_id": "_design/design-doc-id",
+  "_rev": "1-8d361a23b4cb8e213f0868ea3d2742c2",
+  "views": {
+    "map-view": {
+      "map": "function (doc) {\n  emit(doc._id, 1);\n}"
+    }
+  },
+  "language": "javascript"
+}
+```
+
+The view’s map function will be used to generate keys and values which will be 
stored in FDB as the secondary index. Format for storing the key/values is:
+
+```json
+{<database>, ?VIEWS, <view_signature>, ?VIEW_KVS, <view_id>, ?MAP, <key>, 
<count>, <_id>} -> <emitted_value>
+```
+
+Where each field is defined as:
+
+* `<database>` is the specific database namespace
+* `?VIEWS` is the standard views namespace.
+* `view_signature` is the design documents `View Signature`
+* `view_id` name of a view defined in the design document
+* `?MAP` is the standard map namespace
+* `key` is the emitted row key from a map function
+* `count` is a value to allow duplicate keys to be emitted for a document
+* `values` is the emitted value from the map function
+
+### Key ordering
+
+FoundationDB orders key by byte value which is not how CouchDB currently 
orders keys. To maintain the way CouchDB currently does view collation, a type 
value will need to be prepended to each key so that the correct sort order of 
null < boolean < numbers < strings < arrays < objects is maintained.
+
+Strings will need an additional change in terms of how they are compared with 
ICU. An ICU sort string will be generated upfront and added to the string key. 
This value will be used to sort the string in FDB. The original string will be 
stored so that it can be used when returning the keys to the user.
+
+CouchDB allows duplicate keys to be emitted for an index, to allow for that a 
counter value `count` will be added to the end of the keys.
+
+### Emitting document
+
+In a map function it is possible to emit the full document as the value, this 
will cause an issue if the document size is larger than FDB’s value limit of 
100 KB. We can handle this in two possible ways.  The first is to keep the hard 
limit of only allowing 100 KB value to be emitted, so if a document exceeds 
that CouchDB will return an error. This is the preferred option.
+
+The second option is to detect that a map function is emitting the full 
document and then add in a foreign key reference back to the document subspace. 
The issue here is that CouchDB would only be able to return the latest version 
of the document, which would cause consistency issues when combined with the 
`update=false` argument.
+
+A third option would be to split the document across multiple keys in 
FoundationDB. This will still be limited by the transaction size limit.
+
+### Index Management
+
+For every document that needs to be processed for an index, we have to run the 
document through the javascript query server to get the emitted keys and 
values. This means that it won’t be possible to update a map/reduce index in 
the same transaction that a document is updated. To account for this, we will 
need to keep an `id index` similar to the `id tree`  that is currently keep in 
CouchDB. This index will hold the document id as the key and the value would be 
the keys that were emitted. CouchDB will use this information to know which 
fields need to be updated, added or removed from the index when a document is 
changed.  A data model for this would be:
+
+`{<database>, ?VIEWS, <view_signature>, ?VIEW_ID_INDEX, <_id>, <view_id>} -> 
[emitted keys]`
+
+Each index will be built and updated via the Background job queue [RFC Link 
TBD]. When a request for a view is received, the request process will add a job 
item onto the background queue for the index to be updated. A worker will take 
the item off the queue and update the index. Once the index has been built, the 
request will return with the results. This process can also be optimised in two 
ways. Firstly, using a new couch_events system to listen for document changes 
in an database and then adding indexing jobs to the queue to keep indexes warm. 
The second optimisation is if the index only requires a small update, rather 
update the index in the http request process instead of doing the work via the 
background queue.
+
+Initially the building of an index will be a single worker running through the 
changes feed and creating the index. Ideally it would be nice to parallelise 
that work so that multiple workers could build the index at the same time. This 
will reduce build times. This can be done by fetching the boundary keys for the 
changes feed, splitting those key ranges amongst different workers to build 
different parts of the index. This will require that for each document update 
processed, the worker must check the revision key space to determine if the 
document is the latest revision of the document. If it is not it should discard 
it. The other requirement is that CouchDB could only start serving the index 
once the update sequence is up to date AND all the workers have completed 
building their section of the index.
+
+### View clean up
+
+When a design document is changed, new indexes will be built and grouped under 
a new `View Signature`. The old map indexes will still be in FDB.  CouchDB will 
need to monitor `View Signatures` and be able to remove old indexes. To do this 
we create the following data model:
+
+`(<database>, ?VIEW_DDOC_IDS, <design_doc_id>) = ViewSignature`
+`(<database>, ?VIEW_SIGS, <view_signature>) = Counter`
+
+When a design document is created, CouchDB will store the design document id 
and the view signature and set the view signature counter to one. On update or 
deletion of a design document, CouchDB will get the old signature and decrement 
the counter. 
+If the counter is 0, then CouchDB will remove the old view index.
 
 Review comment:
   does deletion occur automatically or via the existing `_view_cleanup` 
endpoint? Just noting that it's been useful in the past that users can 
efficiently roll back to a previous version of a design document in the event 
that an updated index definition contains an error.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

Reply via email to