Thanks everyone for the feedback. I've written up an RFC for this
https://github.com/apache/couchdb-documentation/pull/410
Hopefully, I've done a better job of explaining how the view signature will
be used. I've written up a section for when the document is emitted. I'm
having some doubts on whether we should make it a special case and return
the full doc from the doc namespace or rather just store the keys in the
index. I've written up both options.

On Fri, May 10, 2019 at 1:11 PM Jan Lehnardt <j...@apache.org> wrote:

>
>
> > On 15. Apr 2019, at 16:24, Garren Smith <gar...@apache.org> wrote:
> >
> > Hi Everyone,
> >
> > I want to start a discussion around creating map/reduce view indexes. One
> > way to get views indexes to work with FoundationDB is to break up a view
> > index into indexes for the map functions and indexes for the reduce
> > functions. Along those lines, I’m going to break the discussions into
> two,
> > this discussion around map functions and indexes and then a another one
> on
> > reduce functions and the indexes that go with those.
> >
> > ## Data model
> > For a map function, we need to store the emitted keys and the emitted
> > values:
> >
> > {?DATABASE, ?VIEWS, ?VIEW_SIGNATURE, ?VIEWS, <view_id>, ?MAP, <keys>,
> > <_id>} -> <emitted_value>
>
> in line with the “why keep view groups” question, the key path here
> include both the signature and the doc id and view id which already is a
> unique identifier, what purpose does the ?VIEW_SIGNATURE key have?
>
> >
> > To briefly explain what the above means, it creates a views subspace in a
> > database subspace, then every view defined on a design doc is grouped via
> > the design doc’s view signature. The view_id is the name of the view in
> the
> > design doc - we can look at ways to make that smaller to save some key
> > space. The ?MAP groups the key/value into the view’s map index subspace,
> > then we have the keys that were emitted for the map function and finally
> > the _id field of the document used to create the keys for this row.
> >
> > ## Emitted Value
> > The value stored for the row is the emitted value from the map function.
> > Because we have a limitation on the size of the value field one caveat
> > around this design is that a user will run into issues if they emit a
> > document that exceeds 100KB. In CouchDB we don’t recommend users emitting
> > the doc, but there are some nice speed optimisations you get by emitting
> > the document as the value. With CouchDB on FDB that performance
> > optimisation won’t be required and so we will have to actively discourage
> > users from doing that.
> >
> > Just to note, a user would experience the same issue if they emit a value
> > exceeding 100KB.
> >
> > ## Key ordering
> > There are some changes to how we will manage keys emitted from a map
> > function. For strings we will need to generate a ICU sort string upfront
> > instead of using the ICU comparison. To maintain the way CouchDB
> currently
> > does view collation [1], we need to prepend a type value to each key so
> > that we get the correct sort order of null < boolean < numbers < strings
> <
> > arrays < objects. CouchDB currently allows duplicate keys to be emitted
> for
> > an index, to allow for that a counter value will be added to the end of
> the
> > keys.
> >
> > ## Index Key Management
> > For every document that needs to be processed for an index, we have to
> run
> > the document through the javascript process to get the emitted keys and
> > values. This means that it won’t be possible to update a map/reduce index
> > in the same transaction that a document is updated. To account for this,
> we
> > will need to keep an `id index` similar to the `id tree` we current keep.
> > This index will hold the document id as the key and the value would be
> the
> > keys that were emitted. We would then use this information to know which
> > fields need to be updated or removed from the index when a document is
> > changed.  A data model for this would be:
> >
> > {?DATABASE, ?VIEWS, ?VIEW_SIGNATURE, ?VIEWS, <view_id>, ?ID_INDEX, <_id>,
> > <view_id>} -> [emitted keys]
> >
> > ## Updating an index
> > To help in knowing which documents have changed since a view was last
> > updated, we will need to keep the latest update sequence. This will
> change
> > from the really long string we currently have, to using the sequence
> value
> > defined in the _changes RFC [2].
> >
> > Based on all of that, a slightly expanded data model for map functions
> > inside a database subspace would look like:
> >
> > * views
> >    * <signature>
> >        * update_seq
> >        * idtree
> >            * (<_id>, <viewid>) -> [keys]
> >        * views
> >            * <viewid>
> >                * map
> >                    * (<key>, <_id>) -> <value>
> >
> > ## Size limits
> > There are some size limits that are worth listing and keeping in mind.
> >
> > * Emitted keys will not be able to exceed 10 KB
> > * Values cannot exceed 100 KB
> > * Following from Alex’s email on how transaction sizes are calculated
> [3],
> > there could be rare cases where the number of key-value pairs emitted
> for a
> > map function could lead to a transaction either exceeding 10 MB which
> isn’t
> > allowed or exceeding 5 MB which impacts the performance of the cluster.
> We
> > will have to detect for those situations and split the transaction into
> > smaller transactions
> >
> > What do you think of that? Any questions or thoughts on this? Once again
> a
> > big acknowledgment to Adam who did the initial investigation and design
> > ideas around this.
> >
> > Cheers
> > Garren
> >
> > [1]
> >
> http://docs.couchdb.org/en/stable/ddocs/views/collation.html#collation-specification
> > [2] https://github.com/apache/couchdb-documentation/pull/401
> > [3]
> >
> https://lists.apache.org/thread.html/4976e0b7e3df89c5d64f37b5299b04c2ed01088f357be8aceaeedec1@%3Cdev.couchdb.apache.org%3E
>
> --
> Professional Support for Apache CouchDB:
> https://neighbourhood.ie/couchdb-support/
>
>

Reply via email to