Re: [DISCUSS] Map Indexes

Garren Smith Tue, 16 Apr 2019 05:13:02 -0700

Hi Will,

Comments inline.


On Mon, Apr 15, 2019 at 5:24 PM Will Holley <[email protected]> wrote:

> Thanks Garren,
>
> As usual, a few questions :)
>
> 1. The data model suggests the idea of view groups gets carried over to
> fdb. Are there API / behaviour reasons to keep them? Would an index update
> transaction scope to a view group rather than a single view?
>

I haven't given this a lot of thought other than keeping some similarity to
how views are grouped together now. Do you think we should rather store it
another way and not worry about view groups?
Setting the transaction scope to the whole view group is a nice idea. I
think we will only really know if that will work after doing a bit of
testing. My main concern here is how complicated will the view updating
code get if we have to build in logic to try to update a whole view group,
and then if that fails due to transaction size limit or time limit and we
then have to scale back the transaction down to a per view transaction.


> 2. Regarding emitting doc as the value in a view function, this is so
> common that I wonder if it's worth handling as a special case. It sounds
> like there wouldn't be a solution for customers who use this technique to
> ensure they can retrieve the version of the document that is consistent
> with the emitted key?
>

Awesome idea, a special case does make sense. As Adam mentions we need to
do some performance testing to determine the best solution here.


> 3. When you say "Emitted keys will not be able to exceed 10 KB", do you
> mean any single emitted key cannot exceed 10KB? The "id index" proposal
> suggests there would also be a 100KB limit on the combined emitted key
> length.
>

There would be a 100 KB limit on the combined emitted keys stored as values
in the 'id index'. But then we use the emitted keys in the key for each row
in the index, so the combined emitted keys would also be limited to 10 KB.



> Cheers,
>
> Will
>
> On Mon, 15 Apr 2019 at 15:25, Garren Smith <[email protected]> wrote:
>
> > Hi Everyone,
> >
> > I want to start a discussion around creating map/reduce view indexes. One
> > way to get views indexes to work with FoundationDB is to break up a view
> > index into indexes for the map functions and indexes for the reduce
> > functions. Along those lines, I’m going to break the discussions into
> two,
> > this discussion around map functions and indexes and then a another one
> on
> > reduce functions and the indexes that go with those.
> >
> > ## Data model
> > For a map function, we need to store the emitted keys and the emitted
> > values:
> >
> > {?DATABASE, ?VIEWS, ?VIEW_SIGNATURE, ?VIEWS, <view_id>, ?MAP, <keys>,
> > <_id>} -> <emitted_value>
> >
> > To briefly explain what the above means, it creates a views subspace in a
> > database subspace, then every view defined on a design doc is grouped via
> > the design doc’s view signature. The view_id is the name of the view in
> the
> > design doc - we can look at ways to make that smaller to save some key
> > space. The ?MAP groups the key/value into the view’s map index subspace,
> > then we have the keys that were emitted for the map function and finally
> > the _id field of the document used to create the keys for this row.
> >
> > ## Emitted Value
> > The value stored for the row is the emitted value from the map function.
> > Because we have a limitation on the size of the value field one caveat
> > around this design is that a user will run into issues if they emit a
> > document that exceeds 100KB. In CouchDB we don’t recommend users emitting
> > the doc, but there are some nice speed optimisations you get by emitting
> > the document as the value. With CouchDB on FDB that performance
> > optimisation won’t be required and so we will have to actively discourage
> > users from doing that.
> >
> > Just to note, a user would experience the same issue if they emit a value
> > exceeding 100KB.
> >
> > ## Key ordering
> > There are some changes to how we will manage keys emitted from a map
> > function. For strings we will need to generate a ICU sort string upfront
> > instead of using the ICU comparison. To maintain the way CouchDB
> currently
> > does view collation [1], we need to prepend a type value to each key so
> > that we get the correct sort order of null < boolean < numbers < strings
> <
> > arrays < objects. CouchDB currently allows duplicate keys to be emitted
> for
> > an index, to allow for that a counter value will be added to the end of
> the
> > keys.
> >
> > ## Index Key Management
> > For every document that needs to be processed for an index, we have to
> run
> > the document through the javascript process to get the emitted keys and
> > values. This means that it won’t be possible to update a map/reduce index
> > in the same transaction that a document is updated. To account for this,
> we
> > will need to keep an `id index` similar to the `id tree` we current keep.
> > This index will hold the document id as the key and the value would be
> the
> > keys that were emitted. We would then use this information to know which
> > fields need to be updated or removed from the index when a document is
> > changed.  A data model for this would be:
> >
> > {?DATABASE, ?VIEWS, ?VIEW_SIGNATURE, ?VIEWS, <view_id>, ?ID_INDEX, <_id>,
> > <view_id>} -> [emitted keys]
> >
> > ## Updating an index
> > To help in knowing which documents have changed since a view was last
> > updated, we will need to keep the latest update sequence. This will
> change
> > from the really long string we currently have, to using the sequence
> value
> > defined in the _changes RFC [2].
> >
> > Based on all of that, a slightly expanded data model for map functions
> > inside a database subspace would look like:
> >
> > * views
> >     * <signature>
> >         * update_seq
> >         * idtree
> >             * (<_id>, <viewid>) -> [keys]
> >         * views
> >             * <viewid>
> >                 * map
> >                     * (<key>, <_id>) -> <value>
> >
> > ## Size limits
> > There are some size limits that are worth listing and keeping in mind.
> >
> > * Emitted keys will not be able to exceed 10 KB
> > * Values cannot exceed 100 KB
> > * Following from Alex’s email on how transaction sizes are calculated
> [3],
> > there could be rare cases where the number of key-value pairs emitted
> for a
> > map function could lead to a transaction either exceeding 10 MB which
> isn’t
> > allowed or exceeding 5 MB which impacts the performance of the cluster.
> We
> > will have to detect for those situations and split the transaction into
> > smaller transactions
> >
> > What do you think of that? Any questions or thoughts on this? Once again
> a
> > big acknowledgment to Adam who did the initial investigation and design
> > ideas around this.
> >
> > Cheers
> > Garren
> >
> > [1]
> >
> >
> http://docs.couchdb.org/en/stable/ddocs/views/collation.html#collation-specification
> > [2] https://github.com/apache/couchdb-documentation/pull/401
> > [3]
> >
> >
> https://lists.apache.org/thread.html/4976e0b7e3df89c5d64f37b5299b04c2ed01088f357be8aceaeedec1@%3Cdev.couchdb.apache.org%3E
> >
>

Re: [DISCUSS] Map Indexes

Reply via email to