I've created an RFC for this: https://github.com/apache/couchdb-documentation/pull/407
Thanks to everyone who participated in this discussion.

Cheers
Garren

On Mon, Apr 15, 2019 at 11:58 AM Garren Smith <gar...@apache.org> wrote:

> Hi Alex,
>
> Thanks for the really detailed explanation of transaction size calculation. That is really helpful. Adding a limit to the number of indexes is a good idea.
>
> Cheers
> Garren
>
> On Fri, Apr 12, 2019 at 3:58 AM Alex Miller <alexmil...@apple.com.invalid> wrote:
>
>> There was some discussion on the FDB forums previously about how to do index backfill. If you haven't seen it before, it's already reasonably similar to what you've proposed:
>>
>> https://forums.foundationdb.org/t/best-way-to-add-an-index-on-already-existing-data/97/2
>>
>> There are also two related topics that I'd like to provide some context and flavor on:
>> 1. The 10MB transaction size limit.
>> 2. The performance/latency/throughput implications of writing to multiple disjoint keys versus multiple nearby keys.
>>
>> ### Transaction Size Limit
>>
>> There was some previous concern surrounding the FDB 10MB transaction limit, what it means for CouchDB, and how to communicate the effective maximum document size to users. Index-on-write will impact this as well, as index writes will be counted against the 10MB limit.
>>
>> As I don't believe I've seen a precise definition yet of the maximum size of a document allowed in CouchDB-on-FDB, let us assume it's defined as "whatever FDB will accept". It would then be possible for a document to be inserted that just barely fits within the size limits, and for indexes to then be defined that apply to the document. One is now left with a document that exists, and is valid in the database, but couldn't be read and re-inserted into the database, because it would exceed the size limit. My largest concern would be that this would have bad implications for CouchDB replication.
>>
>> Although the transaction limit is 10MB, it's generally advisable to aim to stay under 1MB for the cluster to perform well. I've generally seen latency start to get strange for all clients when one client starts exceeding ~5MB. If one requires the total size of a document to stay under 1MB, this comfortably leaves room for indexes in the commit request, and the above case would be relatively unlikely.
>>
>> Exactly how unlikely involves dissecting exactly how the transaction size limit is calculated.
>>
>> The 10MB limit applies to the size of the commit request, and not to the sum of the key-value pair sizes, as one might expect. The largest difference between the two meanings is that read and write conflict ranges count against the 10MB limit. The rough formula for calculating commit size is:
>>
>> $$ 2*len(keys read) + 3*len(keys written) + len(values written) $$
>>
>> Each `get()` adds a read conflict range covering the read key and the sequentially next key. Each `set()` adds a write conflict range the same way, in addition to having to send the actual key and value. This is the source of the `2*` and `3*` multipliers.
>>
>> `getRange()` adds a read conflict range that exactly reflects the requested range. However, there is no `setRange()` to automatically optimize the write conflict ranges added when inserting sequential values. If one is inserting keys that, given the data model, represent one collective, sequential unit, such as a full document exploded across multiple key-value pairs, it's advisable to use `addWriteConflictRange(firstKey, lastKey+1)` to collapse the individual write conflict ranges into one large write conflict range, and thus greatly reduce the size of the commit request that will be sent.
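A minimal sketch of that optimization, using the fdb Python bindings with a hypothetical `docs` subspace and byte-string values. Note that for the collapse to actually shrink the request, the per-key conflict ranges also need to be suppressed, which the bindings expose as a transaction option:

```python
import fdb

fdb.api_version(610)

# Hypothetical subspace: documents exploded into (doc_id, field) -> value.
docs = fdb.Subspace(('docs',))

@fdb.transactional
def insert_exploded_doc(tr, doc_id, fields):
    # fields: dict mapping field name (str) -> encoded value (bytes).
    keys = sorted(docs.pack((doc_id, field)) for field in fields)
    for field, value in fields.items():
        # Skip the per-key write conflict range that set() would add;
        # otherwise each write still contributes ~3x its key size.
        tr.options.set_next_write_no_write_conflict_range()
        tr[docs.pack((doc_id, field))] = value
    # One conflict range covering the whole document: [first, last + b'\x00').
    tr.add_write_conflict_range(keys[0], keys[-1] + b'\x00')
```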
>> So using this optimization, blindly inserting a document using the scheme proposed by the "exploded KV" approach, one would end up with a formula looking more like:
>>
>> $$ 2*len(keys read) + 2*len(one key) + len(keys written) + len(values written) $$
>>
>> With index-on-write, we now need to add the insertion of (prefix/value/key, "") for each index entry into the equation:
>>
>> $$ 2*len(keys read) + 2*len(one key) + len(keys written) + len(values written) + number of indexes * 3 * ( len(prefix) + len(one value) + len(one key) ) $$
>>
>> And there's no similar write conflict range optimization that can be done here, as index writes will be non-sequential and spread widely across the keyspace.
>>
>> Index types that can cause a large number of entries to be written will have an even larger effect on total commit size. I've been using the mental example of indexing all the elements of an array nested deep within a document. However, if I've read the CouchDB documentation correctly, all of the index types that can cause this explosion are grouped as part of Mango Search indexes, and are thus out of scope for this proposal. Mango JSON indexes seem to have a direct correlation such that one (index, field) pair will turn into one key-value write to FDB. Convenient!
>>
>> Therefore, in order for a user to run into Mango JSON index-induced transaction size limits, they would have to manually add an astounding number of indexes. I've failed to quickly find an answer as to whether there's an existing maximum number of Mango JSON indexes that can be added in a CouchDB instance, but if there is, this might not be a concern anyway.
>>
>> The worst case would be that each field of a document is indexed. Assuming the "exploded KV" proposal as the representation, each key and value would be repeated 3 more times, and thus this would turn into:
>>
>> $$ 2*len(keys read) + 2*len(one key) + 3*len(prefix) + 4*len(keys written) + 4*len(values written) $$
>>
>> So if a maximum exploded-KV document size of 1MB is chosen, transaction commit requests would be expected to stay roughly under 4MB.
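A back-of-the-envelope check of that estimate, with assumed sizes: a fresh insert (so essentially nothing read back), a 1MB document split evenly between key bytes and value bytes, and every field indexed:

```python
keys_read    = 0        # fresh insert: (almost) nothing read back
one_key      = 100      # boundary keys of the collapsed conflict range
prefix       = 300      # index subspace prefixes, assumed small
keys_written = 500_000  # half of the 1 MB exploded document
vals_written = 500_000  # the other half

commit_size = (2 * keys_read
               + 2 * one_key
               + 3 * prefix
               + 4 * keys_written   # 1 doc write + 3 copies in index entries
               + 4 * vals_written)  # likewise for values
print(commit_size)  # 4001100 bytes, i.e. roughly 4 MB, under the 10 MB limit
```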
>> However, if you're thinking ahead, and considering that whatever choice is taken for Mango JSON indexes should also be taken for Mango Search indexes, the above is definitely a concern.
>>
>> ### Indexing Write Penalty
>>
>> And, as we're now discussing what would happen with thousands of indexes defined, let's briefly discuss what impact that would have on actually committing the transaction.
>>
>> With most distributed databases, each index added equates to an additional shard of data involved in the transaction, and thus likely an additional server and RPC. This would lead to an increase in latency and concurrency control costs as more indexes are added. It imposes a strong cost on each additional index, one much larger than the actual cost of the disk IO involved in doing the index write itself.
>>
>> FoundationDB does all commits against a distributed write-ahead log and, asynchronously to the commit, applies the data to each shard of data involved in the transaction. This means that a commit of 5 adjacent keys and a commit of 5 keys randomly distributed across the keyspace will be processed in the same way by the transaction subsystem. One would actually see better throughput at saturation when writing to randomly distributed keys, as the work of applying those mutations gets sharded across multiple storage servers.
>>
>> Creating many indexes still means more disk IO, but it is comparatively cheaper on FDB than in other distributed database architectures.
>>
>> > On Apr 11, 2019, at 5:42 AM, Garren Smith <gar...@apache.org> wrote:
>> >
>> > I was chatting with Adam yesterday and I want to explore the index-on-write indexing for Mango a bit more. I know there has been some discussion that we should only use a background process to build Mango indexes, but I think that building indexes as documents are created/updated, combined with background processing for existing documents, will be easier to implement, especially in the beginning as we build the new FDB layer.
>> >
>> > Below is the process for building a new index:
>> >
>> > 1. When a user defines a new index on an existing database, save the index definition and also save the sequence at which the index was added. The index should also be marked as being in a `building` phase, so it won't yet be used to service queries. (I'll come back to this later.)
>> > 2. Any write requests after that must read the new index definition and update the index. When updating the new index, the writers should assume that previous versions of the document have already been indexed.
>> > 3. At the same time, a background process will start reading sections of the changes feed and building the index. This background process will keep processing the changes feed until it reaches the sequence number at which the index was saved. Once it reaches that point, the index is up to date; it will be marked as `active` and can be used to service queries. (A sketch of this loop follows the list below.)
>> > 4. There is some subtle behaviour around step 3 that is worth mentioning. The background process is subject to the 5-second transaction limit, so it will process smaller parts of the changes feed, which means it won't have one consistent view of the changes feed throughout the index-building process. This can lead to a conflict situation, for example when a background process transaction is adding a document to the index while, at the same time, a write request transaction is updating the same document. There are two possible outcomes. If the background process wins, the write request will get a conflict; at that point the write request will try to process the document again, read the old values for that document, remove them from the index, and add the new values to the index. If the write request wins and the background process gets a conflict, the background process can simply try again: the document will have been removed from its old position in the changes feed and moved to a later position, so the background process won't see the document and will move on to the next one.
>> > 5. One other feature to add is an index progress tracker. We can do this by using doc_count for the database, and then having a counter value that the background workers increment with the number of documents they updated in each batch. We would also have to update this counter on write requests while the index is in building mode.
>> > 6. Something we can also explore is splitting the building of the index across multiple workers. We can use the `get_boundary_keys` [1] API call on the changes feed to get the full list of changes feed keys grouped by partition boundaries, and then split those across workers.
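A rough sketch of the background build loop from step 3, again using the fdb Python bindings. The changes-feed layout and the `add_doc_to_index` / `mark_index_active` helpers are assumptions for illustration, not part of the actual design:

```python
import fdb

fdb.api_version(610)

# Hypothetical layout; the real CouchDB-on-FDB layer will differ.
CHANGES = fdb.Subspace(('changes',))   # (seq,) -> doc_id

@fdb.transactional
def index_batch(tr, index, after_seq, upto_seq, batch_size=100):
    # Index one small slice of the changes feed, so each transaction stays
    # well inside the 5-second limit. Returns the last sequence processed,
    # or None once the slice (after_seq, upto_seq] is drained.
    last_seq = None
    begin = fdb.KeySelector.first_greater_than(CHANGES.pack((after_seq,)))
    end = CHANGES.pack((upto_seq,)) + b'\xff'   # inclusive upper bound
    for kv in tr.get_range(begin, end, limit=batch_size):
        add_doc_to_index(tr, index, kv.value)   # assumed helper
        last_seq = CHANGES.unpack(kv.key)[0]
    return last_seq

def build_index(db, index, created_at_seq):
    # If a concurrent writer wins a race on a document in this slice, the
    # batch transaction conflicts and @fdb.transactional simply retries it;
    # the document will have moved past created_at_seq in the changes feed,
    # where the write path (step 2) has already indexed it.
    seq = 0
    while True:
        last = index_batch(db, index, seq, created_at_seq)
        if last is None:
            break
        seq = last
    mark_index_active(db, index)                # assumed helper
```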
>> > Adding a building and an active state to the indexes is a bit of a breaking change, but I think it's the right way to go. Currently with Mango, what can happen is that a user creates an index and then immediately issues a query that would use that index; Mango then has to build the whole index before responding to that request. In this new index-on-write process, Mango would ignore the new index until it is active, which I think is the better way to go.
>> >
>> > Finally, a big acknowledgment to Adam, who is the major brains behind this design.
>> >
>> > What do you think? I would like to hear any thoughts, questions, or suggestions on this design.
>> >
>> > Cheers
>> > Garren
>> >
>> > [1] https://apple.github.io/foundationdb/api-python.html?highlight=boundary_keys#fdb.locality.fdb.locality.get_boundary_keys
>> >
>> > On Mon, Apr 8, 2019 at 3:50 PM Garren Smith <gar...@apache.org> wrote:
>> >
>> >> On Tue, Apr 2, 2019 at 3:14 AM Adam Kocoloski <kocol...@apache.org> wrote:
>> >>
>> >>> Hi Will, great comments, I have replies to a couple of them.
>> >>>
>> >>>> On Apr 1, 2019, at 5:21 AM, Will Holley <willhol...@gmail.com> wrote:
>> >>>>
>> >>>> 2. Does the ICU sort key have a bounded length? Mostly I'm wondering whether we can guarantee that the generated keys will fit within the maximum FDB key length, or if there needs to be some thought as to the failure mode / workaround. As Adam mentioned, it seems fine to store an encoded key, given Mango (currently) always fetches the associated document/fields from the primary index to filter on anyway. It might even be beneficial to have an additional layer of indirection and allow multiple docs to be associated with each row so that we can maintain compact keys.
>> >>>
>> >>> Interesting thought on that layer of indirection; it reminds me of an optimization applied in the Record Layer's text indexes. We would have to compare whether the extra reads needed to maintain the index that way are an acceptable tradeoff.
>> >>>
>> >>> Good point on the sort key sizes; I've not seen any way to place a reliably safe upper bound on the size of one that might be generated. The ICU folks have some hand-wavy guidance at http://userguide.icu-project.org/collation/architecture#TOC-Sort-key-size, but it seems like we might be able to dig a little deeper.
>> >>>
>> >>> I personally haven't given much thought to a workaround for when a user-defined index key exceeds 10 KB. We'll definitely need to handle that failure mode safely even without the sort key complication — people try crazy things :)
>> >>
>> >> For the 10 KB case, I think we should just return an error. As a comparison, MongoDB has a 1024-byte limit: https://docs.mongodb.com/manual/reference/limits/#Index-Key-Limit
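A minimal sketch of that up-front check, assuming a tuple-encoded index key; the function and subspace names are illustrative:

```python
import fdb

fdb.api_version(610)

FDB_MAX_KEY_SIZE = 10_000   # FDB's hard per-key limit, in bytes

def pack_index_key(index_subspace, sort_key, doc_id):
    # Pack a Mango index entry and reject oversized keys before writing,
    # so the document update fails with a clean CouchDB-level error rather
    # than a commit error deep inside the transaction.
    key = index_subspace.pack((sort_key, doc_id))
    if len(key) > FDB_MAX_KEY_SIZE:
        raise ValueError(
            f'index key is {len(key)} bytes, over the '
            f'{FDB_MAX_KEY_SIZE} byte FDB limit')
    return key
```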
>> >>>> 3. I don't immediately see how you clear previous values from the index when a doc is updated, but I could easily be missing something obvious :)
>> >>>
>> >>> Ah yeah, this part wasn't explicit, was it?
>> >>>
>> >>> I think the idea is that these are simple indexes on specific fields of a document, and we have a data model where those fields are already stored as their own keys in FDB, so there's no need (in the case of Mango) to maintain a separate docid -> {viewid, [keys]} mapping like we do today in each view group. Rather, the flow would go something like:
>> >>>
>> >>> 1) Check which fields are supposed to be indexed
>> >>> 2) Retrieve the values for those fields in the ?DOCUMENTS space for the parent revision
>> >>> 3) Compare the parent values with the ones supplied in this transaction; if any indexed values changed, clear the old ones and insert the new ones
>> >>>
>> >>> with some additional caveats around checking that the supplied edit is actually going to be the winning (and therefore indexed) version after the commit succeeds. (A sketch of this flow appears at the end of the thread.)
>> >>>
>> >>>> 4. Regarding "index on write" behaviour, is there something in the existing design (Mango overlaying mrview / Lucene) that would prevent this? I can see some benefit for certain workloads (and headaches for others), but I don't see that it's necessarily coupled to the Mango design, given that background indexing of new/changed indexes needs to be supported anyway.
>> >>>
>> >>> I'm not sure I understand your question. In my mind, the reason "index on write" is more applicable to Mango JSON than to generalized views is that in the view case, batching is currently quite important for achieving good throughput to the JS system. You're of course correct that we need to be able to re-generate Mango JSON indexes in the background as well.
>> >>>
>> >>> Adam
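A minimal sketch of the three-step flow Adam describes above, in the fdb Python bindings. The ?DOCUMENTS layout, the shape of the index objects, and the byte-string value encoding are all assumptions for illustration:

```python
import fdb

fdb.api_version(610)

# Hypothetical ?DOCUMENTS layout: (doc_id, rev, field) -> encoded value.
DOCS = fdb.Subspace(('docs',))

@fdb.transactional
def update_json_indexes(tr, indexes, doc_id, parent_rev, new_fields):
    # new_fields maps field name -> encoded value (bytes) for the winning
    # edit; each index object carries its covered fields and its subspace.
    # 2) Read the parent revision's stored fields in one range read.
    old_fields = {DOCS.unpack(k)[-1]: v
                  for k, v in tr[DOCS.range((doc_id, parent_rev))]}
    for index in indexes:
        for field in index.fields:              # 1) fields this index covers
            old = old_fields.get(field)
            new = new_fields.get(field)
            if old == new:
                continue                        # 3) unchanged: nothing to do
            if old is not None:
                tr.clear(index.subspace.pack((old, doc_id)))
            if new is not None:
                tr[index.subspace.pack((new, doc_id))] = b''
```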