There was some discussion on the FDB forums previously about how to do index
backfill. In case you hadn't seen it before, it's already reasonably similar
to what you've proposed:
https://forums.foundationdb.org/t/best-way-to-add-an-index-on-already-existing-data/97/2

There are also two related topics that I'd like to provide some context and
flavor on:
1. The 10MB transaction size limit.
2. The performance/latency/throughput implications of writing to multiple
   disjoint keys versus multiple nearby keys.

### Transaction Size Limit

There was some previous concern surrounding the FDB 10MB transaction limit,
what it means for CouchDB, and how to communicate the effective maximum
document size to users.  Index-on-write will impact this as well, as index
writes will be counted against the 10MB limit.

As I don't believe I've seen a precise definition yet of the maximum size of a
document allowed in CouchDB-on-FDB, let us assume it's defined as "whatever FDB
will accept".  It would then be possible that a document can be inserted that
just barely fits within the size limits, and then indexes can be defined that
apply to the document.  One is now left with a document that exists and is
valid in the database, but that couldn't be read and re-inserted, because
doing so would exceed the size limit.  My largest concern is that this would
have bad implications for CouchDB replication.

Although the transaction limit is 10MB, it's generally advisable to aim to stay
under 1MB for the cluster to perform well.  I've generally seen latency start
to get strange for all clients when one client starts exceeding ~5MB.  If one
requires the total size of a document to stay less than 1MB, then this
comfortably leaves room for indexes in the commit request, and the above case
would be relatively unlikely.

Exactly how unlikely involves dissecting how the transaction size limit is
calculated.

The 10MB limit applies to the size of the commit request, and not to the sum of
the key-value pair sizes, as one might expect.  The largest difference between
the two meanings is that read and write conflict ranges count against the 10MB
limit.  The rough formula for calculating commit size is:

$$ 2*len(keys read) + 3*len(keys written) + len(values written) $$

Each `get()` adds a read conflict range of the read key and the sequentially
next key.  Each `set()` adds a write conflict range the same way, in addition
to having to send the actual key and value.  This is the source of the `2*` and
`3*` multipliers.
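
For illustration, here's a rough back-of-the-envelope sketch in Python of how
one might estimate the commit-request size under this approximation.  The
function and its accounting are assumptions for illustration only, not the
exact bookkeeping FDB performs internally:

```python
# Rough commit-size estimator following the approximation above.  It assumes
# one conflict range per get() and per set(), and ignores per-mutation
# framing overhead in the commit request.

def estimated_commit_size(keys_read, writes):
    """keys_read: byte-string keys that were get() in the transaction.
    writes: (key, value) byte-string pairs that were set()."""
    read_conflicts = sum(2 * len(k) for k in keys_read)    # key + next key
    write_conflicts = sum(2 * len(k) for k, _ in writes)   # key + next key
    mutations = sum(len(k) + len(v) for k, v in writes)    # actual payload
    return read_conflicts + write_conflicts + mutations
```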

`getRange()` adds a read conflict range that exactly reflects the requested
range.  However, there is no `setRange()` to automatically optimize the write
conflict ranges added for inserting sequential values.  If one is inserting
keys that given the data model represent one collective, sequential unit, such
as a full document exploded across multiple key-value pairs, it's advisable to
use `addWriteConflictRange(firstKey, lastKey+1)` to collapse each individual
write conflict range into one large write conflict range, and thus greatly
reduce the size of the commit request that will be sent.
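
As a hedged sketch of what that might look like in the Python binding
(assuming the `next_write_no_write_conflict_range` transaction option and
`add_write_conflict_range`; the subspace names and document layout are purely
illustrative, not the actual CouchDB-on-FDB data model):

```python
import fdb
fdb.api_version(600)

# Sketch: write one "exploded" document as many contiguous KV pairs, but
# register a single write conflict range covering the whole document instead
# of one range per set().

@fdb.transactional
def write_exploded_doc(tr, doc_subspace, doc_id, fields):
    keys = []
    for path, value in sorted(fields.items()):
        key = doc_subspace.pack((doc_id, path))
        keys.append(key)
        # Suppress the automatic per-write conflict range for the next set().
        tr.options.set_next_write_no_write_conflict_range()
        tr[key] = fdb.tuple.pack((value,))
    # One collective write conflict range: [first key, last key + \x00)
    tr.add_write_conflict_range(keys[0], keys[-1] + b'\x00')
```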

So, using this optimization to blindly insert a document with the scheme
proposed by the "exploded KV" approach, one would end up with a formula looking
more like:

$$ 2*len(keys read) + 2*len(one key) + len(keys written) + len(values written) $$

With index-on-write, we now need to add inserting (prefix/value/key, "") for
each index entry into the equation:

$$ 2*len(keys read) + 2*len(one key) + len(keys written) + len(values written) +
   number of indexes * 3 * ( len(prefix) + len(one value) + len(one key) ) $$

And there's no similar write conflict range optimization that can be done, as
index writes will be non-sequential and spread widely across the keyspace.
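
To make the extra term concrete, a minimal sketch of the index-on-write
portion of the commit might look like the following.  The (subspace, field)
index descriptions here are illustrative assumptions, not the actual Mango
layout:

```python
# One KV write per (index, field) pair: the key is
# (index prefix, indexed value, doc key) and the value is empty.

def write_index_entries(tr, indexes, doc_key, doc_fields):
    """indexes: list of (index_subspace, field_name) pairs."""
    for index_subspace, field_name in indexes:
        value = doc_fields.get(field_name)
        if value is None:
            continue
        # These keys are scattered across the keyspace, so each set() keeps
        # its own write conflict range; there is no range to collapse here.
        tr[index_subspace.pack((value, doc_key))] = b''
```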

Index types that can cause a large number of entries to be written will have an
even larger effect on total commit size.  I’ve been using the mental example of
indexing all the elements of an array nested deep within a document.  However,
if I’ve read the CouchDB documentation correctly, all of the index types that
can cause this explosion are grouped as part of Mango Search indexes, and thus
out of scope for this proposal.  Mango JSON indexes seem to have a direct
correlation in which one (index, field) pair turns into one key-value write to
FDB.  Convenient!

Therefore, in order for a user to run into Mango JSON index-induced transaction
size limits, they would have to manually add an astounding number of indexes.
I've failed to quickly find an answer as to whether there's an existing maximum
number of Mango JSON indexes that can be added in a CouchDB instance, but if
there is, this might not be a concern anyway.

The worst case would be that each field of a document is indexed.  Assuming the
"exploded KV" proposal as the representation, each key and value would be
repeated 3 more times, and thus this would turn into

$$ 2*len(keys read) + 2*len(one key) + 3*len(prefix) + 4*len(keys written) +
   4*len(values written) $$

So if a maximum exploded KV size of 1MB is chosen, transaction commit requests
would be expected to stay roughly under 4MB.

However, if you're thinking into the future, and considering that whatever
choice is taken for Mango JSON indexes should also be taken for Mango Search
indexes, the above is definitely a concern.

### Indexing Write Penalty

And, since we're now considering what would happen with thousands of indexes
defined, let's briefly discuss what impact that would have on actually
committing the transaction.

With most distributed databases, each index added equates to an additional
shard of data involved in the transaction, and thus likely an additional server
and RPC.  This would lead to an increase in latency and concurrency control
costs as more indexes are added.  This adds a strong cost to each additional
index, one much larger than the actual cost of the disk IO involved in
performing the index write itself.

FoundationDB does all commits against a distributed write-ahead-log, and
asynchronously to the commit, applies the data to each shard of data involved in
the transaction.  This means that a commit of 5 adjacent keys and 5 keys
randomly distributed across the keyspace will be processed in the same way by
the transaction subsystem.  One would actually see better throughput at
saturation when writing to randomly distributed keys, as the work of applying
those mutations gets sharded across multiple storage servers.

Creating many indexes still means more disk IO, but it is comparatively cheaper
on FDB than on other distributed database architectures.

> On Apr 11, 2019, at 5:42 AM, Garren Smith <gar...@apache.org> wrote:
> 
> 
> I was chatting to Adam yesterday and I want to explore the index-on-write
> indexing for Mango a bit more. I know there has been a bit of a discussion
> that we should only use a background process to build Mango indexes, but I
> think that building indexes as documents are created/updated, combined
> with background processing for existing documents, will be easier to
> implement, especially in the beginning as we build the new fdb layer.
> 
> Below is the process for building a new index:
> 
> 1. When a user defines a new index on an existing database, save the index
> definition and also save the sequence that the index was added at. The
> index should also be marked as being in a `building` phase so it won’t
> be used yet to service queries. (I’ll come back to this later)
> 2. Any write requests after that must read the new index definition and
> update the index. When updating the new index, the writers should assume
> that previous versions of the document have already been indexed.
> 3. At the same time, a background process will start reading sections of the
> changes feed and building the index. This background process will keep
> processing the changes feed until it reaches the sequence number that the
> index was saved at. Once it reaches that point, the index is up to date and
> will be marked as `active` and can be used to service queries.
> 4. There is some subtle behaviour around step 3 that is worth mentioning.
> The background process is subject to the 5 second transaction limit, so it
> will process smaller parts of the changes feed, which means that it won’t
> have one consistent view of the changes feed throughout the index building
> process. This can lead to a conflict situation, for example when the
> background process transaction is adding a document to the index while at
> the same time a write request has a transaction that is updating the same
> document. There are two possible outcomes to this. If the background
> process wins, the write request will get a conflict. At that point the
> write request will try to process the document again: read the old values
> for that document, remove them from the index, and add the new values to
> the index. If the write request wins, and the background process gets a
> conflict, then the background process can try again; the document would
> have been removed from its old position in the changes feed and moved to
> a later position, so the background process won’t see the document and
> will simply move on to the next one.
> 5. One other feature to add is an index progress tracker. We can do this
> by using doc_count for the database, and then having a counter value that the
> background workers can increment with the number of documents they updated
> for each batch update.  We would also have to update this counter on write
> requests while the index is in building mode.
> 6. Something we can also explore is splitting the building of the index
> across multiple workers. We can use the `get_boundary_keys` [1] API call on
> the changes feed to get the full list of the changes feed keys grouped by
> partition boundaries, and then split that work across workers.
> 
> Adding a building and active state to the indexes is a bit of a breaking
> change, but I think it’s the right way to go. Currently with Mango, what can
> happen is a user creates an index and then immediately issues a query that
> would use that index. Mango would then have to build the whole index before
> responding to that request. In this new index-on-write process, Mango would
> ignore the new index until it is active, which I think is the better way to
> go on this.
> 
> Finally, a big acknowledgment to Adam who is the major brains behind this
> design.
> 
> What do you think? I would like to hear any thoughts, questions, or
> suggestions on this design.
> 
> Cheers
> Garren
> 
> [1]
> https://apple.github.io/foundationdb/api-python.html?highlight=boundary_keys#fdb.locality.fdb.locality.get_boundary_keys
> 
> On Mon, Apr 8, 2019 at 3:50 PM Garren Smith <gar...@apache.org> wrote:
> 
>> 
>> 
>> On Tue, Apr 2, 2019 at 3:14 AM Adam Kocoloski <kocol...@apache.org> wrote:
>> 
>>> Hi Will, great comments, I have replies to a couple of them.
>>> 
>>>> On Apr 1, 2019, at 5:21 AM, Will Holley <willhol...@gmail.com> wrote:
>>>> 
>>>> 2. Does the ICU sort key have a bounded length? Mostly I'm wondering
>>>> whether we can guarantee that the generated keys will fit within the
>>>> maximum FDB key length or if there needs to be some thought as to the
>>>> failure mode / workaround. As Adam mentioned, it seems fine to store an
>>>> encoded key given Mango (currently) always fetches the associated
>>> document
>>>> / fields from the primary index to filter on anyway. It might even be
>>>> beneficial to have an additional layer of indirection and allow multiple
>>>> docs to be associated with each row so that we can maintain compact
>>> keys.
>>> 
>>> Interesting thought on that layer of indirection; it reminds me of an
>>> optimization applied in the Record Layer’s text indexes. Would have to
>>> compare whether the extra reads needed to maintain the index that way are
>>> an acceptable tradeoff.
>>> 
>>> Good point on the sort key sizes, I’ve not seen any way to place a
>>> reliably safe upper bound on the size of one that might be generated. The
>>> ICU folks have some hand-wavey guidance at
>>> http://userguide.icu-project.org/collation/architecture#TOC-Sort-key-size,
>>> but it seems like we might be able to dig a little deeper.
>>> 
>>> I personally haven’t given much thought to a workaround where a
>>> user-defined index key exceeds 10 KB. We’ll definitely need to handle that
>>> failure mode safely even without the sort key complication — people try
>>> crazy things :)
>>> 
>> 
>> For the 10 KB error, I think we should just return an error. As a
>> comparison, MongoDB has a 1024 Byte limit
>> https://docs.mongodb.com/manual/reference/limits/#Index-Key-Limit
>> 
>> 
>>>> 3. I don't immediately see how you clear previous values from the index
>>>> when a doc is updated, but I could easily be missing something obvious
>>> :)
>>> 
>>> Ah yeah, this part wasn’t explicit, was it?
>>> 
>>> I think the idea is that these are simple indexes on specific fields of a
>>> document, and we have a data model where those fields are already stored as
>>> their own keys in FDB, so there’s no need (in the case of Mango) to
>>> maintain a separate docid -> {viewid, [keys]} mapping like we do today in
>>> each view group. Rather, the flow would go something like
>>> 
>>> 1) Check which fields are supposed to be indexed
>>> 2) Retrieve values for those fields in the ?DOCUMENTS space for the
>>> parent revision
>>> 3) Compare the parent values with the ones supplied in this transaction;
>>> if any indexed values change, clear the old ones and insert the new ones
>>> 
>>> with some additional caveats around checking that the supplied edit is
>>> actually going to be the winning (and therefore indexed) version after the
>>> commit succeeds.
>>> 
>>>> 4. Regarding "Index on write" behaviour, is there something in the
>>> existing
>>>> design (Mango overlaying mrview / lucene) that would prevent this? I can
>>>> see some benefit for certain workloads (and headaches for others) but I
>>>> don't see that it's necessarily coupled to the Mango design given
>>>> background indexing of new/changed indexes needs to be supported anyway.
>>> 
>>> I’m not sure I understand your question. In my mind the reason “index on
>>> write" is more applicable for Mango JSON than for generalized views is
>>> because in the view case batching is currently quite important to achieve
>>> good throughput to the JS system. You’re of course correct that we need to
>>> be able to re-generate Mango JSON indexes in the background as well.
>>> 
>>> Adam
>>> 
>>> 
>>> 
