Hi Alex,

Thanks for the really detailed explanation of transaction size calculation.
That is really helpful. Adding a limit to the number of indexes is a good idea.

Cheers
Garren

On Fri, Apr 12, 2019 at 3:58 AM Alex Miller <alexmil...@apple.com.invalid>
wrote:

> There was some discussion on the FDB forums previously about how to do
> index backfill, in case you hadn't seen it before; it's already reasonably
> similar to what you've proposed:
>
> https://forums.foundationdb.org/t/best-way-to-add-an-index-on-already-existing-data/97/2
>
> There are also two related topics that I'd like to provide some context and
> flavor on:
> 1. The 10MB transaction size limit.
> 2. The performance/latency/throughput implications of writing to multiple
>     disjoint keys versus multiple nearby keys.
>
> ### Transaction Size Limit
>
> There was some previous concern surrounding the FDB 10MB transaction limit,
> what it means for CouchDB, and how to communicate the effective maximum
> document size to users.  Index-on-write will impact this as well, as index
> writes will be counted against the 10MB limit.
>
> As I don't believe I've seen a precise definition yet of the maximum size of
> a document allowed in CouchDB-on-FDB, let us assume it's defined as "whatever
> FDB will accept".  It would then be possible that a document can be inserted
> that just barely fits within the size limits, and then indexes can be defined
> that apply to the document.  One is now left with a document that exists, and
> is valid in the database, but couldn't be read and re-inserted into the
> database, because it would exceed the size limit.  My largest concern would
> be that this would have bad implications on CouchDB replication.
>
> Although the transaction limit is 10MB, it's generally advisable to aim to
> stay under 1MB for the cluster to perform well.  I've generally seen latency
> start to get strange for all clients when one client starts exceeding ~5MB.
> If one requires the total size of a document to stay less than 1MB, then this
> comfortably leaves room for indexes in the commit request, and the above case
> would be relatively unlikely.
>
> Quantifying exactly how unlikely requires dissecting how the transaction size
> limit is calculated.
>
> The 10MB limit applies to the size of the commit request, and not to the sum
> of the key-value pair sizes, as one might expect.  The largest difference
> between the two meanings is that read and write conflict ranges count against
> the 10MB limit.  The rough formula for calculating commit size is:
>
> $$ 2*len(keys read) + 3*len(keys written) + len(values written) $$
>
> Each `get()` adds a read conflict range of the read key and the sequentially
> next key.  Each `set()` adds a write conflict range the same way, in addition
> to having to send the actual key and value.  This is the source of the `2*`
> and `3*` multipliers.
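>
> As an illustration of that accounting, here is a rough Python sketch of a
> commit-size estimator (the helper and its names are mine, not part of the FDB
> API; only the 2*/3* weighting comes from the formula above):
>
>     def estimate_commit_size(keys_read, writes):
>         """keys_read: iterable of key bytes; writes: iterable of (key, value) bytes pairs."""
>         size = 0
>         for key in keys_read:
>             # read conflict range: the key plus its successor
>             size += 2 * len(key)
>         for key, value in writes:
>             # key and value sent once, plus the key-sized write conflict range
>             size += 3 * len(key) + len(value)
>         return size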
>
> `getRange()` adds a read conflict range that exactly reflects the requested
> range.  However, there is no `setRange()` to automatically optimize the write
> conflict ranges added for inserting sequential values.  If one is inserting
> keys that, given the data model, represent one collective, sequential unit,
> such as a full document exploded across multiple key-value pairs, it's
> advisable to use `addWriteConflictRange(firstKey, lastKey+1)` to collapse
> each individual write conflict range into one large write conflict range, and
> thus greatly reduce the size of the commit request that will be sent.
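>
> As a concrete (hedged) example of that collapse using the Python bindings:
> suppress each write's individual conflict range with the
> `next_write_no_write_conflict_range` transaction option, then add one range
> covering the whole exploded document.  The `('docs', doc_id, field)` layout
> here is an assumed illustration, not an actual CouchDB schema:
>
>     import fdb
>     fdb.api_version(600)
>     db = fdb.open()
>
>     @fdb.transactional
>     def write_exploded_doc(tr, doc_id, fields):
>         for field, value in fields.items():
>             # Skip the per-key write conflict range for this set()...
>             tr.options.set_next_write_no_write_conflict_range()
>             tr[fdb.tuple.pack(('docs', doc_id, field))] = value
>         # ...and cover the whole document with a single range instead.
>         doc_range = fdb.tuple.range(('docs', doc_id))
>         tr.add_write_conflict_range(doc_range.start, doc_range.stop)
>
>     write_exploded_doc(db, b'doc-123', {b'name': b'alice', b'age': b'42'})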
>
> So using this optimization, blindly inserting a document using the scheme
> proposed by the "exploded KV" approach, one would end up with a formula
> looking more like
>
> $$ 2*len(keys read) + 2*len(one key) + len(keys written) + len(values written) $$
>
> With index-on-write, we now need to add inserting (prefix/value/key, "") for
> each index entry into the equation:
>
> $$ 2*len(keys read) + 2*len(one key) + len(keys written) + len(values written)
>    + number of indexes * 3 * ( len(prefix) + len(one value) + len(one key) ) $$
>
> And there's no similar write conflict range optimization that can be done, as
> index writes will be non-sequential and spread widely across the keyspace.
>
> Index types that can cause a large number of entries to be written will have
> an even larger effect on total commit size.  I’ve been using the mental
> example of indexing all the elements of an array nested deep within a
> document.  However, if I’ve read CouchDB documentation correctly, all of the
> index types that can cause this explosion are grouped as part of Mango Search
> indexes, and thus out of scope for this proposal.  Mango JSON indexes seem to
> have a direct correlation that one (index,field) pair will turn into one
> Key-Value write to FDB.  Convenient!
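>
> For concreteness, a hedged sketch of what one such index write could look
> like with the Python bindings (the `('idx', index_name, value, doc_id)`
> layout is illustrative only, not CouchDB's actual schema, and it assumes the
> fdb setup from the earlier sketch):
>
>     @fdb.transactional
>     def write_index_entries(tr, indexes, doc_id, doc):
>         # One key-value write per (index, field) pair; the value is empty and
>         # the doc id is appended so entries for different documents don't collide.
>         for index in indexes:
>             field_value = doc.get(index['field'])
>             if field_value is not None:
>                 tr[fdb.tuple.pack(('idx', index['name'], field_value, doc_id))] = b''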
>
> Therefore, in order for a user to run into Mango JSON index-induced
> transaction size limits, they would have to manually add an astounding number
> of indexes.  I've failed to quickly find an answer as to whether there's an
> existing maximum number of Mango JSON indexes that can be added in a CouchDB
> instance, but if there is, this might not be a concern anyway.
>
> The worst case would be that each field of a document is indexed.  Assuming
> the "exploded KV" proposal as the representation, each key and value would be
> repeated 3 more times, and thus this would turn into
>
> $$ 2*len(keys read) + 2*len(one key) + 3*len(prefix) + 4*len(keys written)
>    + 4*len(values written) $$
>
> So if a maximum exploded KV size of 1MB is chosen, transaction commit
> requests would be expected to stay roughly under 4MB.
>
> However, if you're thinking into the future, and considering that whatever
> choice is taken for Mango JSON indexes should also be taken for Mango Search
> indexes, the above is definitely a concern.
>
> ### Indexing Write Penalty
>
> And, since we're now considering what would happen with thousands of indexes
> defined, let's briefly discuss what impact that would have on actually
> committing that transaction.
>
> With most distributed databases, each index added equates to an additional
> shard of data involved in the transaction, and thus likely an additional
> server and RPC.  This would lead to an increase in latency and concurrency
> control costs as more indexes are added.  This adds a strong cost to each
> additional index, one much larger than the actual cost of the disk IO
> involved in doing the actual index write.
>
> FoundationDB does all commits against a distributed write-ahead-log, and
> asynchronously to the commit, applies the data to each shard of data involved
> in the transaction.  This means that a commit of 5 adjacent keys and 5 keys
> randomly distributed across the keyspace will be processed in the same way by
> the transaction subsystem.  One would actually see better throughput at
> saturation when writing to randomly distributed keys, as the work of applying
> those mutations gets sharded across multiple storage servers.
>
> Creating many indexes still means more disk IO, but it is comparatively
> cheaper on FDB than on other distributed database architectures.
>
> > On Apr 11, 2019, at 5:42 AM, Garren Smith <gar...@apache.org> wrote:
> >
> >
> > I was chatting to Adam yesterday and I want to explore the index-on-write
> > indexing for Mango a bit more. I know there has been a bit of a discussion
> > that we should only use a background process to build mango indexes, but I
> > think that building indexes as documents are created/updated, combined
> > with background processing for existing documents, will be easier to
> > implement, especially in the beginning as we build the new fdb layer.
> >
> > Below is the process for building a new index:
> >
> > 1. When a user defines a new index on an existing database, save the index
> > definition and also save the sequence that the index was added at. The
> > index should also be marked that it is in a `building` phase so it won’t
> > be used yet to service queries. (I’ll come back to this later)
> > 2. Any write requests after that must read the new index definition and
> > update the index. When updating the new index, the writers should assume
> > that previous versions of the document have already been indexed.
> > 3. At the same time a background process will start reading sections of the
> > changes feed and building the index. This background process will keep
> > processing the changes read until it reaches the sequence number that the
> > index was saved at (see the sketch after this list). Once it reaches that
> > point, the index is up to date and will be marked as `active` and can be
> > used to service queries.
> > 4. There is some subtle behaviour around step 3 that is worth mentioning.
> > The background process will have the 5 second transaction limit, so it will
> > process smaller parts of the changes feed, which means that it won’t have
> > one consistent view of the changes feed throughout the index building
> > process. This can lead to a conflict situation, for example when the
> > background process transaction is adding a document to the index while at
> > the same time a write request has a transaction that is updating the same
> > document. There are two possible outcomes to this. If the background
> > process wins, the write request will get a conflict. At that point the
> > write request will try to process the document again, read the old values
> > for that document, remove them from the index and add the new values to the
> > index. If the write request wins, and the background process gets a
> > conflict, then the background process can try again; the document would
> > have been removed from its old position in the changes feed and moved to
> > a later position, so the background process won’t see the document and
> > will then move on to the next one.
> > 5. One other feature to add is an index progress tracker. We can do this
> > by using doc_count for the database, and then have a counter value that the
> > background workers can increment with the number of documents they updated
> > for each batch update. We would also have to update this counter on write
> > requests while the index is in building mode.
> > 6. Something we can also explore is splitting the building of the index
> > across multiple workers; we can use the `get_boundary_keys` [1] API call on
> > the changes feed to get the full list of the changes feed keys grouped by
> > partition boundaries and then split that by workers.
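> >
> > To make steps 3 and 4 concrete, here is a minimal, hypothetical Python
> > sketch of the background worker loop; the `changes` and `idx` subspaces,
> > the batch size, and the `get_doc`/`mark_index_active` helpers are all
> > assumptions rather than the actual CouchDB layer, and conflict retries are
> > simply left to the `@fdb.transactional` decorator:
> >
> >     import fdb
> >     fdb.api_version(600)
> >     db = fdb.open()
> >
> >     changes = fdb.Subspace(('changes',))   # assumed changes-feed subspace: seq -> doc_id
> >     idx = fdb.Subspace(('idx',))           # assumed index subspace
> >
> >     @fdb.transactional
> >     def index_batch(tr, index_def, start_seq, build_seq, limit=100):
> >         # Index a small slice of the changes feed so each transaction stays
> >         # well inside the 5 second / 10MB limits; return the last sequence seen.
> >         last_seq = None
> >         rng = tr.get_range(changes.pack((start_seq,)), changes.pack((build_seq,)), limit=limit)
> >         for kv in rng:
> >             (last_seq,) = changes.unpack(kv.key)
> >             doc = get_doc(tr, kv.value)    # assumed helper: read the document body
> >             value = doc.get(index_def['field'])
> >             if value is not None:
> >                 tr[idx.pack((index_def['name'], value, kv.value))] = b''
> >         return last_seq
> >
> >     def build_index(index_def, build_seq):
> >         seq = 0
> >         while True:
> >             last = index_batch(db, index_def, seq, build_seq)
> >             if last is None:               # caught up to the build sequence
> >                 break
> >             seq = last + 1
> >         mark_index_active(db, index_def)   # assumed helper: flip `building` -> `active`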
> >
> > Adding a building and active state to the indexes is a bit of a breaking
> > change, but I think it’s the right way to go. Currently with Mango, what can
> > happen is a user creates an index and then immediately does a query that
> > would use that index. Mango would then have to build the whole index before
> > responding to that request. In this new index-on-write process, Mango would
> > ignore the new index until it is active, which I think is the better way to
> > go on this.
> >
> > Finally, a big acknowledgment to Adam who is the major brains behind this
> > design.
> >
> > What do you think? I would like to hear any thoughts, questions, or
> > suggestions on this design.
> >
> > Cheers
> > Garren
> >
> > [1]
> > https://apple.github.io/foundationdb/api-python.html?highlight=boundary_keys#fdb.locality.fdb.locality.get_boundary_keys
> >
> > On Mon, Apr 8, 2019 at 3:50 PM Garren Smith <gar...@apache.org> wrote:
> >
> >>
> >>
> >> On Tue, Apr 2, 2019 at 3:14 AM Adam Kocoloski <kocol...@apache.org> wrote:
> >>
> >>> Hi Will, great comments, I have replies to a couple of them.
> >>>
> >>>> On Apr 1, 2019, at 5:21 AM, Will Holley <willhol...@gmail.com> wrote:
> >>>>
> >>>> 2. Does the ICU sort key have a bounded length? Mostly I'm wondering
> >>>> whether we can guarantee that the generated keys will fit within the
> >>>> maximum FDB key length or if there needs to be some thought as to the
> >>>> failure mode / workaround. As Adam mentioned, it seems fine to store an
> >>>> encoded key given Mango (currently) always fetches the associated document
> >>>> / fields from the primary index to filter on anyway. It might even be
> >>>> beneficial to have an additional layer of indirection and allow multiple
> >>>> docs to be associated with each row so that we can maintain compact keys.
> >>>
> >>> Interesting thought on that layer of indirection; it reminds me of an
> >>> optimization applied in the Record Layer’s text indexes. Would have to
> >>> compare whether the extra reads needed to maintain the index that way are
> >>> an acceptable tradeoff.
> >>>
> >>> Good point on the sort key sizes, I’ve not seen any way to place a
> >>> reliably safe upper bound on the size of one that might be generated. The
> >>> ICU folks have some hand-wavey guidance at
> >>> http://userguide.icu-project.org/collation/architecture#TOC-Sort-key-size,
> >>> but it seems like we might be able to dig a little deeper.
> >>>
> >>> I personally haven’t given much thought to a workaround where a
> >>> user-defined index key exceeds 10 KB. We’ll definitely need to handle that
> >>> failure mode safely even without the sort key complication — people try
> >>> crazy things :)
> >>>
> >>
> >> For the 10 KB error, I think we should just return an error. As a
> >> comparison, MongoDB has a 1024 Byte limit
> >> https://docs.mongodb.com/manual/reference/limits/#Index-Key-Limit
> >>
> >>
> >>>> 3. I don't immediately see how you clear previous values from the index
> >>>> when a doc is updated, but I could easily be missing something obvious :)
> >>>
> >>> Ah yeah, this part wasn’t explicit, was it?
> >>>
> >>> I think the idea is that these are simple indexes on specific fields of a
> >>> document, and we have a data model where those fields are already stored
> >>> as their own keys in FDB, so there’s no need (in the case of Mango) to
> >>> maintain a separate docid -> {viewid, [keys]} mapping like we do today in
> >>> each view group. Rather, the flow would go something like
> >>>
> >>> 1) Check which fields are supposed to be indexed
> >>> 2) Retrieve values for those fields in the ?DOCUMENTS space for the
> >>> parent revision
> >>> 3) Compare the parent values with the ones supplied in this transaction;
> >>> if any indexed values change, clear the old ones and insert the new ones
> >>>
> >>> with some additional caveats around checking that the supplied edit is
> >>> actually going to be the winning (and therefore indexed) version after the
> >>> commit succeeds.
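> >>>
> >>> As a hypothetical sketch of that flow in Python (the
> >>> `('idx', name, value, doc_id)` layout is illustrative only, and
> >>> `old_doc`/`new_doc` stand in for the parent revision’s fields and the
> >>> supplied edit; assumes `import fdb` and the usual client setup):
> >>>
> >>>     @fdb.transactional
> >>>     def update_mango_index(tr, index_def, doc_id, old_doc, new_doc):
> >>>         field = index_def['field']
> >>>         old_val, new_val = old_doc.get(field), new_doc.get(field)
> >>>         if old_val == new_val:
> >>>             return  # nothing to do if the indexed value is unchanged
> >>>         if old_val is not None:
> >>>             # clear the entry for the parent revision's value
> >>>             del tr[fdb.tuple.pack(('idx', index_def['name'], old_val, doc_id))]
> >>>         if new_val is not None:
> >>>             # insert the entry for the new winning value
> >>>             tr[fdb.tuple.pack(('idx', index_def['name'], new_val, doc_id))] = b''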
> >>>
> >>>> 4. Regarding "Index on write" behaviour, is there something in the
> >>>> existing design (Mango overlaying mrview / lucene) that would prevent
> >>>> this? I can see some benefit for certain workloads (and headaches for
> >>>> others) but I don't see that it's necessarily coupled to the Mango design
> >>>> given background indexing of new/changed indexes needs to be supported
> >>>> anyway.
> >>>
> >>> I’m not sure I understand your question. In my mind the reason “index on
> >>> write" is more applicable for Mango JSON than for generalized views is
> >>> because in the view case batching is currently quite important to achieve
> >>> good throughput to the JS system. You’re of course correct that we need to
> >>> be able to re-generate Mango JSON indexes in the background as well.
> >>>
> >>> Adam
> >>>
> >>>
> >>>
>
>
