Hi Alex,

Thanks for the really detailed explanation of transaction size calculation. That is really helpful. Adding a limit to the number of indexes is a good idea.
Cheers
Garren

On Fri, Apr 12, 2019 at 3:58 AM Alex Miller <alexmil...@apple.com.invalid> wrote:
> There was some discussion on the FDB forums previously about how to do
> index backfill, in case you hadn’t seen it before; it’s already reasonably
> similar to what you’ve proposed:
>
> https://forums.foundationdb.org/t/best-way-to-add-an-index-on-already-existing-data/97/2
>
> There are also two related topics that I'd like to provide some context and
> flavor on:
> 1. The 10MB transaction size limit.
> 2. The performance/latency/throughput implications of writing to multiple
> disjoint keys versus multiple nearby keys.
>
> ### Transaction Size Limit
>
> There was some previous concern surrounding the FDB 10MB transaction limit,
> what it means for CouchDB, and how to communicate the effective maximum
> document size to users. Index-on-write will impact this as well, as index
> writes will be counted against the 10MB limit.
>
> As I don't believe I've seen a precise definition yet of the maximum size
> of a document allowed in CouchDB-on-FDB, let us assume it's defined as
> "whatever FDB will accept". It would then be possible that a document can
> be inserted that just barely fits within the size limits, and then indexes
> can be defined that apply to the document. One is now left with a document
> that exists, and is valid in the database, but couldn't be read and
> re-inserted into the database, because it would exceed the size limit. My
> largest concern would be that this would have bad implications for CouchDB
> replication.
>
> Although the transaction limit is 10MB, it's generally advisable to aim to
> stay under 1MB for the cluster to perform well. I've generally seen latency
> start to get strange for all clients when one client starts exceeding ~5MB.
> If one requires the total size of a document to stay less than 1MB, then
> this comfortably leaves room for indexes in the commit request, and the
> above case would be relatively unlikely.
>
> Exactly how unlikely involves dissecting exactly how the transaction size
> limit is calculated.
>
> The 10MB limit applies to the size of the commit request, and not to the
> sum of the key-value pair sizes, as one might expect. The largest
> difference between the two meanings is that read and write conflict ranges
> count against the 10MB limit. The rough formula for calculating commit
> size is:
>
> $$ 2*len(keys read) + 3*len(keys written) + len(values written) $$
>
> Each `get()` adds a read conflict range of the read key and the
> sequentially next key. Each `set()` adds a write conflict range the same
> way, in addition to having to send the actual key and value. This is the
> source of the `2*` and `3*` multipliers.
>
> `getRange()` adds a read conflict range that exactly reflects the requested
> range. However, there is no `setRange()` to automatically optimize the
> write conflict ranges added for inserting sequential values. If one is
> inserting keys that, given the data model, represent one collective,
> sequential unit, such as a full document exploded across multiple key-value
> pairs, it's advisable to use `addWriteConflictRange(firstKey, lastKey+1)`
> to collapse each individual write conflict range into one large write
> conflict range, and thus greatly reduce the size of the commit request that
> will be sent.
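[Editor's note: a back-of-envelope estimator for the formula above, as a plain-Python sketch. The `commit_size` helper and the byte counts are hypothetical illustrations of the rough formula, not an exact accounting of FDB's wire format.]

```python
def commit_size(keys_read, keys_written, values_written, collapsed_range=None):
    """Rough commit-request size per the formula above:
    2*len(keys read) + 3*len(keys written) + len(values written).

    If collapsed_range=(first_key, last_key) is given, model the
    addWriteConflictRange() optimization: per-write conflict ranges are
    replaced by one range over the whole unit (two range endpoints),
    leaving 1x for each key actually sent in the commit.
    """
    read_bytes = 2 * sum(len(k) for k in keys_read)
    value_bytes = sum(len(v) for v in values_written)
    if collapsed_range is None:
        write_bytes = 3 * sum(len(k) for k in keys_written)
    else:
        first, last = collapsed_range
        write_bytes = sum(len(k) for k in keys_written) + len(first) + len(last)
    return read_bytes + write_bytes + value_bytes

# A document exploded into 100 sequential 18-byte keys with 100-byte values:
keys = [b"doc/0001/field%04d" % i for i in range(100)]
vals = [b"x" * 100] * 100
naive = commit_size([], keys, vals)
collapsed = commit_size([], keys, vals, collapsed_range=(keys[0], keys[-1]))
```

With these (made-up) sizes, collapsing the write conflict ranges shrinks the modeled commit request from 15400 bytes to 11836 bytes, i.e. most of the per-key conflict-range overhead disappears.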
>
> So using this optimization, blindly inserting a document using the scheme
> proposed by the "exploded KV" approach, one would end up with a formula
> looking more like
>
> $$ 2*len(keys read) + 2*len(one key) + len(keys written) + len(values written) $$
>
> With index-on-write, we now need to add inserting (prefix/value/key, "")
> for each index entry into the equation:
>
> $$ 2*len(keys read) + 2*len(one key) + len(keys written) + len(values written) +
> number of indexes * 3 * ( len(prefix) + len(one value) + len(one key) ) $$
>
> And there's no similar write conflict range optimization that can be done,
> as index writes will be non-sequential and spread widely across the
> keyspace.
>
> Index types that can cause a large number of entries to be written will
> have an even larger effect on total commit size. I’ve been using the mental
> example of indexing all the elements of an array nested deep within a
> document. However, if I’ve read the CouchDB documentation correctly, all of
> the index types that can cause this explosion are grouped as part of Mango
> Search indexes, and thus out of scope for this proposal. Mango JSON indexes
> seem to have a direct correlation in which one (index, field) pair turns
> into one key-value write to FDB. Convenient!
>
> Therefore, in order for a user to run into Mango JSON index-induced
> transaction size limits, they would have to manually add an astounding
> number of indexes. I've failed to quickly find an answer as to whether
> there's an existing maximum number of Mango JSON indexes that can be added
> in a CouchDB instance, but if there is, this might not be a concern anyway.
>
> The worst case would be that each field of a document is indexed.
> Assuming the "exploded KV" proposal as the representation, each key and
> value would be repeated 3 more times, and thus this would turn into
>
> $$ 2*len(keys read) + 2*len(one key) + 3*len(prefix) + 4*len(keys written) +
> 4*len(values written) $$
>
> So if a maximum exploded KV size of 1MB is chosen, transaction commit
> requests would be expected to stay roughly under 4MB.
>
> However, if you're thinking into the future, and considering that whatever
> choice is taken for Mango JSON indexes should also be taken for Mango
> Search indexes, the above is definitely a concern.
>
> ### Indexing Write Penalty
>
> And, as we're now discussing what would happen with thousands of indexes
> defined, let's briefly discuss what impact that would have on actually
> committing that transaction.
>
> With most distributed databases, each index added equates to an additional
> shard of data involved in the transaction, and thus likely an additional
> server and RPC. This leads to an increase in latency and concurrency
> control costs as more indexes are added, adding a strong cost to each
> additional index, one much larger than the actual cost of the disk IO
> involved in doing the actual index write.
>
> FoundationDB does all commits against a distributed write-ahead log, and
> asynchronously to the commit, applies the data to each shard of data
> involved in the transaction. This means that a commit of 5 adjacent keys
> and 5 keys randomly distributed across the keyspace will be processed in
> the same way by the transaction subsystem. One would actually see better
> throughput at saturation when writing to randomly distributed keys, as the
> work of applying those mutations gets sharded across multiple storage
> servers.
>
> Creating many indexes still means more disk IO, but is comparatively
> cheaper on FDB than in other distributed database architectures.
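[Editor's note: a quick sanity check of the "roughly under 4MB" claim above. The split between key bytes and value bytes, the conflict-range endpoint size, and the index-prefix total are all made-up illustrative numbers; only the formula itself comes from the email.]

```python
MB = 1024 * 1024

# Hypothetical split of a 1MB exploded document into key and value bytes.
doc_keys_bytes = 256 * 1024
doc_values_bytes = 768 * 1024
one_key = 64              # endpoints of the collapsed write conflict range
prefix_bytes = 100 * 32   # e.g. 100 indexed fields with 32-byte prefixes

# Worst-case formula: every field indexed, keys/values each sent 4 times.
worst_case = (2 * one_key
              + 3 * prefix_bytes
              + 4 * doc_keys_bytes
              + 4 * doc_values_bytes)   # ~4.0 MB for a 1MB document
```

The dominant terms are the `4*` repetitions of the document's keys and values, so a 1MB exploded document lands just over 4MB of commit request, consistent with the estimate in the email.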
>
> > On Apr 11, 2019, at 5:42 AM, Garren Smith <gar...@apache.org> wrote:
> >
> > I was chatting to Adam yesterday and I want to explore the index-on-write
> > indexing for Mango a bit more. I know there has been a bit of a
> > discussion that we should only use a background process to build Mango
> > indexes, but I think that building indexes as documents are
> > created/updated, combined with background processing for existing
> > documents, will be easier to implement, especially in the beginning as we
> > build the new fdb layer.
> >
> > Below is the process for building a new index:
> >
> > 1. When a user defines a new index on an existing database, save the
> > index definition and also save the sequence that the index was added at.
> > The index should also be marked as being in a `building` phase so it
> > won’t be used yet to service queries. (I’ll come back to this later.)
> > 2. Any write requests after that must read the new index definition and
> > update the index. When updating the new index, the writers should assume
> > that previous versions of the document have already been indexed.
> > 3. At the same time, a background process will start reading sections of
> > the changes feed and building the index. This background process will
> > keep processing the changes read until it reaches the sequence number
> > that the index was saved at. Once it reaches that point, the index is up
> > to date and will be marked as `active` and can be used to service
> > queries.
> > 4. There is some subtle behaviour around step 3 that is worth mentioning.
> > The background process will have the 5 second transaction limit, so it
> > will process smaller parts of the changes feed. This means that it won’t
> > have one consistent view of the changes feed throughout the index
> > building process. This will lead to a conflict situation.
> > For example, the background process transaction may be adding a document
> > to the index while at the same time a write request has a transaction
> > that is updating the same document. There are two possible outcomes to
> > this: if the background process wins, the write request will get a
> > conflict. At that point the write request will try to process the
> > document again, read the old values for that document, remove them from
> > the index, and add the new values to the index. If the write request
> > wins, and the background process gets a conflict, then the background
> > process can try again; the document would have been removed from its old
> > position in the changes feed and moved to a later position, so the
> > background process won’t see the document and will move on to the next
> > one.
> > 5. One other feature to add is an index progress tracker. We can do this
> > by using doc_count for the database, and then have a counter value that
> > the background workers can increment with the number of documents they
> > updated for each batch update. We would also have to update this counter
> > on write requests while the index is in building mode.
> > 6. Something we can also explore is splitting the building of the index
> > across multiple workers. We can use the `get_boundary_keys` [1] API call
> > on the changes feed to get the full list of the changes feed keys grouped
> > by partition boundaries and then split that between workers.
> >
> > Adding `building` and `active` states to the indexes is a bit of a
> > breaking change, but I think it's the right way to go. Currently with
> > Mango, what can happen is that a user creates an index and then
> > immediately does a query that would use that index. Mango would then have
> > to build the whole index before responding to that request. In this new
> > index-on-write process, Mango would ignore the new index until it is
> > active, which I think is the better way to go on this.
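[Editor's note: steps 3 and 4 of the build process can be sketched as a small in-memory loop. This is an illustrative stand-in, not real FDB code; `changes_feed`, `batch_size`, and the dict-based index are all hypothetical.]

```python
def build_index(changes_feed, created_at_seq, batch_size=100):
    """Replay the changes feed in small batches (modeling the 5 second
    transaction limit) until passing the sequence the index definition
    was saved at, then flip the index from 'building' to 'active'.
    changes_feed is a list of (seq, doc_id, value) tuples sorted by seq."""
    index = {}
    pos = 0
    while pos < len(changes_feed):
        # Each batch models one bounded FDB transaction.
        for seq, doc_id, value in changes_feed[pos:pos + batch_size]:
            if seq > created_at_seq:
                # Writers handle everything after the index was saved
                # (step 2), so the backfill is complete here.
                return index, "active"
            index[doc_id] = value
        pos += batch_size
    return index, "active"

feed = [(1, "a", 10), (2, "b", 20), (3, "c", 30), (4, "d", 40)]
idx, state = build_index(feed, created_at_seq=3, batch_size=2)
```

Note the stopping condition: the backfill covers sequences up to and including the one the index was saved at; anything later is already covered by index-on-write, which is what makes the handoff safe.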
> >
> > Finally, a big acknowledgment to Adam, who is the major brains behind
> > this design.
> >
> > What do you think? I would like to hear any thoughts, questions or
> > suggestions on this design.
> >
> > Cheers
> > Garren
> >
> > [1]
> > https://apple.github.io/foundationdb/api-python.html?highlight=boundary_keys#fdb.locality.fdb.locality.get_boundary_keys
> >
> > On Mon, Apr 8, 2019 at 3:50 PM Garren Smith <gar...@apache.org> wrote:
> >
> >> On Tue, Apr 2, 2019 at 3:14 AM Adam Kocoloski <kocol...@apache.org> wrote:
> >>
> >>> Hi Will, great comments, I have replies to a couple of them.
> >>>
> >>>> On Apr 1, 2019, at 5:21 AM, Will Holley <willhol...@gmail.com> wrote:
> >>>>
> >>>> 2. Does the ICU sort key have a bounded length? Mostly I'm wondering
> >>>> whether we can guarantee that the generated keys will fit within the
> >>>> maximum FDB key length, or if there needs to be some thought as to the
> >>>> failure mode / workaround. As Adam mentioned, it seems fine to store
> >>>> an encoded key given Mango (currently) always fetches the associated
> >>>> document / fields from the primary index to filter on anyway. It might
> >>>> even be beneficial to have an additional layer of indirection and
> >>>> allow multiple docs to be associated with each row so that we can
> >>>> maintain compact keys.
> >>>
> >>> Interesting thought on that layer of indirection; it reminds me of an
> >>> optimization applied in the Record Layer’s text indexes. Would have to
> >>> compare whether the extra reads needed to maintain the index that way
> >>> are an acceptable tradeoff.
> >>>
> >>> Good point on the sort key sizes; I’ve not seen any way to place a
> >>> reliably safe upper bound on the size of one that might be generated.
> >>> The ICU folks have some hand-wavey guidance at
> >>> http://userguide.icu-project.org/collation/architecture#TOC-Sort-key-size,
> >>> but it seems like we might be able to dig a little deeper.
> >>>
> >>> I personally haven’t given much thought to a workaround where a
> >>> user-defined index key exceeds 10 KB. We’ll definitely need to handle
> >>> that failure mode safely even without the sort key complication —
> >>> people try crazy things :)
> >>
> >> For the 10 KB case, I think we should just return an error. As a
> >> comparison, MongoDB has a 1024-byte limit:
> >> https://docs.mongodb.com/manual/reference/limits/#Index-Key-Limit
> >>
> >>>> 3. I don't immediately see how you clear previous values from the
> >>>> index when a doc is updated, but I could easily be missing something
> >>>> obvious :)
> >>>
> >>> Ah yeah, this part wasn’t explicit, was it?
> >>>
> >>> I think the idea is that these are simple indexes on specific fields of
> >>> a document, and we have a data model where those fields are already
> >>> stored as their own keys in FDB, so there’s no need (in the case of
> >>> Mango) to maintain a separate docid -> {viewid, [keys]} mapping like we
> >>> do today in each view group. Rather, the flow would go something like:
> >>>
> >>> 1) Check which fields are supposed to be indexed
> >>> 2) Retrieve values for those fields in the ?DOCUMENTS space for the
> >>> parent revision
> >>> 3) Compare the parent values with the ones supplied in this
> >>> transaction; if any indexed values change, clear the old ones and
> >>> insert the new ones
> >>>
> >>> with some additional caveats around checking that the supplied edit is
> >>> actually going to be the winning (and therefore indexed) version after
> >>> the commit succeeds.
> >>>
> >>>> 4. Regarding "index on write" behaviour, is there something in the
> >>>> existing design (Mango overlaying mrview / lucene) that would prevent
> >>>> this? I can see some benefit for certain workloads (and headaches for
> >>>> others), but I don't see that it's necessarily coupled to the Mango
> >>>> design, given background indexing of new/changed indexes needs to be
> >>>> supported anyway.
> >>>
> >>> I’m not sure I understand your question. In my mind, the reason “index
> >>> on write" is more applicable for Mango JSON than for generalized views
> >>> is that in the view case, batching is currently quite important to
> >>> achieve good throughput to the JS system. You’re of course correct that
> >>> we need to be able to re-generate Mango JSON indexes in the background
> >>> as well.
> >>>
> >>> Adam
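[Editor's note: Adam's three-step flow for clearing previous index values can be sketched as a simple field diff. This is an illustrative sketch; `old_doc`/`new_doc` are plain dicts standing in for the parent-revision and newly supplied field values, and the returned clear/set lists stand in for FDB clears and sets.]

```python
def index_mutations(indexed_fields, old_doc, new_doc):
    """For each indexed field, compare the parent revision's value with
    the one supplied in this transaction; emit a clear for the old index
    entry and a set for the new one only when the value actually changed."""
    clears, sets = [], []
    for field in indexed_fields:
        old = old_doc.get(field)
        new = new_doc.get(field)
        if old == new:
            continue  # unchanged field: no index write needed
        if old is not None:
            clears.append((field, old))  # step 3: clear the old entry
        if new is not None:
            sets.append((field, new))    # step 3: insert the new entry
    return clears, sets

old = {"age": 30, "city": "Cape Town"}
new = {"age": 31, "city": "Cape Town"}
clears, sets = index_mutations(["age", "city"], old, new)
```

Only the changed field produces index traffic, which is what keeps the per-update cost proportional to the number of indexed fields that actually changed rather than the number of indexes defined.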