I've created an RFC for this: https://github.com/apache/couchdb-documentation/pull/407
Thanks to everyone who participated in this discussion.

Cheers
Garren

On Mon, Apr 15, 2019 at 11:58 AM Garren Smith <gar...@apache.org> wrote:

> Hi Alex,
>
> Thanks for the really detailed explanation of transaction size calculation. That is really helpful. Adding a limit to the number of indexes is a good idea.
>
> Cheers
> Garren
>
> On Fri, Apr 12, 2019 at 3:58 AM Alex Miller <alexmil...@apple.com.invalid> wrote:
>
>> There was some discussion on the FDB forums previously about how to do index backfill. If you haven't seen it before, it's already reasonably similar to what you've proposed:
>>
>> https://forums.foundationdb.org/t/best-way-to-add-an-index-on-already-existing-data/97/2
>>
>> There are also two related topics that I'd like to provide some context and flavor on:
>> 1. The 10MB transaction size limit.
>> 2. The performance/latency/throughput implications of writing to multiple disjoint keys versus multiple nearby keys.
>>
>> ### Transaction Size Limit
>>
>> There was some previous concern surrounding the FDB 10MB transaction limit, what it means for CouchDB, and how to communicate the effective maximum document size to users. Index-on-write will impact this as well, as index writes will be counted against the 10MB limit.
>>
>> As I don't believe I've seen a precise definition yet of the maximum size of a document allowed in CouchDB-on-FDB, let us assume it's defined as "whatever FDB will accept". It would then be possible for a document to be inserted that just barely fits within the size limits, and for indexes to then be defined that apply to the document. One is now left with a document that exists, and is valid in the database, but couldn't be read and re-inserted into the database, because it would exceed the size limit. My largest concern would be that this would have bad implications for CouchDB replication.
>>
>> Although the transaction limit is 10MB, it's generally advisable to aim to stay under 1MB for the cluster to perform well. I've generally seen latency start to get strange for all clients when one client starts exceeding ~5MB. If one requires the total size of a document to stay under 1MB, this comfortably leaves room for indexes in the commit request, and the above case would be relatively unlikely.
>>
>> Exactly how unlikely involves dissecting exactly how the transaction size limit is calculated.
>>
>> The 10MB limit applies to the size of the commit request, and not to the sum of the key-value pair sizes, as one might expect. The largest difference between the two meanings is that read and write conflict ranges count against the 10MB limit. The rough formula for calculating commit size is:
>>
>> $$ 2*len(keys read) + 3*len(keys written) + len(values written) $$
>>
>> Each `get()` adds a read conflict range covering the read key and the sequentially next key. Each `set()` adds a write conflict range the same way, in addition to having to send the actual key and value. This is the source of the `2*` and `3*` multipliers.
>>
>> `getRange()` adds a read conflict range that exactly reflects the requested range. However, there is no `setRange()` to automatically optimize the write conflict ranges added when inserting sequential values. If one is inserting keys that, given the data model, represent one collective, sequential unit, such as a full document exploded across multiple key-value pairs, it's advisable to use `addWriteConflictRange(firstKey, lastKey+1)` to collapse the individual write conflict ranges into one large write conflict range, and thus greatly reduce the size of the commit request that will be sent.
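A minimal sketch of that optimization, using the fdb Python bindings with a hypothetical `docs` subspace and byte-string values. Note that for the collapse to actually shrink the request, the per-key conflict ranges also need to be suppressed, which the bindings expose as a transaction option:

```python
import fdb

fdb.api_version(610)

# Hypothetical subspace: documents exploded into (doc_id, field) -> value.
docs = fdb.Subspace(('docs',))

@fdb.transactional
def insert_exploded_doc(tr, doc_id, fields):
    # fields: dict mapping field name (str) -> encoded value (bytes).
    keys = sorted(docs.pack((doc_id, field)) for field in fields)
    for field, value in fields.items():
        # Skip the per-key write conflict range that set() would add;
        # otherwise each write still contributes ~3x its key size.
        tr.options.set_next_write_no_write_conflict_range()
        tr[docs.pack((doc_id, field))] = value
    # One conflict range covering the whole document: [first, last + b'\x00').
    tr.add_write_conflict_range(keys[0], keys[-1] + b'\x00')
```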
>> So using this optimization, blindly inserting a document using the scheme proposed by the "exploded KV" approach, one would end up with a formula looking more like:
>>
>> $$ 2*len(keys read) + 2*len(one key) + len(keys written) + len(values written) $$
>>
>> With index-on-write, we now need to add the insertion of (prefix/value/key, "") for each index entry into the equation:
>>
>> $$ 2*len(keys read) + 2*len(one key) + len(keys written) + len(values written) + number of indexes * 3 * ( len(prefix) + len(one value) + len(one key) ) $$
>>
>> And there's no similar write conflict range optimization that can be done here, as index writes will be non-sequential and spread widely across the keyspace.
>>
>> Index types that can cause a large number of entries to be written will have an even larger effect on total commit size. I've been using the mental example of indexing all the elements of an array nested deep within a document. However, if I've read the CouchDB documentation correctly, all of the index types that can cause this explosion are grouped as part of Mango Search indexes, and are thus out of scope for this proposal. Mango JSON indexes seem to have a direct correlation such that one (index, field) pair will turn into one key-value write to FDB. Convenient!
>>
>> Therefore, in order for a user to run into Mango JSON index-induced transaction size limits, they would have to manually add an astounding number of indexes. I've failed to quickly find an answer as to whether there's an existing maximum number of Mango JSON indexes that can be added in a CouchDB instance, but if there is, this might not be a concern anyway.
>>
>> The worst case would be that each field of a document is indexed. Assuming the "exploded KV" proposal as the representation, each key and value would be repeated 3 more times, and thus this would turn into:
>>
>> $$ 2*len(keys read) + 2*len(one key) + 3*len(prefix) + 4*len(keys written) + 4*len(values written) $$
>>
>> So if a maximum exploded-KV document size of 1MB is chosen, transaction commit requests would be expected to stay roughly under 4MB.
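A back-of-the-envelope check of that estimate, with assumed sizes: a fresh insert (so essentially nothing read back), a 1MB document split evenly between key bytes and value bytes, and every field indexed:

```python
keys_read    = 0        # fresh insert: (almost) nothing read back
one_key      = 100      # boundary keys of the collapsed conflict range
prefix       = 300      # index subspace prefixes, assumed small
keys_written = 500_000  # half of the 1 MB exploded document
vals_written = 500_000  # the other half

commit_size = (2 * keys_read
               + 2 * one_key
               + 3 * prefix
               + 4 * keys_written   # 1 doc write + 3 copies in index entries
               + 4 * vals_written)  # likewise for values
print(commit_size)  # 4001100 bytes, i.e. roughly 4 MB, under the 10 MB limit
```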
>> However, if you're thinking ahead, and considering that whatever choice is taken for Mango JSON indexes should also be taken for Mango Search indexes, the above is definitely a concern.
>>
>> ### Indexing Write Penalty
>>
>> And, as we're now discussing what would happen with thousands of indexes defined, let's briefly discuss what impact that would have on actually committing the transaction.
>>
>> With most distributed databases, each index added equates to an additional shard of data involved in the transaction, and thus likely an additional server and RPC. This would lead to an increase in latency and concurrency control costs as more indexes are added. It imposes a strong cost on each additional index, one much larger than the actual cost of the disk IO involved in doing the index write itself.
>>
>> FoundationDB does all commits against a distributed write-ahead log and, asynchronously to the commit, applies the data to each shard of data involved in the transaction. This means that a commit of 5 adjacent keys and a commit of 5 keys randomly distributed across the keyspace will be processed in the same way by the transaction subsystem. One would actually see better throughput at saturation when writing to randomly distributed keys, as the work of applying those mutations gets sharded across multiple storage servers.
>>
>> Creating many indexes still means more disk IO, but it is comparatively cheaper on FDB than in other distributed database architectures.
>>
>> > On Apr 11, 2019, at 5:42 AM, Garren Smith <gar...@apache.org> wrote:
>> >
>> > I was chatting with Adam yesterday and I want to explore the index-on-write indexing for Mango a bit more. I know there has been some discussion that we should only use a background process to build Mango indexes, but I think that building indexes as documents are created/updated, combined with background processing for existing documents, will be easier to implement, especially in the beginning as we build the new FDB layer.
>> >
>> > Below is the process for building a new index:
>> >
>> > 1. When a user defines a new index on an existing database, save the index definition and also save the sequence at which the index was added. The index should also be marked as being in a `building` phase, so it won't yet be used to service queries. (I'll come back to this later.)
>> > 2. Any write requests after that must read the new index definition and update the index. When updating the new index, the writers should assume that previous versions of the document have already been indexed.
>> > 3. At the same time, a background process will start reading sections of the changes feed and building the index. This background process will keep processing the changes feed until it reaches the sequence number at which the index was saved. Once it reaches that point, the index is up to date; it will be marked as `active` and can be used to service queries. (A sketch of this loop follows the list below.)
>> > 4. There is some subtle behaviour around step 3 that is worth mentioning. The background process is subject to the 5-second transaction limit, so it will process smaller parts of the changes feed, which means it won't have one consistent view of the changes feed throughout the index-building process. This can lead to a conflict situation, for example when a background process transaction is adding a document to the index while, at the same time, a write request transaction is updating the same document. There are two possible outcomes. If the background process wins, the write request will get a conflict; at that point the write request will try to process the document again, read the old values for that document, remove them from the index, and add the new values to the index. If the write request wins and the background process gets a conflict, the background process can simply try again: the document will have been removed from its old position in the changes feed and moved to a later position, so the background process won't see the document and will move on to the next one.
>> > 5. One other feature to add is an index progress tracker. We can do this by using doc_count for the database, and then having a counter value that the background workers increment with the number of documents they updated in each batch. We would also have to update this counter on write requests while the index is in building mode.
>> > 6. Something we can also explore is splitting the building of the index across multiple workers. We can use the `get_boundary_keys` [1] API call on the changes feed to get the full list of changes feed keys grouped by partition boundaries, and then split those across workers.
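A rough sketch of the background build loop from step 3, again using the fdb Python bindings. The changes-feed layout and the `add_doc_to_index` / `mark_index_active` helpers are assumptions for illustration, not part of the actual design:

```python
import fdb

fdb.api_version(610)

# Hypothetical layout; the real CouchDB-on-FDB layer will differ.
CHANGES = fdb.Subspace(('changes',))   # (seq,) -> doc_id

@fdb.transactional
def index_batch(tr, index, after_seq, upto_seq, batch_size=100):
    # Index one small slice of the changes feed, so each transaction stays
    # well inside the 5-second limit. Returns the last sequence processed,
    # or None once the slice (after_seq, upto_seq] is drained.
    last_seq = None
    begin = fdb.KeySelector.first_greater_than(CHANGES.pack((after_seq,)))
    end = CHANGES.pack((upto_seq,)) + b'\xff'   # inclusive upper bound
    for kv in tr.get_range(begin, end, limit=batch_size):
        add_doc_to_index(tr, index, kv.value)   # assumed helper
        last_seq = CHANGES.unpack(kv.key)[0]
    return last_seq

def build_index(db, index, created_at_seq):
    # If a concurrent writer wins a race on a document in this slice, the
    # batch transaction conflicts and @fdb.transactional simply retries it;
    # the document will have moved past created_at_seq in the changes feed,
    # where the write path (step 2) has already indexed it.
    seq = 0
    while True:
        last = index_batch(db, index, seq, created_at_seq)
        if last is None:
            break
        seq = last
    mark_index_active(db, index)                # assumed helper
```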
>> > Adding a building and an active state to the indexes is a bit of a breaking change, but I think it's the right way to go. Currently with Mango, what can happen is that a user creates an index and then immediately issues a query that would use that index; Mango then has to build the whole index before responding to that request. In this new index-on-write process, Mango would ignore the new index until it is active, which I think is the better way to go.
>> >
>> > Finally, a big acknowledgment to Adam, who is the major brains behind this design.
>> >
>> > What do you think? I would like to hear any thoughts, questions, or suggestions on this design.
>> >
>> > Cheers
>> > Garren
>> >
>> > [1] https://apple.github.io/foundationdb/api-python.html?highlight=boundary_keys#fdb.locality.fdb.locality.get_boundary_keys
>> >
>> > On Mon, Apr 8, 2019 at 3:50 PM Garren Smith <gar...@apache.org> wrote:
>> >
>> >> On Tue, Apr 2, 2019 at 3:14 AM Adam Kocoloski <kocol...@apache.org> wrote:
>> >>
>> >>> Hi Will, great comments, I have replies to a couple of them.
>> >>>
>> >>>> On Apr 1, 2019, at 5:21 AM, Will Holley <willhol...@gmail.com> wrote:
>> >>>>
>> >>>> 2. Does the ICU sort key have a bounded length? Mostly I'm wondering whether we can guarantee that the generated keys will fit within the maximum FDB key length, or if there needs to be some thought as to the failure mode / workaround. As Adam mentioned, it seems fine to store an encoded key, given Mango (currently) always fetches the associated document/fields from the primary index to filter on anyway. It might even be beneficial to have an additional layer of indirection and allow multiple docs to be associated with each row so that we can maintain compact keys.
>> >>>
>> >>> Interesting thought on that layer of indirection; it reminds me of an optimization applied in the Record Layer's text indexes. We would have to compare whether the extra reads needed to maintain the index that way are an acceptable tradeoff.
>> >>>
>> >>> Good point on the sort key sizes; I've not seen any way to place a reliably safe upper bound on the size of one that might be generated. The ICU folks have some hand-wavy guidance at http://userguide.icu-project.org/collation/architecture#TOC-Sort-key-size, but it seems like we might be able to dig a little deeper.
>> >>>
>> >>> I personally haven't given much thought to a workaround for when a user-defined index key exceeds 10 KB. We'll definitely need to handle that failure mode safely even without the sort key complication — people try crazy things :)
>> >>
>> >> For the 10 KB case, I think we should just return an error. As a comparison, MongoDB has a 1024-byte limit: https://docs.mongodb.com/manual/reference/limits/#Index-Key-Limit
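A minimal sketch of that up-front check, assuming a tuple-encoded index key; the function and subspace names are illustrative:

```python
import fdb

fdb.api_version(610)

FDB_MAX_KEY_SIZE = 10_000   # FDB's hard per-key limit, in bytes

def pack_index_key(index_subspace, sort_key, doc_id):
    # Pack a Mango index entry and reject oversized keys before writing,
    # so the document update fails with a clean CouchDB-level error rather
    # than a commit error deep inside the transaction.
    key = index_subspace.pack((sort_key, doc_id))
    if len(key) > FDB_MAX_KEY_SIZE:
        raise ValueError(
            f'index key is {len(key)} bytes, over the '
            f'{FDB_MAX_KEY_SIZE} byte FDB limit')
    return key
```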
>> >>>> 3. I don't immediately see how you clear previous values from the index when a doc is updated, but I could easily be missing something obvious :)
>> >>>
>> >>> Ah yeah, this part wasn't explicit, was it?
>> >>>
>> >>> I think the idea is that these are simple indexes on specific fields of a document, and we have a data model where those fields are already stored as their own keys in FDB, so there's no need (in the case of Mango) to maintain a separate docid -> {viewid, [keys]} mapping like we do today in each view group. Rather, the flow would go something like:
>> >>>
>> >>> 1) Check which fields are supposed to be indexed
>> >>> 2) Retrieve the values for those fields in the ?DOCUMENTS space for the parent revision
>> >>> 3) Compare the parent values with the ones supplied in this transaction; if any indexed values changed, clear the old ones and insert the new ones
>> >>>
>> >>> with some additional caveats around checking that the supplied edit is actually going to be the winning (and therefore indexed) version after the commit succeeds. (A sketch of this flow appears at the end of the thread.)
>> >>>
>> >>>> 4. Regarding "index on write" behaviour, is there something in the existing design (Mango overlaying mrview / Lucene) that would prevent this? I can see some benefit for certain workloads (and headaches for others), but I don't see that it's necessarily coupled to the Mango design, given that background indexing of new/changed indexes needs to be supported anyway.
>> >>>
>> >>> I'm not sure I understand your question. In my mind, the reason "index on write" is more applicable to Mango JSON than to generalized views is that in the view case, batching is currently quite important for achieving good throughput to the JS system. You're of course correct that we need to be able to re-generate Mango JSON indexes in the background as well.
>> >>>
>> >>> Adam
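A minimal sketch of the three-step flow Adam describes above, in the fdb Python bindings. The ?DOCUMENTS layout, the shape of the index objects, and the byte-string value encoding are all assumptions for illustration:

```python
import fdb

fdb.api_version(610)

# Hypothetical ?DOCUMENTS layout: (doc_id, rev, field) -> encoded value.
DOCS = fdb.Subspace(('docs',))

@fdb.transactional
def update_json_indexes(tr, indexes, doc_id, parent_rev, new_fields):
    # new_fields maps field name -> encoded value (bytes) for the winning
    # edit; each index object carries its covered fields and its subspace.
    # 2) Read the parent revision's stored fields in one range read.
    old_fields = {DOCS.unpack(k)[-1]: v
                  for k, v in tr[DOCS.range((doc_id, parent_rev))]}
    for index in indexes:
        for field in index.fields:              # 1) fields this index covers
            old = old_fields.get(field)
            new = new_fields.get(field)
            if old == new:
                continue                        # 3) unchanged: nothing to do
            if old is not None:
                tr.clear(index.subspace.pack((old, doc_id)))
            if new is not None:
                tr[index.subspace.pack((new, doc_id))] = b''
```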