Heya Adam, thanks for re-focussing this. All I meant to say: let’s get your initial proposal implemented with the constraints you mention and take it from there, not make it even simpler than that.
Best Jan — > On 8. Feb 2019, at 04:35, Adam Kocoloski <kocol...@apache.org> wrote: > > Bob, Garren, Jan - heard you loud and clear, K.I.S.S. I do think it’s a bit > “simplistic" to exclusively choose simplicity over performance and storage > density. We’re (re)building a database here, one that has some users with > pretty demanding performance and scalability requirements. And yes, we should > certainly be testing and measuring. Kyle and team are setting up > infrastructure in IBM land to help with that now, but I also believe we can > design the data model and architecture with a basic performance model of > FoundationDB in mind: > > - reads cost 1ms > - short range reads are the same cost as a single lookup > - reads of independent parts of the keyspace can be parallelized for cheap > - writes are zero-cost until commit time > > We ought to be able to use these assumptions to drive some decisions about > data models ahead of any end-to-end performance test. > > If there are specific elements of the edit conflicts management where you > think greater simplicity is warranted, let’s get those called out. Ilya noted > (correctly, in my opinion) that the term sharing stuff is one of those items. > It’s relatively complex, potentially a performance hit, and only saves on > storage density in the corner case of lots of edit conflicts. That’s a good > one to drop. > > I’m relatively happy with the revision history data model at this point. > Hopefully folks find it easy to grok, and it’s efficient for both reads and > writes. It costs some extra storage for conflict revisions compared to the > current tree representation (up to 16K per edit branch, with default > _revs_limit) but knowing what we know about the performance death spiral for > wide revision trees today I’ll happily make a storage vs. 
performance > tradeoff here :) > > Setting the shared term approach aside, I’ve still been mulling over the key > structure for the actual document data: > > - I thought about trying to construct a special _conflicts subspace, but I > don’t like that approach because the choice of a “winning" revision can flip > back and forth very quickly with concurrent writers to different edit > branches. I think we really want to have a way for revisions to naturally > sort themselves so the winner is the first or last revision in a list. > > - Assuming we’re using key paths of the form (docid, revision-ish, path, to, > field), the goal here is to find an efficient way to get the last key with > prefix “docid” (assuming winner sorts last), and then all the keys that share > the same (docid, revision-ish) prefix as that one. I see two possible > approaches so far, neither perfect: > > Option 1: Execute a get_key() operation with a key selector that asks for the > last key less than “docid\xFF” (again assuming winner sorts last), and then > do a get_range_startswith() request setting the streaming mode to “want_all” > and the prefix to the docid plus whatever revision-ish we found from the > get_key() request. This is two roundtrips instead of one, but it always > retrieves exactly the right set of keys, and the second step is executed as > fast as possible. > > Option 2: Jump straight to get_range_startswith() request using only “docid” > as the prefix, then cancel the iteration once we reach a revision not equal > to the first one we see. We might transfer too much data, or we might end up > doing multiple roundtrips if the default “iterator” streaming mode sends too > little data to start (I haven’t checked what the default iteration block is > there), but in the typical case of zero edit conflicts we have a good chance > of retrieving the full document in one roundtrip. 
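[Editor's sketch] The two read strategies above can be made concrete with a small simulation. The snippet below models an FDB keyspace as an in-memory sorted list of `(docid, revision-ish, field)` tuples, with the winner sorting last. `read_winner_option1` mimics the get_key()-with-key-selector-then-range approach, and `read_winner_option2` mimics a single reverse range scan cancelled at the first revision change. The function names, the key layout, and the list-based store are all illustrative assumptions, not real fdb API:

```python
import bisect

# In-memory stand-in for an FDB keyspace: a sorted list of (key, value)
# pairs. Keys are (docid, revision_ish, field); the winning revision
# sorts last within each docid prefix.
store = sorted([
    (("foo", "1-abc", "active"), True),
    (("foo", "1-abc", "owner"), "alice"),
    (("foo", "2-def", "active"), False),
    (("foo", "2-def", "owner"), "bob"),
])
keys = [k for k, _ in store]

def read_winner_option1(docid):
    """Option 1: find the last key less than docid + 0xFF (the get_key()
    key-selector step), then range-read the (docid, winner_rev) prefix."""
    i = bisect.bisect_left(keys, (docid, "\xff")) - 1
    if i < 0 or keys[i][0] != docid:
        return None  # no revisions stored for this docid
    winner_rev = keys[i][1]
    prefix = (docid, winner_rev)
    return {k[2]: v for k, v in store if k[:2] == prefix}

def read_winner_option2(docid):
    """Option 2: a single range read over the docid prefix (in reverse,
    since the winner sorts last), cancelled once the revision changes."""
    doc, winner_rev = {}, None
    for k, v in reversed(store):
        if k[0] != docid:
            continue  # in FDB this would be a bounded range, not a filter
        if winner_rev is None:
            winner_rev = k[1]
        if k[1] != winner_rev:
            break  # reached an older revision; stop the iteration
        doc[k[2]] = v
    return doc
```

Both paths return the same winning revision body; the difference the email describes is purely in round-trip count and bytes transferred.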
> > I don’t have a good sense of which option wins out here from a performance > perspective, but they’re both operating on the same data model so easy enough > to test the alternatives. The important bit is getting the revision-ish > things to sort correctly. I think we can do that by generating something like > > revision-ish = NotDeleted/1bit : RevPos : RevHash > > with some suitable order-preserving encoding on the RevPos integer. > > Apologies for the long email. Happy for any comments, either here or over on > IRC. Cheers, > > Adam > >> On Feb 7, 2019, at 4:52 PM, Robert Newson <rnew...@apache.org> wrote: >> >> I think we should choose simple. We can then see if performance is too low >> or storage overhead too high and then see what we can do about it. >> >> B. >> >> -- >> Robert Samuel Newson >> rnew...@apache.org >> >> On Thu, 7 Feb 2019, at 20:36, Ilya Khlopotov wrote: >>> We cannot do simple thing if we want to support sharing of JSON terms. I >>> think if we want the simplest path we should move sharing out of the >>> scope. The problem with sharing is we need to know the location of >>> shared terms when we do write. This means that we have to read full >>> document on every write. There might be tricks to replace full document >>> read with some sort of hierarchical signature or sketch of a document. >>> However these tricks do not fall into simplest solution category. We >>> need to choose the design goals: >>> - simple >>> - performance >>> - reduced storage overhead >>> >>> best regards, >>> iilyak >>> >>> On 2019/02/07 12:45:34, Garren Smith <gar...@apache.org> wrote: >>>> I’m also in favor of keeping it really simple and then testing and >>>> measuring it. >>>> >>>> What is the best way to measure that we have something that works? I’m not >>>> sure just relying on our current tests will prove that? 
Should we define >>>> and build some more complex situations e.g docs with lots of conflicts or >>>> docs with wide revisions and make sure we can solve for those? >>>> >>>> On Thu, Feb 7, 2019 at 12:33 PM Jan Lehnardt <j...@apache.org> wrote: >>>> >>>>> I’m also very much in favour with starting with the simplest thing that >>>>> can possibly work and doesn’t go against the advertised best practices of >>>>> FoundationDB. Let’s get that going and get a feel for how it all works >>>>> together, before trying to optimise things we can’t measure yet. >>>>> >>>>> Best >>>>> Jan >>>>> — >>>>> >>>>>> On 6. Feb 2019, at 16:58, Robert Samuel Newson <rnew...@apache.org> >>>>> wrote: >>>>>> >>>>>> Hi, >>>>>> >>>>>> With the Redwood storage engine under development and with prefix >>>>> elision part of its design, I don’t think we should get too hung up on >>>>> adding complications and indirections in the key space just yet. We >>>>> haven’t >>>>> written a line of code or run any tests, this is premature optimisation. >>>>>> >>>>>> I’d like to focus on the simplest solution that yields all required >>>>> properties. We can embellish later (if warranted). >>>>>> >>>>>> I am intrigued by all the ideas that might allow us cheaper inserts and >>>>> updates than the current code where there are multiple edit branches in >>>>> the >>>>> stored document. >>>>>> >>>>>> B. >>>>>> >>>>>>> On 6 Feb 2019, at 02:18, Ilya Khlopotov <iil...@apache.org> wrote: >>>>>>> >>>>>>> While reading Adam's proposal I came to realize that: we don't have to >>>>> calculate winning revision at read time. >>>>>>> Since FDB's transactions are atomic we can calculate it when we write. >>>>> This means we can just write latest values into separate range. This makes >>>>> lookup of latest version fast. >>>>>>> Another realization is if we want to share values for some json paths >>>>> we would have to introduce a level of indirection. 
>>>>>>> Below is the data model inspired by Adam's idea to share json_paths. >>>>> In this model the json_path is stored in the revision where it was first >>>>> added (we call that revision an owner of a json_path). The values for >>>>> json_path key can be scalar values, parts of scalar values or pointers to >>>>> owner location. >>>>>>> The below snippets are sketches of transactions. >>>>>>> The transactions will include updates to other keys as needed >>>>> (`external_size`, `by_seq` and so on). The revision tree management is >>>>> not >>>>> covered yet. >>>>>>> The `rev -> vsn` indirection is not strictly required. It is added >>>>> because it saves some space since `rev` is a long string and `vsn` is FDB >>>>> versionstamp of fixed size. >>>>>>> >>>>>>> - `{NS} / {docid} / _by_rev / {rev} = vsn` >>>>>>> - `{NS} / {docid} / _used_by / {json_path} / {another_vsn} = NIL` >>>>>>> - `{NS} / {docid} / _data / {json_path} = latest_value | part` >>>>>>> - `{NS} / {docid} / {vsn} / _data / {json_path} = value | part | >>>>> {another_vsn}` >>>>>>> >>>>>>> ``` >>>>>>> write(txn, doc_id, prev_rev, json): >>>>>>> txn.add_write_conflict_key("{NS} / {doc_id} / _rev") >>>>>>> rev = generate_new_rev() >>>>>>> txn["{NS} / {docid} / _by_rev / {rev}"] = vsn >>>>>>> for every json_path in flattened json >>>>>>> - {NS} / {docid} / _used_by / {json_path} / {another_vsn} = NIL >>>>>>> if rev is HEAD: >>>>>>> # this range contains values for all json paths for the latest >>>>> revision (read optimization) >>>>>>> - {NS} / {docid} / _data / {json_path} = latest_value | part >>>>>>> - {NS} / {docid} / {vsn} / _data / {json_path} = value | part | >>>>> {another_vsn} >>>>>>> txn["{NS} / {doc_id} / _rev"] = rev >>>>>>> >>>>>>> get_current(txn, doc_id): >>>>>>> # there is no sharing of json_paths in this range (read optimization) >>>>>>> txn.get_range("{NS} / {docid} / _data / 0x00", "{NS} / {docid} / _data >>>>> / 0xFF" ) >>>>>>> >>>>>>> get_revision(txn, doc_id, rev): >>>>>>> vsn = 
txn["{NS} / {docid} / _by_rev / {rev}"] >>>>>>> json_paths = txn.get_range("{NS} / {vsn} / {docid} / _data / 0x00", >>>>> "{NS} / {vsn} / {docid} / _data / 0xFF" ) >>>>>>> for every json_path in json_paths: >>>>>>> if value has type vsn: >>>>>>> another_vsn = value >>>>>>> value = txn["{NS} / {docid} / {another_vsn} / _data / >>>>> {json_path}"] >>>>>>> result[json_path] = value >>>>>>> >>>>>>> delete_revision(txn, doc_id, rev): >>>>>>> vsn = txn["{NS} / {docid} / _by_rev / {rev}"] >>>>>>> json_paths = txn.get_range("{NS} / {vsn} / {docid} / _data / 0x00", >>>>> "{NS} / {vsn} / {docid} / _data / 0xFF" ) >>>>>>> for every json_path in json_paths: >>>>>>> if value has type vsn: >>>>>>> # remove reference to deleted revision from the owner >>>>>>> del txn[{NS} / {docid} / _used_by / {json_path} / {vsn}] >>>>>>> # check if deleted revision of json_path is not used by anything else >>>>>>> if txn.get_range("{NS} / {docid} / _used_by / {json_path} / {vsn}", >>>>> limit=1) == []: >>>>>>> del txn["{NS} / {docid} / {vsn} / _data / {json_path}"] >>>>>>> if vsn is HEAD: >>>>>>> copy range for winning revision into "{NS} / {docid} / _data / >>>>> {json_path}" >>>>>>> ``` >>>>>>> >>>>>>> best regards, >>>>>>> iilyak >>>>>>> >>>>>>> On 2019/02/04 23:22:09, Adam Kocoloski <kocol...@apache.org> wrote: >>>>>>>> I think it’s fine to start a focused discussion here as it might help >>>>> inform some of the broader debate over in that thread. >>>>>>>> >>>>>>>> As a reminder, today CouchDB writes the entire body of each document >>>>> revision on disk as a separate blob. Edit conflicts that have common >>>>> fields >>>>> between them do not share any storage on disk. The revision tree is >>>>> encoded >>>>> into a compact format and a copy of it is stored directly in both the >>>>> by_id >>>>> tree and the by_seq tree. Each leaf entry in the revision tree contains a >>>>> pointer to the position of the associated doc revision on disk. 
>>>>>>>> >>>>>>>> As a further reminder, CouchDB 2.x clusters can generate edit conflict >>>>> revisions just from multiple clients concurrently updating the same >>>>> document in a single cluster. This won’t happen when FoundationDB is >>>>> running under the hood, but users who deploy multiple CouchDB or PouchDB >>>>> servers and replicate between them can of course still produce conflicts >>>>> just like they could in CouchDB 1.x, so we need a solution. >>>>>>>> >>>>>>>> Let’s consider the two sub-topics separately: 1) storage of edit >>>>> conflict bodies and 2) revision trees >>>>>>>> >>>>>>>> ## Edit Conflict Storage >>>>>>>> >>>>>>>> The simplest possible solution would be to store each document >>>>> revision separately, like we do today. We could store document bodies with >>>>> (“docid”, “revid”) as the key prefix, and each transaction could clear the >>>>> key range associated with the base revision against which the edit is >>>>> being >>>>> attempted. This would work, but I think we can try to be a bit more clever >>>>> and save on storage space given that we’re splitting JSON documents into >>>>> multiple KV pairs. >>>>>>>> >>>>>>>> One thought I’d had is to introduce a special enum Value which >>>>> indicates that the subtree “beneath” the given Key is in conflict. For >>>>> example, consider the documents >>>>>>>> >>>>>>>> { >>>>>>>> “_id”: “foo”, >>>>>>>> “_rev”: “1-abc”, >>>>>>>> “owner”: “alice”, >>>>>>>> “active”: true >>>>>>>> } >>>>>>>> >>>>>>>> and >>>>>>>> >>>>>>>> { >>>>>>>> “_id”: “foo”, >>>>>>>> “_rev”: “1-def”, >>>>>>>> “owner”: “bob”, >>>>>>>> “active”: true >>>>>>>> } >>>>>>>> >>>>>>>> We could represent these using the following set of KVs: >>>>>>>> >>>>>>>> (“foo”, “active”) = true >>>>>>>> (“foo”, “owner”) = kCONFLICT >>>>>>>> (“foo”, “owner”, “1-abc”) = “alice” >>>>>>>> (“foo”, “owner”, “1-def”) = “bob” >>>>>>>> >>>>>>>> This approach also extends to conflicts where the two versions have >>>>> different data types. 
Consider a more complicated example where bob >>>>> dropped >>>>> the “active” field and changed the “owner” field to an object: >>>>>>>> >>>>>>>> { >>>>>>>> “_id”: “foo”, >>>>>>>> “_rev”: “1-def”, >>>>>>>> “owner”: { >>>>>>>> “name”: “bob”, >>>>>>>> “email”: “b...@example.com" >>>>>>>> } >>>>>>>> } >>>>>>>> >>>>>>>> Now the set of KVs for “foo” looks like this (note that a missing >>>>> field needs to be handled explicitly): >>>>>>>> >>>>>>>> (“foo”, “active”) = kCONFLICT >>>>>>>> (“foo”, “active”, “1-abc”) = true >>>>>>>> (“foo”, “active”, “1-def”) = kMISSING >>>>>>>> (“foo”, “owner”) = kCONFLICT >>>>>>>> (“foo”, “owner”, “1-abc”) = “alice” >>>>>>>> (“foo”, “owner”, “1-def”, “name”) = “bob” >>>>>>>> (“foo”, “owner”, “1-def”, “email”) = “b...@example.com” >>>>>>>> >>>>>>>> I like this approach for the common case where documents share most of >>>>> their data in common but have a conflict in a very specific field or set >>>>> of >>>>> fields. >>>>>>>> >>>>>>>> I’ve encountered one important downside, though: an edit that >>>>> replicates in and conflicts with the entire document can cause a bit of a >>>>> data explosion. Consider a case where I have 10 conflicting versions of a >>>>> 100KB document, but the conflicts are all related to a single scalar >>>>> value. >>>>> Now I replicate in an empty document, and suddenly I have a kCONFLICT at >>>>> the root. In this model I now need to list out every path of every one of >>>>> the 10 existing revisions and I end up with a 1MB update. Yuck. That’s >>>>> technically no worse in the end state than the “zero sharing” case above, >>>>> but one could easily imagine overrunning the transaction size limit this >>>>> way. >>>>>>>> >>>>>>>> I suspect there’s a smart path out of this. Maybe the system detects a >>>>> “default” value for each field and uses that instead of writing out the >>>>> value for every revision in a conflicted subtree. Worth some discussion. 
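[Editor's sketch] The kCONFLICT layout above can be illustrated for the simple scalar-conflict case. The `KCONFLICT`/`KMISSING` sentinels and the `flatten`/`merge_revisions` helpers below are hypothetical names; this sketch only handles conflicts between scalar values (the subtree-typed conflict in the second example, and the data-explosion case, would need additional handling):

```python
KCONFLICT = object()  # sentinel: subtree beneath this key is in conflict
KMISSING = object()   # sentinel: field absent in this revision

def flatten(doc, prefix=()):
    """Flatten nested JSON into {key_path_tuple: scalar} pairs."""
    out = {}
    for field, value in doc.items():
        if isinstance(value, dict):
            out.update(flatten(value, prefix + (field,)))
        else:
            out[prefix + (field,)] = value
    return out

def merge_revisions(revs):
    """Build the KV set for a doc with conflicting revisions.

    Fields equal across all revisions get a single shared KV; fields
    that differ get a KCONFLICT marker plus one KV per revision, with
    KMISSING standing in when a revision lacks the field."""
    flat = {rev: flatten(body) for rev, body in revs.items()}
    paths = set().union(*(f.keys() for f in flat.values()))
    kvs = {}
    for path in paths:
        values = {rev: f.get(path, KMISSING) for rev, f in flat.items()}
        distinct = set(values.values())
        if len(distinct) == 1:
            kvs[path] = distinct.pop()          # shared between revisions
        else:
            kvs[path] = KCONFLICT               # mark the conflict point
            for rev, v in values.items():
                kvs[path + (rev,)] = v          # one KV per revision
    return kvs
```

Running this on the first example (`owner` differs, `active` agrees) reproduces the KV listing from the email: one shared key for `active`, and a kCONFLICT marker plus per-revision keys for `owner`.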
>>>>>>>> >>>>>>>> ## Revision Trees >>>>>>>> >>>>>>>> In CouchDB we currently represent revisions as a hash history tree; >>>>> each revision identifier is derived from the content of the revision >>>>> including the revision identifier of its parent. Individual edit branches >>>>> are bounded in *length* (I believe the default is 1000 entries), but the >>>>> number of edit branches is technically unbounded. >>>>>>>> >>>>>>>> The size limits in FoundationDB preclude us from storing the entire >>>>> key tree as a single value; in pathological situations the tree could >>>>> exceed 100KB. Rather, I think it would make sense to store each edit >>>>> *branch* as a separate KV. We stem the branch long before it hits the >>>>> value >>>>> size limit, and in the happy case of no edit conflicts this means we store >>>>> the edit history metadata in a single KV. It also means that we can apply >>>>> an interactive edit without retrieving the entire conflicted revision >>>>> tree; >>>>> we need only retrieve and modify the single branch against which the edit >>>>> is being applied. The downside is that we duplicate historical revision >>>>> identifiers shared by multiple edit branches, but I think this is a >>>>> worthwhile tradeoff. >>>>>>>> >>>>>>>> I would furthermore try to structure the keys so that it is possible >>>>> to retrieve the “winning” revision in a single limit=1 range query. >>>>> Ideally >>>>> I’d like to provide the following properties: >>>>>>>> >>>>>>>> 1) a document read does not need to retrieve the revision tree at all, >>>>> just the winning revision identifier (which would be stored with the rest >>>>> of the doc) >>>>>>>> 2) a document update only needs to read the edit branch of the >>>>> revision tree against which the update is being applied, and it can read >>>>> that branch immediately knowing only the content of the edit that is being >>>>> attempted (i.e., it does not need to read the current version of the >>>>> document itself). 
>>>>>>>> >>>>>>>> So, I’d propose a separate subspace (maybe “_meta”?) for the revision >>>>> trees, with keys and values that look like >>>>>>>> >>>>>>>> (“_meta”, DocID, IsDeleted, RevPosition, RevHash) = [ParentRev, >>>>> GrandparentRev, …] >>>>>>>> >>>>>>>> The inclusion of IsDeleted, RevPosition and RevHash in the key should >>>>> be sufficient (with the right encoding) to create a range query that >>>>> automatically selects the “winner” according to CouchDB’s arcane rules, >>>>> which are something like >>>>>>>> >>>>>>>> 1) deleted=false beats deleted=true >>>>>>>> 2) longer paths (i.e. higher RevPosition) beat shorter ones >>>>>>>> 3) RevHashes with larger binary values beat ones with smaller values >>>>>>>> >>>>>>>> =========== >>>>>>>> >>>>>>>> OK, that’s all on this topic from me for now. I think this is a >>>>> particularly exciting area where we start to see the dividends of >>>>> splitting >>>>> up data into multiple KV pairs in FoundationDB :) Cheers, >>>>>>>> >>>>>>>> Adam >>>>>>>> >>>>>>>> >>>>>>>>> On Feb 4, 2019, at 2:41 PM, Robert Newson <rnew...@apache.org> wrote: >>>>>>>>> >>>>>>>>> This one is quite tightly coupled to the other thread on data model, >>>>> should we start much conversation here before that one gets closer to a >>>>> solution? >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Robert Samuel Newson >>>>>>>>> rnew...@apache.org >>>>>>>>> >>>>>>>>> On Mon, 4 Feb 2019, at 19:25, Ilya Khlopotov wrote: >>>>>>>>>> This is a beginning of a discussion thread about storage of edit >>>>>>>>>> conflicts and everything which relates to revisions. >>>>>>>>>> >>>>>>>>>> >>>>>>>> >>>>>>>> >>>>>> >>>>> >>>>> -- >>>>> Professional Support for Apache CouchDB: >>>>> https://neighbourhood.ie/couchdb-support/ >>>>> >>>>> >>>> > -- Professional Support for Apache CouchDB: https://neighbourhood.ie/couchdb-support/
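[Editor's sketch] The winner-selection rules above (and the `NotDeleted : RevPos : RevHash` "revision-ish" from earlier in the thread) can be captured with an order-preserving key encoding, so that a plain bytewise sort puts the CouchDB winner last. The byte layout below (one flag byte, big-endian 32-bit RevPosition, raw RevHash bytes) is one possible choice, not a settled format:

```python
import struct

def rev_key(deleted, rev_pos, rev_hash):
    """Encode (IsDeleted, RevPosition, RevHash) so that a bytewise sort
    puts the winning revision last, per the CouchDB rules:
    1) deleted=False beats deleted=True  -> False encodes as 0x01 (sorts
       after 0x00 for True)
    2) higher RevPosition beats lower    -> big-endian unsigned int
    3) larger RevHash beats smaller      -> raw bytes compare last"""
    flag = b"\x00" if deleted else b"\x01"
    return flag + struct.pack(">I", rev_pos) + rev_hash
```

With this encoding, a limit=1 reverse range read over the doc's revision subspace returns the winner directly; in pure Python the same pick is `max(revs, key=...)`. Note a non-deleted revision wins even against a deleted branch with a longer path, matching rule 1 taking precedence.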