Ah, that all sounds good. The only thing I'm not initially seeing as obvious is how we look up a revision path to extend during replication when the previous revision may be anywhere in the list of $revs_limit revisions. Feels like there might be some sort of trickery to do that efficiently. Although it may also be good enough to just issue $revs_limit lookups in parallel, given that we're capped at either $revs_limit or 2*$revs_limit lookups if we have to check for both deleted and not.
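For what it's worth, the "just issue them all in parallel" fallback amounts to enumerating every key that might name an existing edit branch and fetching them concurrently. A rough sketch, with the key shape borrowed from the `("_meta", DocID, NotDeleted, RevPosition, RevHash)` proposal below; all names here are illustrative, not a real CouchDB/FoundationDB API:

```python
# Hypothetical sketch of the parallel-lookup fallback: for an incoming edit
# path of up to $revs_limit ancestors, build every ("_meta", ...) key that
# could name an existing edit branch. In a real FDB client these would be
# issued as parallel txn.get() futures within one transaction.

def candidate_branch_keys(doc_id, rev_path, check_deleted=True):
    """rev_path is a list of (rev_pos, rev_hash) ancestors, newest first."""
    # checking both deleted states doubles the lookups: 2 * $revs_limit
    flags = (1, 0) if check_deleted else (1,)
    return [
        ("_meta", doc_id, not_deleted, rev_pos, rev_hash)
        for rev_pos, rev_hash in rev_path
        for not_deleted in flags
    ]
```

Since the reads hit independent parts of the keyspace, the cost model discussed later in the thread suggests they parallelize cheaply.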
On Fri, Feb 8, 2019 at 10:22 AM Adam Kocoloski <kocol...@apache.org> wrote: > > Totally unacceptable! ;) In fact some key bits of that model got dispersed > into at least two separate emails so you’re likely not the only one. I’ll > restate here: > > The size limits in FoundationDB preclude us from storing the entire key tree > as a single value; in pathological situations the tree could exceed 100KB. > Rather, I think it would make sense to store each edit *branch* as a separate > KV. We stem the branch long before it hits the value size limit, and in the > happy case of no edit conflicts this means we store the edit history metadata > in a single KV. It also means that we can apply an interactive edit without > retrieving the entire conflicted revision tree; we need only retrieve and > modify the single branch against which the edit is being applied. The > downside is that we may duplicate historical revision identifiers shared by > multiple edit branches, but I think this is a worthwhile tradeoff. > > I’d also ensure that a document update only needs to read the edit branch KV > against which the update is being applied, and it can read that branch > immediately knowing only the content of the edit that is being attempted > (i.e., it does not need to read the current version of the document itself). > > I think we achieve these goals with a separate subspace (maybe “_meta”?) for > the revision trees, with keys and values that look like > > (“_meta”, DocID, NotDeleted, RevPosition, RevHash) = > (VersionstampForCurrentRev, [ParentRev, GrandparentRev, …]) > > Some notes: > > - including IsDeleted ensures that we can efficiently accept the case where > we upload a new document with the same ID where all previous edit branches > have been deleted; i.e. 
we can construct a key selector which automatically > tells us there are no deleted=false edit branches > - access to VersionstampForCurrentRev ensures we can clear the old entry in > the by_seq space during the update > - i need to remind myself how we handle an edit attempt which supplies a _rev > representing a deleted leaf. Do we fail that as a conflict? That would be the > natural thing to do here, otherwise we’re forced to check both deleted=false > and deleted=true keys > - the keys can be made to naturally sort so that the winning revision sorts > last, but I don’t believe that’s a requirement here like it is for the actual > document data space > > Cheers, Adam > > > On Feb 8, 2019, at 8:59 AM, Paul Davis <paul.joseph.da...@gmail.com> wrote: > > > >> I’m relatively happy with the revision history data model at this point. > > > > I forgot to make a note, but which of the various models are you > > referring to by "revision history data model". There's been so many > > without firm names that my brain is having a hard time parsing that > > one. > > > > On Thu, Feb 7, 2019 at 9:35 PM Adam Kocoloski <kocol...@apache.org> wrote: > >> > >> Bob, Garren, Jan - heard you loud and clear, K.I.S.S. I do think it’s a > >> bit “simplistic" to exclusively choose simplicity over performance and > >> storage density. We’re (re)building a database here, one that has some > >> users with pretty demanding performance and scalability requirements. And > >> yes, we should certainly be testing and measuring. 
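A toy rendering of the one-KV-per-edit-branch shape described above, with a plain dict standing in for FoundationDB's tuple layer; the function names are illustrative, not an actual API:

```python
# In-memory sketch of the proposed revision-tree subspace: one KV per edit
# branch, ("_meta", DocID, NotDeleted, RevPos, RevHash) ->
# (VersionstampForCurrentRev, [ParentRev, GrandparentRev, ...]).
# A dict stands in for FoundationDB; all names here are assumptions.

meta = {}

def write_branch(doc_id, not_deleted, rev_pos, rev_hash, versionstamp, ancestors):
    key = ("_meta", doc_id, 1 if not_deleted else 0, rev_pos, rev_hash)
    meta[key] = (versionstamp, ancestors)

def read_branch(doc_id, not_deleted, rev_pos, rev_hash):
    # an interactive edit reads only this single branch KV -- never the
    # whole (possibly conflicted) revision tree
    return meta[("_meta", doc_id, 1 if not_deleted else 0, rev_pos, rev_hash)]
```

The stored versionstamp is what lets the update transaction clear the old `by_seq` entry without any extra read.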
Kyle and team are > >> setting up infrastructure in IBM land to help with that now, but I also > >> believe we can design the data model and architecture with a basic > >> performance model of FoundationDB in mind: > >> > >> - reads cost 1ms > >> - short range reads are the same cost as a single lookup > >> - reads of independent parts of the keyspace can be parallelized for cheap > >> - writes are zero-cost until commit time > >> > >> We ought to be able to use these assumptions to drive some decisions about > >> data models ahead of any end-to-end performance test. > >> > >> If there are specific elements of the edit conflicts management where you > >> think greater simplicity is warranted, let’s get those called out. Ilya > >> noted (correctly, in my opinion) that the term sharing stuff is one of > >> those items. It’s relatively complex, potentially a performance hit, and > >> only saves on storage density in the corner case of lots of edit > >> conflicts. That’s a good one to drop. > >> > >> I’m relatively happy with the revision history data model at this point. > >> Hopefully folks find it easy to grok, and it’s efficient for both reads > >> and writes. It costs some extra storage for conflict revisions compared to > >> the current tree representation (up to 16K per edit branch, with default > >> _revs_limit) but knowing what we know about the performance death spiral > >> for wide revision trees today I’ll happily make a storage vs. performance > >> tradeoff here :) > >> > >> Setting the shared term approach aside, I’ve still been mulling over the > >> key structure for the actual document data: > >> > >> - I thought about trying to construct a special _conflicts subspace, but > >> I don’t like that approach because the choice of a “winning" revision can > >> flip back and forth very quickly with concurrent writers to different edit > >> branches. 
I think we really want to have a way for revisions to naturally > >> sort themselves so the winner is the first or last revision in a list. > >> > >> - Assuming we’re using key paths of the form (docid, revision-ish, path, > >> to, field), the goal here is to find an efficient way to get the last key > >> with prefix “docid” (assuming winner sorts last), and then all the keys > >> that share the same (docid, revision-ish) prefix as that one. I see two > >> possible approaches so far, neither perfect: > >> > >> Option 1: Execute a get_key() operation with a key selector that asks for > >> the last key less than “docid\xFF” (again assuming winner sorts last), and > >> then do a get_range_startswith() request setting the streaming mode to > >> “want_all” and the prefix to the docid plus whatever revision-ish we found > >> from the get_key() request. This is two roundtrips instead of one, but it > >> always retrieves exactly the right set of keys, and the second step is > >> executed as fast as possible. > >> > >> Option 2: Jump straight to get_range_startswith() request using only > >> “docid” as the prefix, then cancel the iteration once we reach a revision > >> not equal to the first one we see. We might transfer too much data, or we > >> might end up doing multiple roundtrips if the default “iterator” streaming > >> mode sends too little data to start (I haven’t checked what the default > >> iteration block is there), but in the typical case of zero edit conflicts > >> we have a good chance of retrieving the full document in one roundtrip. > >> > >> I don’t have a good sense of which option wins out here from a performance > >> perspective, but they’re both operating on the same data model so easy > >> enough to test the alternatives. The important bit is getting the > >> revision-ish things to sort correctly. 
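Option 1 can be modeled against an in-memory sorted key space to make the two steps concrete; `bisect` stands in for FDB's key selectors and `get_range_startswith`, and the key tuples are made up for illustration:

```python
# Rough in-memory model of Option 1: step one finds the last key under
# "docid" (the winner sorts last), step two reads every key sharing that
# (docid, revision-ish) prefix. In FDB this would be a get_key() with a key
# selector followed by a get_range_startswith() in "want_all" mode.
import bisect

def read_winner(keys, doc_id):
    """keys: sorted list of tuples like (docid, revision_ish, field)."""
    # "last key less than (docid, 0xFF)"
    i = bisect.bisect_left(keys, (doc_id, "\xff")) - 1
    if i < 0 or keys[i][0] != doc_id:
        return []
    prefix = keys[i][:2]                  # (docid, winning revision-ish)
    lo = bisect.bisect_left(keys, prefix)
    return [k for k in keys[lo:] if k[:2] == prefix]
```

Two roundtrips, but the result set is always exactly the winning revision's keys, matching the tradeoff described above.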
I think we can do that by > >> generating something like > >> > >> revision-ish = NotDeleted/1bit : RevPos : RevHash > >> > >> with some suitable order-preserving encoding on the RevPos integer. > >> > >> Apologies for the long email. Happy for any comments, either here or over > >> on IRC. Cheers, > >> > >> Adam > >> > >>> On Feb 7, 2019, at 4:52 PM, Robert Newson <rnew...@apache.org> wrote: > >>> > >>> I think we should choose simple. We can then see if performance is too > >>> low or storage overhead too high and then see what we can do about it. > >>> > >>> B. > >>> > >>> -- > >>> Robert Samuel Newson > >>> rnew...@apache.org > >>> > >>> On Thu, 7 Feb 2019, at 20:36, Ilya Khlopotov wrote: > >>>> We cannot do simple thing if we want to support sharing of JSON terms. I > >>>> think if we want the simplest path we should move sharing out of the > >>>> scope. The problem with sharing is we need to know the location of > >>>> shared terms when we do write. This means that we have to read full > >>>> document on every write. There might be tricks to replace full document > >>>> read with some sort of hierarchical signature or sketch of a document. > >>>> However these tricks do not fall into simplest solution category. We > >>>> need to choose the design goals: > >>>> - simple > >>>> - performance > >>>> - reduced storage overhead > >>>> > >>>> best regards, > >>>> iilyak > >>>> > >>>> On 2019/02/07 12:45:34, Garren Smith <gar...@apache.org> wrote: > >>>>> I’m also in favor of keeping it really simple and then testing and > >>>>> measuring it. > >>>>> > >>>>> What is the best way to measure that we have something that works? I’m > >>>>> not > >>>>> sure just relying on our current tests will prove that? Should we define > >>>>> and build some more complex situations e.g docs with lots of conflicts > >>>>> or > >>>>> docs with wide revisions and make sure we can solve for those? 
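The order-preserving `NotDeleted : RevPos : RevHash` encoding Adam sketches above could look something like this; the 1-byte flag and 4-byte big-endian RevPos widths are assumptions, the point being that plain byte-wise comparison of the encoded values then matches the winner ordering:

```python
# Hedged sketch of a "revision-ish" encoding whose raw byte order matches
# CouchDB's winner rules when the winner sorts last: deleted branches sort
# first, then ascending RevPos, then ascending RevHash. Big-endian
# fixed-width packing preserves integer order across byte boundaries.
import struct

def revision_ish(not_deleted, rev_pos, rev_hash):
    flag = b"\x01" if not_deleted else b"\x00"
    return flag + struct.pack(">I", rev_pos) + rev_hash
```

A variable-length RevPos encoding would break ordering at width boundaries (e.g. 255 vs 256), which is why the fixed width matters.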
> >>>>> > >>>>> On Thu, Feb 7, 2019 at 12:33 PM Jan Lehnardt <j...@apache.org> wrote: > >>>>> > >>>>>> I’m also very much in favour with starting with the simplest thing that > >>>>>> can possibly work and doesn’t go against the advertised best practices > >>>>>> of > >>>>>> FoundationDB. Let’s get that going and get a feel for how it all works > >>>>>> together, before trying to optimise things we can’t measure yet. > >>>>>> > >>>>>> Best > >>>>>> Jan > >>>>>> — > >>>>>> > >>>>>>> On 6. Feb 2019, at 16:58, Robert Samuel Newson <rnew...@apache.org> > >>>>>> wrote: > >>>>>>> > >>>>>>> Hi, > >>>>>>> > >>>>>>> With the Redwood storage engine under development and with prefix > >>>>>> elision part of its design, I don’t think we should get too hung up on > >>>>>> adding complications and indirections in the key space just yet. We > >>>>>> haven’t > >>>>>> written a line of code or run any tests, this is premature > >>>>>> optimisation. > >>>>>>> > >>>>>>> I’d like to focus on the simplest solution that yields all required > >>>>>> properties. We can embellish later (if warranted). > >>>>>>> > >>>>>>> I am intrigued by all the ideas that might allow us cheaper inserts > >>>>>>> and > >>>>>> updates than the current code where there are multiple edit branches > >>>>>> in the > >>>>>> stored document. > >>>>>>> > >>>>>>> B. > >>>>>>> > >>>>>>>> On 6 Feb 2019, at 02:18, Ilya Khlopotov <iil...@apache.org> wrote: > >>>>>>>> > >>>>>>>> While reading Adam's proposal I came to realize that: we don't have > >>>>>>>> to > >>>>>> calculate winning revision at read time. > >>>>>>>> Since FDB's transactions are atomic we can calculate it when we > >>>>>>>> write. > >>>>>> This means we can just write latest values into separate range. This > >>>>>> makes > >>>>>> lookup of latest version fast. > >>>>>>>> Another realization is if we want to share values for some json paths > >>>>>> we would have to introduce a level of indirection. 
> >>>>>>>> Below is the data model inspired by Adam's idea to share json_paths. > >>>>>> In this model the json_path is stored in the revision where it was > >>>>>> first > >>>>>> added (we call that revision an owner of a json_path). The values for > >>>>>> json_path key can be scalar values, parts of scalar values or pointers > >>>>>> to > >>>>>> owner location. > >>>>>>>> The below snippets are sketches of transactions. > >>>>>>>> The transactions will include updates to other keys as needed > >>>>>> (`external_size`, `by_seq` and so on). The revision tree management > >>>>>> is not > >>>>>> covered yet. > >>>>>>>> The `rev -> vsn` indirection is not strictly required. It is added > >>>>>> because it saves some space since `rev` is a long string and `vsn` is > >>>>>> FDB > >>>>>> versionstamp of fixed size. > >>>>>>>> > >>>>>>>> - `{NS} / {docid} / _by_rev / {rev} = vsn` > >>>>>>>> - `{NS} / {docid} / _used_by / {json_path} / {another_vsn} = NIL` > >>>>>>>> - `{NS} / {docid} / _data / {json_path} = latest_value | part` > >>>>>>>> - `{NS} / {docid} / {vsn} / _data / {json_path} = value | part | > >>>>>> {another_vsn}` > >>>>>>>> > >>>>>>>> ``` > >>>>>>>> write(txn, doc_id, prev_rev, json): > >>>>>>>> txn.add_write_conflict_key("{NS} / {doc_id} / _rev") > >>>>>>>> rev = generate_new_rev() > >>>>>>>> txn["{NS} / {docid} / _by_rev / {rev}"] = vsn > >>>>>>>> for every json_path in flattened json > >>>>>>>> - {NS} / {docid} / _used_by / {json_path} / {another_vsn} = NIL > >>>>>>>> if rev is HEAD: > >>>>>>>> # this range contains values for all json paths for the latest > >>>>>> revision (read optimization) > >>>>>>>> - {NS} / {docid} / _data / {json_path} = latest_value | part > >>>>>>>> - {NS} / {docid} / {vsn} / _data / {json_path} = value | part | > >>>>>> {another_vsn} > >>>>>>>> txn["{NS} / {doc_id} / _rev"] = rev > >>>>>>>> > >>>>>>>> get_current(txn, doc_id): > >>>>>>>> # there is no sharing of json_paths in this range (read optimization) > >>>>>>>>
txn.get_range("{NS} / {docid} / _data / 0x00", "{NS} / {docid} / > >>>>>>>> _data > >>>>>> / 0xFF" ) > >>>>>>>> > >>>>>>>> get_revision(txn, doc_id, rev): > >>>>>>>> vsn = txn["{NS} / {docid} / _by_rev / {rev}"] > >>>>>>>> json_paths = txn.get_range("{NS} / {docid} / {vsn} / _data / 0x00", > >>>>>> "{NS} / {docid} / {vsn} / _data / 0xFF" ) > >>>>>>>> for every json_path in json_paths: > >>>>>>>> if value has type vsn: > >>>>>>>> another_vsn = value > >>>>>>>> value = txn["{NS} / {docid} / {another_vsn} / _data / > >>>>>> {json_path}"] > >>>>>>>> result[json_path] = value > >>>>>>>> > >>>>>>>> delete_revision(txn, doc_id, rev): > >>>>>>>> vsn = txn["{NS} / {docid} / _by_rev / {rev}"] > >>>>>>>> json_paths = txn.get_range("{NS} / {docid} / {vsn} / _data / 0x00", > >>>>>> "{NS} / {docid} / {vsn} / _data / 0xFF" ) > >>>>>>>> for every json_path in json_paths: > >>>>>>>> if value has type vsn: > >>>>>>>> # remove reference to deleted revision from the owner > >>>>>>>> del txn["{NS} / {docid} / _used_by / {json_path} / {vsn}"] > >>>>>>>> # check if deleted revision of json_path is not used by anything else > >>>>>>>> if txn.get_range("{NS} / {docid} / _used_by / {json_path} / {vsn}", > >>>>>> limit=1) == []: > >>>>>>>> del txn["{NS} / {docid} / {vsn} / _data / {json_path}"] > >>>>>>>> if vsn is HEAD: > >>>>>>>> copy range for winning revision into "{NS} / {docid} / _data / > >>>>>> {json_path}" > >>>>>>>> ``` > >>>>>>>> > >>>>>>>> best regards, > >>>>>>>> iilyak > >>>>>>>> > >>>>>>>> On 2019/02/04 23:22:09, Adam Kocoloski <kocol...@apache.org> wrote: > >>>>>>>>> I think it’s fine to start a focused discussion here as it might > >>>>>>>>> help > >>>>>> inform some of the broader debate over in that thread. > >>>>>>>>> > >>>>>>>>> As a reminder, today CouchDB writes the entire body of each document > >>>>>> revision on disk as a separate blob. Edit conflicts that have common > >>>>>> fields > >>>>>> between them do not share any storage on disk. 
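The reference-tracking idea in Ilya's `_used_by` sketch above (a shared json_path value is only cleared once no revision still points at it) can be modeled in a few lines, with dicts and sets standing in for the FDB ranges; this simplifies the transaction semantics and all names are illustrative:

```python
# In-memory model of the _used_by bookkeeping: the revision that first adds
# a json_path "owns" its value, later revisions store pointers to the
# owner, and the owner's copy is cleared only when the last referencing
# versionstamp goes away.

data = {}      # (docid, vsn, json_path) -> value or ("ptr", owner_vsn)
used_by = {}   # (docid, json_path) -> set of referencing vsns

def put_value(doc_id, owner_vsn, json_path, value):
    data[(doc_id, owner_vsn, json_path)] = value
    used_by.setdefault((doc_id, json_path), set()).add(owner_vsn)

def add_pointer(doc_id, vsn, json_path, owner_vsn):
    data[(doc_id, vsn, json_path)] = ("ptr", owner_vsn)
    used_by[(doc_id, json_path)].add(vsn)

def delete_revision_path(doc_id, vsn, json_path, owner_vsn):
    data.pop((doc_id, vsn, json_path), None)
    refs = used_by.get((doc_id, json_path), set())
    refs.discard(vsn)
    if not refs:  # nothing references the owner's copy any more
        data.pop((doc_id, owner_vsn, json_path), None)
```

As the thread notes, this indirection is exactly the part that forces extra reads at write time, which is the argument for dropping term sharing from the simplest design.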
The revision tree is > >>>>>> encoded > >>>>>> into a compact format and a copy of it is stored directly in both the > >>>>>> by_id > >>>>>> tree and the by_seq tree. Each leaf entry in the revision tree contains > >>>>>> a > >>>>>> pointer to the position of the associated doc revision on disk. > >>>>>>>>> > >>>>>>>>> As a further reminder, CouchDB 2.x clusters can generate edit > >>>>>>>>> conflict > >>>>>> revisions just from multiple clients concurrently updating the same > >>>>>> document in a single cluster. This won’t happen when FoundationDB is > >>>>>> running under the hood, but users who deploy multiple CouchDB or > >>>>>> PouchDB > >>>>>> servers and replicate between them can of course still produce > >>>>>> conflicts > >>>>>> just like they could in CouchDB 1.x, so we need a solution. > >>>>>>>>> > >>>>>>>>> Let’s consider the two sub-topics separately: 1) storage of edit > >>>>>> conflict bodies and 2) revision trees > >>>>>>>>> > >>>>>>>>> ## Edit Conflict Storage > >>>>>>>>> > >>>>>>>>> The simplest possible solution would be to store each document > >>>>>> revision separately, like we do today. We could store document bodies > >>>>>> with > >>>>>> (“docid”, “revid”) as the key prefix, and each transaction could clear > >>>>>> the > >>>>>> key range associated with the base revision against which the edit is > >>>>>> being > >>>>>> attempted. This would work, but I think we can try to be a bit more > >>>>>> clever > >>>>>> and save on storage space given that we’re splitting JSON documents > >>>>>> into > >>>>>> multiple KV pairs. > >>>>>>>>> > >>>>>>>>> One thought I’d had is to introduce a special enum Value which > >>>>>> indicates that the subtree “beneath” the given Key is in conflict. 
For > >>>>>> example, consider the documents > >>>>>>>>> > >>>>>>>>> { > >>>>>>>>> “_id”: “foo”, > >>>>>>>>> “_rev”: “1-abc”, > >>>>>>>>> “owner”: “alice”, > >>>>>>>>> “active”: true > >>>>>>>>> } > >>>>>>>>> > >>>>>>>>> and > >>>>>>>>> > >>>>>>>>> { > >>>>>>>>> “_id”: “foo”, > >>>>>>>>> “_rev”: “1-def”, > >>>>>>>>> “owner”: “bob”, > >>>>>>>>> “active”: true > >>>>>>>>> } > >>>>>>>>> > >>>>>>>>> We could represent these using the following set of KVs: > >>>>>>>>> > >>>>>>>>> (“foo”, “active”) = true > >>>>>>>>> (“foo”, “owner”) = kCONFLICT > >>>>>>>>> (“foo”, “owner”, “1-abc”) = “alice” > >>>>>>>>> (“foo”, “owner”, “1-def”) = “bob” > >>>>>>>>> > >>>>>>>>> This approach also extends to conflicts where the two versions have > >>>>>> different data types. Consider a more complicated example where bob > >>>>>> dropped > >>>>>> the “active” field and changed the “owner” field to an object: > >>>>>>>>> > >>>>>>>>> { > >>>>>>>>> “_id”: “foo”, > >>>>>>>>> “_rev”: “1-def”, > >>>>>>>>> “owner”: { > >>>>>>>>> “name”: “bob”, > >>>>>>>>> “email”: “b...@example.com" > >>>>>>>>> } > >>>>>>>>> } > >>>>>>>>> > >>>>>>>>> Now the set of KVs for “foo” looks like this (note that a missing > >>>>>> field needs to be handled explicitly): > >>>>>>>>> > >>>>>>>>> (“foo”, “active”) = kCONFLICT > >>>>>>>>> (“foo”, “active”, “1-abc”) = true > >>>>>>>>> (“foo”, “active”, “1-def”) = kMISSING > >>>>>>>>> (“foo”, “owner”) = kCONFLICT > >>>>>>>>> (“foo”, “owner”, “1-abc”) = “alice” > >>>>>>>>> (“foo”, “owner”, “1-def”, “name”) = “bob” > >>>>>>>>> (“foo”, “owner”, “1-def”, “email”) = “b...@example.com” > >>>>>>>>> > >>>>>>>>> I like this approach for the common case where documents share most > >>>>>>>>> of > >>>>>> their data in common but have a conflict in a very specific field or > >>>>>> set of > >>>>>> fields. 
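The flattening Adam walks through above can be sketched for the simple flat-field case; `kCONFLICT`/`kMISSING` are stand-ins for whatever enum values the real encoding would use, and the function name is made up:

```python
# Sketch of the kCONFLICT idea: flatten each revision to (field -> value),
# emit one shared KV where every revision agrees, and mark divergent fields
# with a kCONFLICT sentinel plus one child KV per revision. Restricted to
# flat scalar fields for brevity; nested paths would recurse.
kCONFLICT, kMISSING = "kCONFLICT", "kMISSING"

def merge_revisions(doc_id, revs):
    """revs: {rev_id: {field: scalar_value}}"""
    kvs = {}
    fields = set().union(*(r.keys() for r in revs.values()))
    for f in sorted(fields):
        values = {rid: r.get(f, kMISSING) for rid, r in revs.items()}
        if len(set(values.values())) == 1:
            kvs[(doc_id, f)] = next(iter(values.values()))
        else:
            kvs[(doc_id, f)] = kCONFLICT
            for rid, v in values.items():
                kvs[(doc_id, f, rid)] = v
    return kvs
```

Running this on the `1-abc`/`1-def` example above reproduces the KV set Adam lists, including the explicit `kMISSING` marker when one revision drops a field.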
> >>>>>>>>> > >>>>>>>>> I’ve encountered one important downside, though: an edit that > >>>>>> replicates in and conflicts with the entire document can cause a bit > >>>>>> of a > >>>>>> data explosion. Consider a case where I have 10 conflicting versions > >>>>>> of a > >>>>>> 100KB document, but the conflicts are all related to a single scalar > >>>>>> value. > >>>>>> Now I replicate in an empty document, and suddenly I have a kCONFLICT > >>>>>> at > >>>>>> the root. In this model I now need to list out every path of every one > >>>>>> of > >>>>>> the 10 existing revisions and I end up with a 1MB update. Yuck. That’s > >>>>>> technically no worse in the end state than the “zero sharing” case > >>>>>> above, > >>>>>> but one could easily imagine overrunning the transaction size limit > >>>>>> this > >>>>>> way. > >>>>>>>>> > >>>>>>>>> I suspect there’s a smart path out of this. Maybe the system > >>>>>>>>> detects a > >>>>>> “default” value for each field and uses that instead of writing out the > >>>>>> value for every revision in a conflicted subtree. Worth some > >>>>>> discussion. > >>>>>>>>> > >>>>>>>>> ## Revision Trees > >>>>>>>>> > >>>>>>>>> In CouchDB we currently represent revisions as a hash history tree; > >>>>>> each revision identifier is derived from the content of the revision > >>>>>> including the revision identifier of its parent. Individual edit > >>>>>> branches > >>>>>> are bounded in *length* (I believe the default is 1000 entries), but > >>>>>> the > >>>>>> number of edit branches is technically unbounded. > >>>>>>>>> > >>>>>>>>> The size limits in FoundationDB preclude us from storing the entire > >>>>>> key tree as a single value; in pathological situations the tree could > >>>>>> exceed 100KB. Rather, I think it would make sense to store each edit > >>>>>> *branch* as a separate KV. 
We stem the branch long before it hits the > >>>>>> value > >>>>>> size limit, and in the happy case of no edit conflicts this means we > >>>>>> store > >>>>>> the edit history metadata in a single KV. It also means that we can > >>>>>> apply > >>>>>> an interactive edit without retrieving the entire conflicted revision > >>>>>> tree; > >>>>>> we need only retrieve and modify the single branch against which the > >>>>>> edit > >>>>>> is being applied. The downside is that we duplicate historical revision > >>>>>> identifiers shared by multiple edit branches, but I think this is a > >>>>>> worthwhile tradeoff. > >>>>>>>>> > >>>>>>>>> I would furthermore try to structure the keys so that it is possible > >>>>>> to retrieve the “winning” revision in a single limit=1 range query. > >>>>>> Ideally > >>>>>> I’d like to provide the following properties: > >>>>>>>>> > >>>>>>>>> 1) a document read does not need to retrieve the revision tree at > >>>>>>>>> all, > >>>>>> just the winning revision identifier (which would be stored with the > >>>>>> rest > >>>>>> of the doc) > >>>>>>>>> 2) a document update only needs to read the edit branch of the > >>>>>> revision tree against which the update is being applied, and it can > >>>>>> read > >>>>>> that branch immediately knowing only the content of the edit that is > >>>>>> being > >>>>>> attempted (i.e., it does not need to read the current version of the > >>>>>> document itself). > >>>>>>>>> > >>>>>>>>> So, I’d propose a separate subspace (maybe “_meta”?) 
for the > >>>>>>>>> revision > >>>>>> trees, with keys and values that look like > >>>>>>>>> > >>>>>>>>> (“_meta”, DocID, IsDeleted, RevPosition, RevHash) = [ParentRev, > >>>>>> GrandparentRev, …] > >>>>>>>>> > >>>>>>>>> The inclusion of IsDeleted, RevPosition and RevHash in the key > >>>>>>>>> should > >>>>>> be sufficient (with the right encoding) to create a range query that > >>>>>> automatically selects the “winner” according to CouchDB’s arcane rules, > >>>>>> which are something like > >>>>>>>>> > >>>>>>>>> 1) deleted=false beats deleted=true > >>>>>>>>> 2) longer paths (i.e. higher RevPosition) beat shorter ones > >>>>>>>>> 3) RevHashes with larger binary values beat ones with smaller values > >>>>>>>>> > >>>>>>>>> =========== > >>>>>>>>> > >>>>>>>>> OK, that’s all on this topic from me for now. I think this is a > >>>>>> particularly exciting area where we start to see the dividends of > >>>>>> splitting > >>>>>> up data into multiple KV pairs in FoundationDB :) Cheers, > >>>>>>>>> > >>>>>>>>> Adam > >>>>>>>>> > >>>>>>>>> > >>>>>>>>>> On Feb 4, 2019, at 2:41 PM, Robert Newson <rnew...@apache.org> > >>>>>>>>>> wrote: > >>>>>>>>>> > >>>>>>>>>> This one is quite tightly coupled to the other thread on data > >>>>>>>>>> model, > >>>>>> should we start much conversation here before that one gets closer to a > >>>>>> solution? > >>>>>>>>>> > >>>>>>>>>> -- > >>>>>>>>>> Robert Samuel Newson > >>>>>>>>>> rnew...@apache.org > >>>>>>>>>> > >>>>>>>>>> On Mon, 4 Feb 2019, at 19:25, Ilya Khlopotov wrote: > >>>>>>>>>>> This is a beginning of a discussion thread about storage of edit > >>>>>>>>>>> conflicts and everything which relates to revisions. > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>> > >>>>>> > >>>>>> -- > >>>>>> Professional Support for Apache CouchDB: > >>>>>> https://neighbourhood.ie/couchdb-support/ > >>>>>> > >>>>>> > >>>>> > >> >
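As a postscript, the "arcane rules" Adam enumerates above reduce to a plain tuple sort key over the branch leaves, which is what makes the limit=1 range-query trick possible; the `(IsDeleted, RevPosition, RevHash)` tuple shape here is an assumption for illustration:

```python
# CouchDB's winner rules as a sort key: deleted=false beats deleted=true,
# then higher RevPosition wins, then the larger RevHash bytes win. With
# this key the winner is simply max() -- or, equivalently, the last key in
# an order-preserving encoding.

def winner_sort_key(branch):
    is_deleted, rev_pos, rev_hash = branch
    return (0 if is_deleted else 1, rev_pos, rev_hash)

def pick_winner(branches):
    return max(branches, key=winner_sort_key)
```

Embedding these three components in the key (with a suitable byte encoding) is exactly what lets a limit=1 range read return the winner without deserializing anything.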