Hi Adam,

> Did you figure out how to avoid reading the entire merged set of json_paths
> when you want to write a new revision? I can’t see how to define pointers to
> existing KVs (which seems to be where you’re looking to gain storage
> efficiencies) without doing that full read.
No. This is still a problem; the full read is required at the moment.

best regards,
iilyak

On 2019/02/06 16:27:44, Adam Kocoloski <kocol...@apache.org> wrote:
> Hi Ilya,
>
> > While reading Adam's proposal I came to realize that: we don't have to
> > calculate the winning revision at read time.
> > Since FDB's transactions are atomic we can calculate it when we write.
> > This means we can just write the latest values into a separate range.
> > This makes lookup of the latest version fast.
>
> Yep, that can work. It’d be good if we can avoid extra storage overhead in
> the common case where a document only has a single edit branch, but I do
> like preserving fast read performance for the “winning” revision in all
> cases.
>
> > The `rev -> vsn` indirection is not strictly required. It is added
> > because it saves some space since `rev` is a long string and `vsn` is an
> > FDB versionstamp of fixed size.
>
> Yep, a document _rev is a number followed by an MD5 hash, so 20 bytes if
> you use an int32 for the first part. A versionstamp is 10 bytes, so it
> does save a bit of space for any document with more than a couple of
> fields.
>
> I didn’t quite grok some parts of the pseudocode you have here. Did you
> figure out how to avoid reading the entire merged set of json_paths when
> you want to write a new revision? I can’t see how to define pointers to
> existing KVs (which seems to be where you’re looking to gain storage
> efficiencies) without doing that full read.
>
> Adam
>
> > On Feb 5, 2019, at 9:18 PM, Ilya Khlopotov <iil...@apache.org> wrote:
> >
> > While reading Adam's proposal I came to realize that: we don't have to
> > calculate the winning revision at read time.
> > Since FDB's transactions are atomic we can calculate it when we write.
> > This means we can just write the latest values into a separate range.
> > This makes lookup of the latest version fast.
> > Another realization is that if we want to share values for some json
> > paths we would have to introduce a level of indirection.
> > Below is the data model inspired by Adam's idea to share json_paths. In
> > this model the json_path is stored in the revision where it was first
> > added (we call that revision the owner of the json_path). The values for
> > a json_path key can be scalar values, parts of scalar values, or
> > pointers to the owner's location.
> > The snippets below are sketches of transactions. The transactions will
> > include updates to other keys as needed (`external_size`, `by_seq` and
> > so on). The revision tree management is not covered yet.
> > The `rev -> vsn` indirection is not strictly required. It is added
> > because it saves some space since `rev` is a long string and `vsn` is an
> > FDB versionstamp of fixed size.
> >
> > - `{NS} / {docid} / _by_rev / {rev} = vsn`
> > - `{NS} / {docid} / _used_by / {json_path} / {another_vsn} = NIL`
> > - `{NS} / {docid} / _data / {json_path} = latest_value | part`
> > - `{NS} / {docid} / {vsn} / _data / {json_path} = value | part | {another_vsn}`
> >
> > ```
> > write(txn, doc_id, prev_rev, json):
> >     txn.add_write_conflict_key("{NS} / {doc_id} / _rev")
> >     rev = generate_new_rev()
> >     # vsn is the versionstamp assigned to this transaction
> >     txn["{NS} / {docid} / _by_rev / {rev}"] = vsn
> >     for every json_path in flattened json:
> >         txn["{NS} / {docid} / _used_by / {json_path} / {another_vsn}"] = NIL
> >         if rev is HEAD:
> >             # this range contains values for all json paths for the
> >             # latest revision (read optimization)
> >             txn["{NS} / {docid} / _data / {json_path}"] = latest_value | part
> >         txn["{NS} / {docid} / {vsn} / _data / {json_path}"] = value | part | {another_vsn}
> >     txn["{NS} / {doc_id} / _rev"] = rev
> >
> > get_current(txn, doc_id):
> >     # there is no sharing of json_paths in this range (read optimization)
> >     txn.get_range("{NS} / {docid} / _data / 0x00",
> >                   "{NS} / {docid} / _data / 0xFF")
> >
> > get_revision(txn, doc_id, rev):
> >     vsn = txn["{NS} / {docid} / _by_rev / {rev}"]
> >     json_paths = txn.get_range("{NS} / {docid} / {vsn} / _data / 0x00",
> >                                "{NS} / {docid} / {vsn} / _data / 0xFF")
> >     for every json_path in json_paths:
> >         if value has type vsn:
> >             another_vsn = value
> >             value = txn["{NS} / {docid} / {another_vsn} / _data / {json_path}"]
> >         result[json_path] = value
> >
> > delete_revision(txn, doc_id, rev):
> >     vsn = txn["{NS} / {docid} / _by_rev / {rev}"]
> >     json_paths = txn.get_range("{NS} / {docid} / {vsn} / _data / 0x00",
> >                                "{NS} / {docid} / {vsn} / _data / 0xFF")
> >     for every json_path in json_paths:
> >         if value has type vsn:
> >             # remove the reference to the deleted revision from the owner
> >             del txn["{NS} / {docid} / _used_by / {json_path} / {vsn}"]
> >         # check that the deleted revision of json_path is not used by
> >         # anything else
> >         if txn.get_range("{NS} / {docid} / _used_by / {json_path} / 0x00",
> >                          "{NS} / {docid} / _used_by / {json_path} / 0xFF",
> >                          limit=1) == []:
> >             del txn["{NS} / {docid} / {vsn} / _data / {json_path}"]
> >     if vsn is HEAD:
> >         copy range for the winning revision into
> >             "{NS} / {docid} / _data / {json_path}"
> > ```
> >
> > best regards,
> > iilyak
> >
> > On 2019/02/04 23:22:09, Adam Kocoloski <kocol...@apache.org> wrote:
> >> I think it’s fine to start a focused discussion here as it might help
> >> inform some of the broader debate over in that thread.
> >>
> >> As a reminder, today CouchDB writes the entire body of each document
> >> revision on disk as a separate blob. Edit conflicts that have common
> >> fields between them do not share any storage on disk. The revision
> >> tree is encoded into a compact format and a copy of it is stored
> >> directly in both the by_id tree and the by_seq tree. Each leaf entry
> >> in the revision tree contains a pointer to the position of the
> >> associated doc revision on disk.
> >>
> >> As a further reminder, CouchDB 2.x clusters can generate edit
> >> conflict revisions just from multiple clients concurrently updating
> >> the same document in a single cluster.
> >> This won’t happen when FoundationDB is running under the hood, but
> >> users who deploy multiple CouchDB or PouchDB servers and replicate
> >> between them can of course still produce conflicts just like they
> >> could in CouchDB 1.x, so we need a solution.
> >>
> >> Let’s consider the two sub-topics separately: 1) storage of edit
> >> conflict bodies and 2) revision trees.
> >>
> >> ## Edit Conflict Storage
> >>
> >> The simplest possible solution would be to store each document
> >> revision separately, like we do today. We could store document bodies
> >> with (“docid”, “revid”) as the key prefix, and each transaction could
> >> clear the key range associated with the base revision against which
> >> the edit is being attempted. This would work, but I think we can try
> >> to be a bit more clever and save on storage space given that we’re
> >> splitting JSON documents into multiple KV pairs.
> >>
> >> One thought I’d had is to introduce a special enum Value which
> >> indicates that the subtree “beneath” the given Key is in conflict.
> >> For example, consider the documents
> >>
> >> {
> >>   “_id”: “foo”,
> >>   “_rev”: “1-abc”,
> >>   “owner”: “alice”,
> >>   “active”: true
> >> }
> >>
> >> and
> >>
> >> {
> >>   “_id”: “foo”,
> >>   “_rev”: “1-def”,
> >>   “owner”: “bob”,
> >>   “active”: true
> >> }
> >>
> >> We could represent these using the following set of KVs:
> >>
> >> (“foo”, “active”) = true
> >> (“foo”, “owner”) = kCONFLICT
> >> (“foo”, “owner”, “1-abc”) = “alice”
> >> (“foo”, “owner”, “1-def”) = “bob”
> >>
> >> This approach also extends to conflicts where the two versions have
> >> different data types.
> >> Consider a more complicated example where bob dropped the “active”
> >> field and changed the “owner” field to an object:
> >>
> >> {
> >>   “_id”: “foo”,
> >>   “_rev”: “1-def”,
> >>   “owner”: {
> >>     “name”: “bob”,
> >>     “email”: “b...@example.com”
> >>   }
> >> }
> >>
> >> Now the set of KVs for “foo” looks like this (note that a missing
> >> field needs to be handled explicitly):
> >>
> >> (“foo”, “active”) = kCONFLICT
> >> (“foo”, “active”, “1-abc”) = true
> >> (“foo”, “active”, “1-def”) = kMISSING
> >> (“foo”, “owner”) = kCONFLICT
> >> (“foo”, “owner”, “1-abc”) = “alice”
> >> (“foo”, “owner”, “1-def”, “name”) = “bob”
> >> (“foo”, “owner”, “1-def”, “email”) = “b...@example.com”
> >>
> >> I like this approach for the common case where documents share most
> >> of their data but have a conflict in a very specific field or set of
> >> fields.
> >>
> >> I’ve encountered one important downside, though: an edit that
> >> replicates in and conflicts with the entire document can cause a bit
> >> of a data explosion. Consider a case where I have 10 conflicting
> >> versions of a 100KB document, but the conflicts are all related to a
> >> single scalar value. Now I replicate in an empty document, and
> >> suddenly I have a kCONFLICT at the root. In this model I now need to
> >> list out every path of every one of the 10 existing revisions, and I
> >> end up with a 1MB update. Yuck. That’s technically no worse in the
> >> end state than the “zero sharing” case above, but one could easily
> >> imagine overrunning the transaction size limit this way.
> >>
> >> I suspect there’s a smart path out of this. Maybe the system detects
> >> a “default” value for each field and uses that instead of writing out
> >> the value for every revision in a conflicted subtree. Worth some
> >> discussion.
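The kCONFLICT encoding described above can be sketched in a few lines of Python. This is an illustrative sketch only, not CouchDB code: tuple keys in a plain dict stand in for FoundationDB KVs (the docid prefix is omitted), the `KCONFLICT`/`KMISSING` sentinels and function names are assumptions, metadata fields such as `_id` and `_rev` are assumed to be stripped beforehand, and conflicts are detected path-by-path, which simplifies the nested-object case relative to the layout shown in the quoted example.

```python
# Sketch of the kCONFLICT idea under the assumptions stated above.
# A flattened document maps tuple paths to scalar values; paths whose
# values differ across revisions get a sentinel marker, and the
# per-revision values are stored one level deeper in the key.

KCONFLICT = "kCONFLICT"  # hypothetical enum value marking a conflicted subtree
KMISSING = "kMISSING"    # hypothetical marker for a field absent in a revision

def flatten(doc, prefix=()):
    """Flatten nested JSON into (path_tuple, scalar_value) pairs."""
    if isinstance(doc, dict):
        for key, value in doc.items():
            yield from flatten(value, prefix + (key,))
    else:
        yield prefix, doc

def encode_conflicts(revisions):
    """revisions: {rev_id: doc}. Returns one merged KV map for all revisions."""
    flat = {rev: dict(flatten(doc)) for rev, doc in revisions.items()}
    all_paths = set().union(*flat.values())
    kvs = {}
    for path in sorted(all_paths):
        per_rev = {rev: f.get(path, KMISSING) for rev, f in flat.items()}
        if len(set(per_rev.values())) == 1:
            # all revisions agree: store the value once, shared
            kvs[path] = next(iter(per_rev.values()))
        else:
            # mark the path as conflicted and fan out per revision
            kvs[path] = KCONFLICT
            for rev, value in per_rev.items():
                kvs[path + (rev,)] = value
    return kvs
```

Running this on the `1-abc`/`1-def` documents from the first example produces one shared row for `active` plus a `kCONFLICT` marker and two per-revision rows under `owner`, matching the four KVs listed in the quoted message.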
> >> > >> ## Revision Trees > >> > >> In CouchDB we currently represent revisions as a hash history tree; each > >> revision identifier is derived from the content of the revision including > >> the revision identifier of its parent. Individual edit branches are > >> bounded in *length* (I believe the default is 1000 entries), but the > >> number of edit branches is technically unbounded. > >> > >> The size limits in FoundationDB preclude us from storing the entire key > >> tree as a single value; in pathological situations the tree could exceed > >> 100KB. Rather, I think it would make sense to store each edit *branch* as > >> a separate KV. We stem the branch long before it hits the value size > >> limit, and in the happy case of no edit conflicts this means we store the > >> edit history metadata in a single KV. It also means that we can apply an > >> interactive edit without retrieving the entire conflicted revision tree; > >> we need only retrieve and modify the single branch against which the edit > >> is being applied. The downside is that we duplicate historical revision > >> identifiers shared by multiple edit branches, but I think this is a > >> worthwhile tradeoff. > >> > >> I would furthermore try to structure the keys so that it is possible to > >> retrieve the “winning” revision in a single limit=1 range query. Ideally > >> I’d like to proide the following properties: > >> > >> 1) a document read does not need to retrieve the revision tree at all, > >> just the winning revision identifier (which would be stored with the rest > >> of the doc) > >> 2) a document update only needs to read the edit branch of the revision > >> tree against which the update is being applied, and it can read that > >> branch immediately knowing only the content of the edit that is being > >> attempted (i.e., it does not need to read the current version of the > >> document itself). > >> > >> So, I’d propose a separate subspace (maybe “_meta”?) 
for the revision > >> trees, with keys and values that look like > >> > >> (“_meta”, DocID, IsDeleted, RevPosition, RevHash) = [ParentRev, > >> GrandparentRev, …] > >> > >> The inclusion of IsDeleted, RevPosition and RevHash in the key should be > >> sufficient (with the right encoding) to create a range query that > >> automatically selects the “winner” according to CouchDB’s arcane rules, > >> which are something like > >> > >> 1) deleted=false beats deleted=true > >> 2) longer paths (i.e. higher RevPosition) beat shorter ones > >> 3) RevHashes with larger binary values beat ones with smaller values > >> > >> =========== > >> > >> OK, that’s all on this topic from me for now. I think this is a > >> particularly exciting area where we start to see the dividends of > >> splitting up data into multiple KV pairs in FoundationDB :) Cheers, > >> > >> Adam > >> > >> > >>> On Feb 4, 2019, at 2:41 PM, Robert Newson <rnew...@apache.org> wrote: > >>> > >>> This one is quite tightly coupled to the other thread on data model, > >>> should we start much conversation here before that one gets closer to a > >>> solution? > >>> > >>> -- > >>> Robert Samuel Newson > >>> rnew...@apache.org > >>> > >>> On Mon, 4 Feb 2019, at 19:25, Ilya Khlopotov wrote: > >>>> This is a beginning of a discussion thread about storage of edit > >>>> conflicts and everything which relates to revisions. > >>>> > >>>> > >> > >> > >
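Ilya's owner/`_used_by` indirection from earlier in the thread can be mimicked with a plain Python dict standing in for FoundationDB: a `json_path`'s value lives in the revision that first wrote it (the owner), later revisions store a pointer to that copy, and a `_used_by` entry records the reference. The key shapes, function names, and the `"ptr"` tag are illustrative assumptions, not actual CouchDB code.

```python
# Minimal sketch of value sharing between revisions, assuming the key
# layout from the thread: a dict keyed by tuples plays the role of the
# FoundationDB keyspace.

store = {}

def write_path(docid, vsn, json_path, value=None, owner_vsn=None):
    """Store a value directly, or a pointer to the owning revision's copy."""
    if owner_vsn is None:
        # this revision owns the value
        store[(docid, vsn, "_data", json_path)] = value
    else:
        # point at the owner's copy instead of duplicating the value
        store[(docid, vsn, "_data", json_path)] = ("ptr", owner_vsn)
        # back-reference: keeps the owner's KV alive while referenced
        store[(docid, "_used_by", json_path, owner_vsn, vsn)] = None

def read_path(docid, vsn, json_path):
    """Read a json_path's value, following one level of indirection."""
    value = store[(docid, vsn, "_data", json_path)]
    if isinstance(value, tuple) and value[0] == "ptr":
        value = store[(docid, value[1], "_data", json_path)]
    return value
```

Writing `"alice"` once at version `v1` and then pointing `v2` at it means both revisions read back the same value while it is stored only once; this is the storage saving the thread discusses, and also why the full read problem arises, since the writer must know which paths already have owners.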
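Adam's winner-selection rules can be captured as a plain tuple comparison; with a suitable key encoding, the same ordering would let FoundationDB return the winner as the first result of a reverse, limit=1 range read over the document's leaf entries. This is a sketch under assumed names, not the actual CouchDB encoding.

```python
# Rules from the thread: deleted=false beats deleted=true, then a higher
# RevPosition wins, then the larger binary RevHash wins. Mapping a leaf
# to (not is_deleted, rev_pos, rev_hash) makes Python's tuple ordering
# agree with those rules, so the winner is simply the maximum element.

def winner_sort_key(leaf):
    is_deleted, rev_pos, rev_hash = leaf
    return (not is_deleted, rev_pos, rev_hash)

def pick_winner(leaves):
    """leaves: iterable of (is_deleted, rev_pos, rev_hash) leaf entries."""
    return max(leaves, key=winner_sort_key)
```

For example, a live revision at position 3 beats a deleted revision at position 5, because the deleted flag is compared before the path length, exactly as rule 1 requires.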