Hi Adam,

> Did you figure out how to avoid reading the entire merged set of json_paths 
> when you want to write a new revision? I can’t see how to define pointers to 
> existing KVs (which seems to be where you’re looking to gain storage 
> efficiencies) without doing that full read.

No. This is still a problem, the full read is required at the moment.

best regards,
iilyak

On 2019/02/06 16:27:44, Adam Kocoloski <kocol...@apache.org> wrote: 
> Hi Ilya,
> 
> > While reading Adam's proposal I came to realize that: we don't have to 
> > calculate winning revision at read time.
> > Since FDB's transactions are atomic we can calculate it when we write. This 
> > means we can just write latest values into separate range. This makes 
> > lookup of latest version fast.
> 
> Yep, that can work. It’d be good if we can avoid extra storage overhead in 
> the common case where a document only has a single edit branch, but I do like 
> preserving fast read performance for the “winning” revision in all cases.
> 
> > The `rev -> vsn` indirection is not strictly required. It is added because 
> > it saves some space since `rev` is a long string and `vsn` is FDB 
> > versionstamp of fixed size.
> 
> Yep, a document _rev is a number followed by an MD5 hash, so 20 bytes if you 
> use an int32 for the first part. A versionstamp is 10 bytes, so it does save 
> a bit of space for any document with more than a couple of fields.
> 
> I didn’t quite grok some parts of the pseudocode you have here. Did you 
> figure out how to avoid reading the entire merged set of json_paths when you 
> want to write a new revision? I can’t see how to define pointers to existing 
> KVs (which seems to be where you’re looking to gain storage efficiencies) 
> without doing that full read.
> 
> Adam
> 
> > On Feb 5, 2019, at 9:18 PM, Ilya Khlopotov <iil...@apache.org> wrote:
> > 
> > While reading Adam's proposal I came to realize that: we don't have to 
> > calculate winning revision at read time.
> > Since FDB's transactions are atomic we can calculate it when we write. This 
> > means we can just write latest values into separate range. This makes 
> > lookup of latest version fast.
> > Another realization is if we want to share values for some json paths we 
> > would have to introduce a level of indirection. 
> > Bellow is the data model inspired by Adam's idea to share json_paths. In 
> > this model the json_path is stored in the revision where it was first added 
> > (we call that revision an owner of a json_path). The values for json_path 
> > key can be scalar values, parts of scalar values or pointers to owner 
> > location.
> > The below snippets are sketches of transactions. 
> > The transactions will include updates to other keys as needed 
> > (`external_size`, `by_seq` and so on).  The revision tree management is not 
> > covered yet.
> > The `rev -> vsn` indirection is not strictly required. It is added because 
> > it saves some space since `rev` is a long string and `vsn` is FDB 
> > versionstamp of fixed size.
> > 
> > - `{NS} / {docid} / _by_rev / {rev} = vsn` 
> > - `{NS} / {docid} / _used_by / {json_path} / {another_vsn} = NIL`
> > - `{NS} / {docid} / _data / {json_path} = latest_value | part`
> > - `{NS} / {docid} / {vsn} / _data / {json_path} = value | part | 
> > {another_vsn}`
> > 
> > ```
> > write(txn, doc_id, prev_rev, json):
> >   txn.add_write_conflict_key("{NS} / {doc_id} / _rev")
> >   rev = generate_new_rev()
> >   txn["{NS} / {docid} / _by_rev / {rev}"] = vsn 
> >   for every json_path in flattened json
> >     - {NS} / {docid} / _used_by / {json_path} / {another_vsn} = NIL
> >     if rev is HEAD:
> >       # this range contains values for all json paths for the latest 
> > revision (read optimization)
> >       - {NS} / {docid} / _data / {json_path} = latest_value | part
> >     - {NS} / {docid} / {vsn} / _data / {json_path} = value | part | 
> > {another_vsn}
> >   txn["{NS} / {doc_id} / _rev"] = rev
> > 
> > get_current(txn, doc_id):
> >  # there is no sharing of json_paths in this range (read optimization)
> >  txn.get_range("{NS} / {docid} / _data / 0x00", "{NS} / {docid} / _data / 
> > 0xFF" )
> > 
> > get_revision(txn, doc_id, rev):
> >  vsn = txn["{NS} / {docid} / _by_rev / {rev}"] 
> >  json_paths = txn.get_range("{NS} / {vsn} / {docid} / _data / 0x00", "{NS} 
> > / {vsn} / {docid} / _data / 0xFF" )
> >  for every json_path in json_paths:
> >    if value has type vsn:
> >       another_vsn = value
> >          value = txn["{NS} / {docid} / {another_vsn} / _data / {json_path}"]
> >    result[json_path] = value
> > 
> > delete_revision(txn, doc_id, rev):
> >  vsn = txn["{NS} / {docid} / _by_rev / {rev}"] 
> >  json_paths = txn.get_range("{NS} / {vsn} / {docid} / _data / 0x00", "{NS} 
> > / {vsn} / {docid} / _data / 0xFF" )
> >  for every json_path in json_paths:
> >    if value has type vsn:
> >      # remove reference to deleted revision from the owner
> >       del txn[{NS} / {docid} / _used_by / {json_path} / {vsn}]
> >    # check if deleted revision of json_path is not used by anything else
> >    if txn.get_range("{NS} / {docid} / _used_by / {json_path} / {vsn}", 
> > limit=1) == []:
> >       del txn["{NS} / {docid} / {vsn} / _data / {json_path}"]
> >    if vsn is HEAD:
> >       copy range for winning revision into "{NS} / {docid} / _data / 
> > {json_path}"
> > ```
> > 
> > best regards,
> > iilyak
> > 
> > On 2019/02/04 23:22:09, Adam Kocoloski <kocol...@apache.org> wrote: 
> >> I think it’s fine to start a focused discussion here as it might help 
> >> inform some of the broader debate over in that thread.
> >> 
> >> As a reminder, today CouchDB writes the entire body of each document 
> >> revision on disk as a separate blob. Edit conflicts that have common 
> >> fields between them do not share any storage on disk. The revision tree is 
> >> encoded into a compact format and a copy of it is stored directly in both 
> >> the by_id tree and the by_seq tree. Each leaf entry in the revision tree 
> >> contain a pointer to the position of the associated doc revision on disk.
> >> 
> >> As a further reminder, CouchDB 2.x clusters can generate edit conflict 
> >> revisions just from multiple clients concurrently updating the same 
> >> document in a single cluster. This won’t happen when FoundationDB is 
> >> running under the hood, but users who deploy multiple CouchDB or PouchDB 
> >> servers and replicate between them can of course still produce conflicts 
> >> just like they could in CouchDB 1.x, so we need a solution.
> >> 
> >> Let’s consider the two sub-topics separately: 1) storage of edit conflict 
> >> bodies and 2) revision trees
> >> 
> >> ## Edit Conflict Storage
> >> 
> >> The simplest possible solution would be to store each document revision 
> >> separately, like we do today. We could store document bodies with 
> >> (“docid”, “revid”) as the key prefix, and each transaction could clear the 
> >> key range associated with the base revision against which the edit is 
> >> being attempted. This would work, but I think we can try to be a bit more 
> >> clever and save on storage space given that we’re splitting JSON documents 
> >> into multiple KV pairs.
> >> 
> >> One thought I’d had is to introduce a special enum Value which indicates 
> >> that the subtree “beneath” the given Key is in conflict. For example, 
> >> consider the documents
> >> 
> >> {
> >>    “_id”: “foo”,
> >>    “_rev”: “1-abc”,
> >>    “owner”: “alice”,
> >>    “active”: true
> >> }
> >> 
> >> and 
> >> 
> >> {
> >>    “_id”: “foo”,
> >>    “_rev”: “1-def”,
> >>    “owner”: “bob”,
> >>    “active”: true
> >> }
> >> 
> >> We could represent these using the following set of KVs:
> >> 
> >> (“foo”, “active”) = true
> >> (“foo”, “owner”) = kCONFLICT
> >> (“foo”, “owner”, “1-abc”) = “alice”
> >> (“foo”, “owner”, “1-def”) = “bob”
> >> 
> >> This approach also extends to conflicts where the two versions have 
> >> different data types. Consider a more complicated example where bob 
> >> dropped the “active” field and changed the “owner” field to an object:
> >> 
> >> {
> >>  “_id”: “foo”,
> >>  “_rev”: “1-def”,
> >>  “owner”: {
> >>    “name”: “bob”,
> >>    “email”: “b...@example.com"
> >>  }
> >> }
> >> 
> >> Now the set of KVs for “foo” looks like this (note that a missing field 
> >> needs to be handled explicitly):
> >> 
> >> (“foo”, “active”) = kCONFLICT
> >> (“foo”, “active”, “1-abc”) = true
> >> (“foo”, “active”, “1-def”) = kMISSING
> >> (“foo”, “owner”) = kCONFLICT
> >> (“foo”, “owner”, “1-abc”) = “alice”
> >> (“foo”, “owner”, “1-def”, “name”) = “bob”
> >> (“foo”, “owner”, “1-def”, “email”) = “b...@example.com”
> >> 
> >> I like this approach for the common case where documents share most of 
> >> their data in common but have a conflict in a very specific field or set 
> >> of fields. 
> >> 
> >> I’ve encountered one important downside, though: an edit that replicates 
> >> in and conflicts with the entire document can cause a bit of a data 
> >> explosion. Consider a case where I have 10 conflicting versions of a 100KB 
> >> document, but the conflicts are all related to a single scalar value. Now 
> >> I replicate in an empty document, and suddenly I have a kCONFLICT at the 
> >> root. In this model I now need to list out every path of every one of the 
> >> 10 existing revisions and I end up with a 1MB update. Yuck. That’s 
> >> technically no worse in the end state than the “zero sharing” case above, 
> >> but one could easily imagine overrunning the transaction size limit this 
> >> way.
> >> 
> >> I suspect there’s a smart path out of this. Maybe the system detects a 
> >> “default” value for each field and uses that instead of writing out the 
> >> value for every revision in a conflicted subtree. Worth some discussion.
> >> 
> >> ## Revision Trees
> >> 
> >> In CouchDB we currently represent revisions as a hash history tree; each 
> >> revision identifier is derived from the content of the revision including 
> >> the revision identifier of its parent. Individual edit branches are 
> >> bounded in *length* (I believe the default is 1000 entries), but the 
> >> number of edit branches is technically unbounded.
> >> 
> >> The size limits in FoundationDB preclude us from storing the entire key 
> >> tree as a single value; in pathological situations the tree could exceed 
> >> 100KB. Rather, I think it would make sense to store each edit *branch* as 
> >> a separate KV. We stem the branch long before it hits the value size 
> >> limit, and in the happy case of no edit conflicts this means we store the 
> >> edit history metadata in a single KV. It also means that we can apply an 
> >> interactive edit without retrieving the entire conflicted revision tree; 
> >> we need only retrieve and modify the single branch against which the edit 
> >> is being applied. The downside is that we duplicate historical revision 
> >> identifiers shared by multiple edit branches, but I think this is a 
> >> worthwhile tradeoff.
> >> 
> >> I would furthermore try to structure the keys so that it is possible to 
> >> retrieve the “winning” revision in a single limit=1 range query. Ideally 
> >> I’d like to proide the following properties:
> >> 
> >> 1) a document read does not need to retrieve the revision tree at all, 
> >> just the winning revision identifier (which would be stored with the rest 
> >> of the doc)
> >> 2) a document update only needs to read the edit branch of the revision 
> >> tree against which the update is being applied, and it can read that 
> >> branch immediately knowing only the content of the edit that is being 
> >> attempted (i.e., it does not need to read the current version of the 
> >> document itself).
> >> 
> >> So, I’d propose a separate subspace (maybe “_meta”?) for the revision 
> >> trees, with keys and values that look like
> >> 
> >> (“_meta”, DocID, IsDeleted, RevPosition, RevHash) = [ParentRev, 
> >> GrandparentRev, …]
> >> 
> >> The inclusion of IsDeleted, RevPosition and RevHash in the key should be 
> >> sufficient (with the right encoding) to create a range query that 
> >> automatically selects the “winner” according to CouchDB’s arcane rules, 
> >> which are something like
> >> 
> >> 1) deleted=false beats deleted=true
> >> 2) longer paths (i.e. higher RevPosition) beat shorter ones
> >> 3) RevHashes with larger binary values beat ones with smaller values
> >> 
> >> ===========
> >> 
> >> OK, that’s all on this topic from me for now. I think this is a 
> >> particularly exciting area where we start to see the dividends of 
> >> splitting up data into multiple KV pairs in FoundationDB :) Cheers,
> >> 
> >> Adam
> >> 
> >> 
> >>> On Feb 4, 2019, at 2:41 PM, Robert Newson <rnew...@apache.org> wrote:
> >>> 
> >>> This one is quite tightly coupled to the other thread on data model, 
> >>> should we start much conversation here before that one gets closer to a 
> >>> solution?
> >>> 
> >>> -- 
> >>> Robert Samuel Newson
> >>> rnew...@apache.org
> >>> 
> >>> On Mon, 4 Feb 2019, at 19:25, Ilya Khlopotov wrote:
> >>>> This is a beginning of a discussion thread about storage of edit 
> >>>> conflicts and everything which relates to revisions.
> >>>> 
> >>>> 
> >> 
> >> 
> 
> 

Reply via email to