Hi Adam,

I don't think any of us mean "simple" at the expense of all other factors, but 
I'm certainly reacting to what seem to be premature embellishments and 
optimizations. I agree with you (of course) that our plans must account for 
foundationdb's architecture and best practices, and this was evident in your 
earliest work and my first read of it. The talk of using SHA-256 to deal with 
the problem of potentially deep JSON documents, the talk instead of 
deduplicating those paths with a level of indirection, and so on. I won't
restate those comments here.

Both of your ideas sound fine to me and I think the right move is to try both, 
given, as you've said, the data model is unaffected.

B.

-- 
  Robert Samuel Newson
  rnew...@apache.org

On Fri, 8 Feb 2019, at 06:45, Garren Smith wrote:
> Hi Adam,
> 
> Thanks for the detailed email. In terms of the data model, that makes a lot
> of sense.
> 
> I’m still playing a bit of catchup on understanding how fdb works, so I
> can’t comment on the best way to retrieve a document.
> 
> From my side, I would like to see our decisions also driven by testing and
> validating that our data model works. I find the way that fdb was tested
> and built really impressive. I would love to see us apply some of that to
> the way we build our CouchDB layer.
> 
> Cheers
> Garren
> 
> On Fri, Feb 8, 2019 at 5:35 AM Adam Kocoloski <kocol...@apache.org> wrote:
> 
> > Bob, Garren, Jan - heard you loud and clear, K.I.S.S. I do think it’s a
> > bit “simplistic” to exclusively choose simplicity over performance and
> > storage density. We’re (re)building a database here, one that has some
> > users with pretty demanding performance and scalability requirements. And
> > yes, we should certainly be testing and measuring. Kyle and team are
> > setting up infrastructure in IBM land to help with that now, but I also
> > believe we can design the data model and architecture with a basic
> > performance model of FoundationDB in mind:
> >
> > - reads cost 1ms
> > - short range reads are the same cost as a single lookup
> > - reads of independent parts of the keyspace can be parallelized for cheap
> > - writes are zero-cost until commit time
> >
> > We ought to be able to use these assumptions to drive some decisions about
> > data models ahead of any end-to-end performance test.
> >
> > If there are specific elements of the edit conflicts management where you
> > think greater simplicity is warranted, let’s get those called out. Ilya
> > noted (correctly, in my opinion) that the term sharing stuff is one of
> > those items. It’s relatively complex, potentially a performance hit, and
> > only saves on storage density in the corner case of lots of edit conflicts.
> > That’s a good one to drop.
> >
> > I’m relatively happy with the revision history data model at this point.
> > Hopefully folks find it easy to grok, and it’s efficient for both reads and
> > writes. It costs some extra storage for conflict revisions compared to the
> > current tree representation (up to 16K per edit branch, with default
> > _revs_limit) but knowing what we know about the performance death spiral
> > for wide revision trees today I’ll happily make a storage vs. performance
> > tradeoff here :)
> >
> > Setting the shared term approach aside, I’ve still been mulling over the
> > key structure for the actual document data:
> >
> > -  I thought about trying to construct a special _conflicts subspace, but
> > I don’t like that approach because the choice of a “winning” revision can
> > flip back and forth very quickly with concurrent writers to different edit
> > branches. I think we really want to have a way for revisions to naturally
> > sort themselves so the winner is the first or last revision in a list.
> >
> > - Assuming we’re using key paths of the form (docid, revision-ish, path,
> > to, field), the goal here is to find an efficient way to get the last key
> > with prefix “docid” (assuming winner sorts last), and then all the keys
> > that share the same (docid, revision-ish) prefix as that one. I see two
> > possible approaches so far, neither perfect:
> >
> > Option 1: Execute a get_key() operation with a key selector that asks for
> > the last key less than “docid\xFF” (again assuming winner sorts last), and
> > then do a get_range_startswith() request setting the streaming mode to
> > “want_all” and the prefix to the docid plus whatever revision-ish we found
> > from the get_key() request. This is two roundtrips instead of one, but it
> > always retrieves exactly the right set of keys, and the second step is
> > executed as fast as possible.
> >
> > Option 2: Jump straight to get_range_startswith() request using only
> > “docid” as the prefix, then cancel the iteration once we reach a revision
> > not equal to the first one we see. We might transfer too much data, or we
> > might end up doing multiple roundtrips if the default “iterator” streaming
> > mode sends too little data to start (I haven’t checked what the default
> > iteration block is there), but in the typical case of zero edit conflicts
> > we have a good chance of retrieving the full document in one roundtrip.
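For illustration, the two lookup strategies can be sketched against an in-memory stand-in for the ordered keyspace. This is not the real fdb API (the actual calls would be `get_key()` with a key selector and `get_range_startswith()`, and roundtrips/streaming modes are not modeled); the helper names and sample keys below are assumptions.

```python
from bisect import bisect_left
from itertools import takewhile

# Simulated ordered keyspace; keys are tuples (docid, revision-ish, field).
# The winner "2-xyz" sorts last among the revisions of "foo".
KEYS = sorted([
    ("foo", "1-abc", "active"),
    ("foo", "1-abc", "owner"),
    ("foo", "2-xyz", "active"),
    ("foo", "2-xyz", "owner"),
    ("zzz", "1-aaa", "x"),
])

def last_key_less_than(key):
    # stand-in for get_key() with a last_less_than key selector
    i = bisect_left(KEYS, key)
    return KEYS[i - 1] if i > 0 else None

def range_with_prefix(prefix):
    # stand-in for get_range_startswith()
    i = bisect_left(KEYS, prefix)
    return [k for k in KEYS[i:] if k[:len(prefix)] == prefix]

def option1(docid):
    # Roundtrip 1: last key below (docid, 0xFF) reveals the winning revision.
    winner = last_key_less_than((docid, "\xff"))[1]
    # Roundtrip 2: want_all-style range read of exactly that revision's keys.
    return range_with_prefix((docid, winner))

def option2(docid):
    # One scan over the docid prefix, iterated in reverse so the winner
    # (which sorts last) comes first; cancel once the revision-ish changes.
    keys = range_with_prefix((docid,))
    winner = keys[-1][1]
    kept = takewhile(lambda k: k[1] == winner, reversed(keys))
    return sorted(kept)
```

Both return the same key set for the winning revision; the difference in practice is purely in roundtrips and bytes transferred.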
> >
> > I don’t have a good sense of which option wins out here from a performance
> > perspective, but they’re both operating on the same data model so easy
> > enough to test the alternatives. The important bit is getting the
> > revision-ish things to sort correctly. I think we can do that by generating
> > something like
> >
> > revision-ish = NotDeleted/1bit : RevPos : RevHash
> >
> > with some suitable order-preserving encoding on the RevPos integer.
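One way such an encoding could look (a sketch only; the one-byte deleted flag and fixed 4-byte big-endian RevPos width are assumptions, not a settled wire format):

```python
def revision_ish(not_deleted: bool, rev_pos: int, rev_hash: bytes) -> bytes:
    """Order-preserving encoding: NotDeleted flag, then big-endian RevPos,
    then RevHash. Bytewise key order then matches the CouchDB winner rules,
    with the winner sorting last."""
    flag = b"\x01" if not_deleted else b"\x00"
    return flag + rev_pos.to_bytes(4, "big") + rev_hash

live_1 = revision_ish(True, 1, b"\xab")
live_2 = revision_ish(True, 2, b"\x01")    # higher RevPos beats live_1
deleted = revision_ish(False, 9, b"\xff")  # deleted sorts before any live rev
```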
> >
> > Apologies for the long email. Happy for any comments, either here or over
> > on IRC. Cheers,
> >
> > Adam
> >
> > > On Feb 7, 2019, at 4:52 PM, Robert Newson <rnew...@apache.org> wrote:
> > >
> > > I think we should choose simple. We can then see if performance is too
> > > low or storage overhead too high, and then see what we can do about it.
> > >
> > > B.
> > >
> > > --
> > >  Robert Samuel Newson
> > >  rnew...@apache.org
> > >
> > > On Thu, 7 Feb 2019, at 20:36, Ilya Khlopotov wrote:
> > >> We cannot do the simple thing if we want to support sharing of JSON
> > >> terms. I think if we want the simplest path we should move sharing out
> > >> of scope. The problem with sharing is that we need to know the location
> > >> of shared terms when we write. This means that we have to read the full
> > >> document on every write. There might be tricks to replace the full
> > >> document read with some sort of hierarchical signature or sketch of a
> > >> document. However, these tricks do not fall into the simplest-solution
> > >> category. We need to choose the design goals:
> > >> - simple
> > >> - performance
> > >> - reduced storage overhead
> > >>
> > >> best regards,
> > >> iilyak
> > >>
> > >> On 2019/02/07 12:45:34, Garren Smith <gar...@apache.org> wrote:
> > >>> I’m also in favor of keeping it really simple and then testing and
> > >>> measuring it.
> > >>>
> > >>> What is the best way to measure that we have something that works? I’m
> > >>> not sure just relying on our current tests will prove that. Should we
> > >>> define and build some more complex scenarios, e.g. docs with lots of
> > >>> conflicts or docs with wide revision trees, and make sure we can solve
> > >>> for those?
> > >>>
> > >>> On Thu, Feb 7, 2019 at 12:33 PM Jan Lehnardt <j...@apache.org> wrote:
> > >>>
> > >>>> I’m also very much in favour of starting with the simplest thing that
> > >>>> can possibly work and doesn’t go against the advertised best practices
> > >>>> of FoundationDB. Let’s get that going and get a feel for how it all
> > >>>> works together, before trying to optimise things we can’t measure yet.
> > >>>>
> > >>>> Best
> > >>>> Jan
> > >>>> —
> > >>>>
> > >>>>> On 6. Feb 2019, at 16:58, Robert Samuel Newson <rnew...@apache.org>
> > >>>> wrote:
> > >>>>>
> > >>>>> Hi,
> > >>>>>
> > >>>>> With the Redwood storage engine under development, and with prefix
> > >>>>> elision part of its design, I don’t think we should get too hung up
> > >>>>> on adding complications and indirections in the key space just yet.
> > >>>>> We haven’t written a line of code or run any tests; this is premature
> > >>>>> optimisation.
> > >>>>>
> > >>>>> I’d like to focus on the simplest solution that yields all required
> > >>>>> properties. We can embellish later (if warranted).
> > >>>>>
> > >>>>> I am intrigued by all the ideas that might allow us cheaper inserts
> > >>>>> and updates than the current code where there are multiple edit
> > >>>>> branches in the stored document.
> > >>>>>
> > >>>>> B.
> > >>>>>
> > >>>>>> On 6 Feb 2019, at 02:18, Ilya Khlopotov <iil...@apache.org> wrote:
> > >>>>>>
> > >>>>>> While reading Adam's proposal I came to realize that we don't have
> > >>>>>> to calculate the winning revision at read time.
> > >>>>>> Since FDB's transactions are atomic, we can calculate it when we
> > >>>>>> write. This means we can write the latest values into a separate
> > >>>>>> range, which makes lookup of the latest version fast.
> > >>>>>> Another realization is that if we want to share values for some json
> > >>>>>> paths we would have to introduce a level of indirection.
> > >>>>>> Below is a data model inspired by Adam's idea to share json_paths.
> > >>>>>> In this model the json_path is stored in the revision where it was
> > >>>>>> first added (we call that revision the owner of the json_path). The
> > >>>>>> values for a json_path key can be scalar values, parts of scalar
> > >>>>>> values, or pointers to the owner location.
> > >>>>>> The snippets below are sketches of transactions. The transactions
> > >>>>>> will include updates to other keys as needed (`external_size`,
> > >>>>>> `by_seq` and so on). Revision tree management is not covered yet.
> > >>>>>> The `rev -> vsn` indirection is not strictly required. It is added
> > >>>>>> because it saves some space, since `rev` is a long string and `vsn`
> > >>>>>> is an FDB versionstamp of fixed size.
> > >>>>>>
> > >>>>>> - `{NS} / {docid} / _by_rev / {rev} = vsn`
> > >>>>>> - `{NS} / {docid} / _used_by / {json_path} / {another_vsn} = NIL`
> > >>>>>> - `{NS} / {docid} / _data / {json_path} = latest_value | part`
> > >>>>>> - `{NS} / {docid} / {vsn} / _data / {json_path} = value | part | {another_vsn}`
> > >>>>>>
> > >>>>>> ```
> > >>>>>> write(txn, doc_id, prev_rev, json):
> > >>>>>>   txn.add_write_conflict_key("{NS} / {docid} / _rev")
> > >>>>>>   rev = generate_new_rev()
> > >>>>>>   vsn = versionstamp_for_this_commit()
> > >>>>>>   txn["{NS} / {docid} / _by_rev / {rev}"] = vsn
> > >>>>>>   for every json_path in flattened json:
> > >>>>>>     - {NS} / {docid} / _used_by / {json_path} / {another_vsn} = NIL
> > >>>>>>     if rev is HEAD:
> > >>>>>>       # this range contains values for all json paths for the
> > >>>>>>       # latest revision (read optimization)
> > >>>>>>       - {NS} / {docid} / _data / {json_path} = latest_value | part
> > >>>>>>     - {NS} / {docid} / {vsn} / _data / {json_path} = value | part | {another_vsn}
> > >>>>>>   txn["{NS} / {docid} / _rev"] = rev
> > >>>>>>
> > >>>>>> get_current(txn, doc_id):
> > >>>>>>   # there is no sharing of json_paths in this range (read optimization)
> > >>>>>>   txn.get_range("{NS} / {docid} / _data / 0x00",
> > >>>>>>                 "{NS} / {docid} / _data / 0xFF")
> > >>>>>>
> > >>>>>> get_revision(txn, doc_id, rev):
> > >>>>>>   vsn = txn["{NS} / {docid} / _by_rev / {rev}"]
> > >>>>>>   json_paths = txn.get_range("{NS} / {docid} / {vsn} / _data / 0x00",
> > >>>>>>                              "{NS} / {docid} / {vsn} / _data / 0xFF")
> > >>>>>>   for every json_path in json_paths:
> > >>>>>>     if value has type vsn:
> > >>>>>>       another_vsn = value
> > >>>>>>       value = txn["{NS} / {docid} / {another_vsn} / _data / {json_path}"]
> > >>>>>>     result[json_path] = value
> > >>>>>>
> > >>>>>> delete_revision(txn, doc_id, rev):
> > >>>>>>   vsn = txn["{NS} / {docid} / _by_rev / {rev}"]
> > >>>>>>   json_paths = txn.get_range("{NS} / {docid} / {vsn} / _data / 0x00",
> > >>>>>>                              "{NS} / {docid} / {vsn} / _data / 0xFF")
> > >>>>>>   for every json_path in json_paths:
> > >>>>>>     if value has type vsn:
> > >>>>>>       # remove reference to deleted revision from the owner
> > >>>>>>       del txn["{NS} / {docid} / _used_by / {json_path} / {vsn}"]
> > >>>>>>     # check if the deleted revision of json_path is used by anything else
> > >>>>>>     if txn.get_range("{NS} / {docid} / _used_by / {json_path} / {vsn}", limit=1) == []:
> > >>>>>>       del txn["{NS} / {docid} / {vsn} / _data / {json_path}"]
> > >>>>>>     if vsn is HEAD:
> > >>>>>>       copy the winning revision's range into "{NS} / {docid} / _data / {json_path}"
> > >>>>>> ```
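As a quick sanity check of the model above, the key ranges can be mimicked with an in-memory map (hypothetical helper names; tuples stand in for the `/`-separated keys, and versionstamps are plain strings here):

```python
# In-memory stand-in for the FDB keyspace used by the sketch above.
store = {}

def write(ns, docid, rev, vsn, flat_json, is_head=True):
    """flat_json: {json_path: scalar}. Term sharing is not modeled."""
    store[(ns, docid, "_by_rev", rev)] = vsn
    for path, value in flat_json.items():
        if is_head:
            # duplicate latest values into the _data range (read optimization)
            store[(ns, docid, "_data", path)] = value
        store[(ns, docid, vsn, "_data", path)] = value
    store[(ns, docid, "_rev")] = rev

def get_current(ns, docid):
    # single range read over the un-shared latest-values range
    return {k[3]: v for k, v in store.items()
            if k[:3] == (ns, docid, "_data")}

def get_revision(ns, docid, rev):
    # resolve rev -> vsn, then range read that revision's _data range
    vsn = store[(ns, docid, "_by_rev", rev)]
    return {k[4]: v for k, v in store.items()
            if k[:4] == (ns, docid, vsn, "_data")}
```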
> > >>>>>>
> > >>>>>> best regards,
> > >>>>>> iilyak
> > >>>>>>
> > >>>>>> On 2019/02/04 23:22:09, Adam Kocoloski <kocol...@apache.org> wrote:
> > >>>>>>> I think it’s fine to start a focused discussion here, as it might
> > >>>>>>> help inform some of the broader debate over in that thread.
> > >>>>>>>
> > >>>>>>> As a reminder, today CouchDB writes the entire body of each
> > >>>>>>> document revision on disk as a separate blob. Edit conflicts that
> > >>>>>>> have fields in common do not share any storage on disk. The
> > >>>>>>> revision tree is encoded into a compact format and a copy of it is
> > >>>>>>> stored directly in both the by_id tree and the by_seq tree. Each
> > >>>>>>> leaf entry in the revision tree contains a pointer to the position
> > >>>>>>> of the associated doc revision on disk.
> > >>>>>>>
> > >>>>>>> As a further reminder, CouchDB 2.x clusters can generate edit
> > >>>>>>> conflict revisions just from multiple clients concurrently updating
> > >>>>>>> the same document in a single cluster. This won’t happen when
> > >>>>>>> FoundationDB is running under the hood, but users who deploy
> > >>>>>>> multiple CouchDB or PouchDB servers and replicate between them can
> > >>>>>>> of course still produce conflicts just like they could in CouchDB
> > >>>>>>> 1.x, so we need a solution.
> > >>>>>>>
> > >>>>>>> Let’s consider the two sub-topics separately: 1) storage of edit
> > >>>>>>> conflict bodies and 2) revision trees.
> > >>>>>>>
> > >>>>>>> ## Edit Conflict Storage
> > >>>>>>>
> > >>>>>>> The simplest possible solution would be to store each document
> > >>>>>>> revision separately, like we do today. We could store document
> > >>>>>>> bodies with (“docid”, “revid”) as the key prefix, and each
> > >>>>>>> transaction could clear the key range associated with the base
> > >>>>>>> revision against which the edit is being attempted. This would
> > >>>>>>> work, but I think we can try to be a bit more clever and save on
> > >>>>>>> storage space, given that we’re splitting JSON documents into
> > >>>>>>> multiple KV pairs.
> > >>>>>>>
> > >>>>>>> One thought I’d had is to introduce a special enum Value which
> > >>>>>>> indicates that the subtree “beneath” the given Key is in conflict.
> > >>>>>>> For example, consider the documents
> > >>>>>>>
> > >>>>>>> {
> > >>>>>>>  “_id”: “foo”,
> > >>>>>>>  “_rev”: “1-abc”,
> > >>>>>>>  “owner”: “alice”,
> > >>>>>>>  “active”: true
> > >>>>>>> }
> > >>>>>>>
> > >>>>>>> and
> > >>>>>>>
> > >>>>>>> {
> > >>>>>>>  “_id”: “foo”,
> > >>>>>>>  “_rev”: “1-def”,
> > >>>>>>>  “owner”: “bob”,
> > >>>>>>>  “active”: true
> > >>>>>>> }
> > >>>>>>>
> > >>>>>>> We could represent these using the following set of KVs:
> > >>>>>>>
> > >>>>>>> (“foo”, “active”) = true
> > >>>>>>> (“foo”, “owner”) = kCONFLICT
> > >>>>>>> (“foo”, “owner”, “1-abc”) = “alice”
> > >>>>>>> (“foo”, “owner”, “1-def”) = “bob”
> > >>>>>>>
> > >>>>>>> This approach also extends to conflicts where the two versions
> > >>>>>>> have different data types. Consider a more complicated example
> > >>>>>>> where bob dropped the “active” field and changed the “owner” field
> > >>>>>>> to an object:
> > >>>>>>>
> > >>>>>>> {
> > >>>>>>> “_id”: “foo”,
> > >>>>>>> “_rev”: “1-def”,
> > >>>>>>> “owner”: {
> > >>>>>>>  “name”: “bob”,
> > >>>>>>>  “email”: “b...@example.com"
> > >>>>>>> }
> > >>>>>>> }
> > >>>>>>>
> > >>>>>>> Now the set of KVs for “foo” looks like this (note that a missing
> > >>>>>>> field needs to be handled explicitly):
> > >>>>>>>
> > >>>>>>> (“foo”, “active”) = kCONFLICT
> > >>>>>>> (“foo”, “active”, “1-abc”) = true
> > >>>>>>> (“foo”, “active”, “1-def”) = kMISSING
> > >>>>>>> (“foo”, “owner”) = kCONFLICT
> > >>>>>>> (“foo”, “owner”, “1-abc”) = “alice”
> > >>>>>>> (“foo”, “owner”, “1-def”, “name”) = “bob”
> > >>>>>>> (“foo”, “owner”, “1-def”, “email”) = “b...@example.com”
> > >>>>>>>
> > >>>>>>> I like this approach for the common case where documents share
> > >>>>>>> most of their data but have a conflict in a very specific field or
> > >>>>>>> set of fields.
> > >>>>>>>
> > >>>>>>> I’ve encountered one important downside, though: an edit that
> > >>>>>>> replicates in and conflicts with the entire document can cause a
> > >>>>>>> bit of a data explosion. Consider a case where I have 10
> > >>>>>>> conflicting versions of a 100KB document, but the conflicts are
> > >>>>>>> all related to a single scalar value. Now I replicate in an empty
> > >>>>>>> document, and suddenly I have a kCONFLICT at the root. In this
> > >>>>>>> model I now need to list out every path of every one of the 10
> > >>>>>>> existing revisions, and I end up with a 1MB update. Yuck. That’s
> > >>>>>>> technically no worse in the end state than the “zero sharing” case
> > >>>>>>> above, but one could easily imagine overrunning the transaction
> > >>>>>>> size limit this way.
> > >>>>>>>
> > >>>>>>> I suspect there’s a smart path out of this. Maybe the system
> > >>>>>>> detects a “default” value for each field and uses that instead of
> > >>>>>>> writing out the value for every revision in a conflicted subtree.
> > >>>>>>> Worth some discussion.
> > >>>>>>>
> > >>>>>>> ## Revision Trees
> > >>>>>>>
> > >>>>>>> In CouchDB we currently represent revisions as a hash history
> > >>>>>>> tree; each revision identifier is derived from the content of the
> > >>>>>>> revision, including the revision identifier of its parent.
> > >>>>>>> Individual edit branches are bounded in *length* (I believe the
> > >>>>>>> default is 1000 entries), but the number of edit branches is
> > >>>>>>> technically unbounded.
> > >>>>>>>
> > >>>>>>> The size limits in FoundationDB preclude us from storing the
> > >>>>>>> entire revision tree as a single value; in pathological situations
> > >>>>>>> the tree could exceed 100KB. Rather, I think it would make sense
> > >>>>>>> to store each edit *branch* as a separate KV. We stem the branch
> > >>>>>>> long before it hits the value size limit, and in the happy case of
> > >>>>>>> no edit conflicts this means we store the edit history metadata in
> > >>>>>>> a single KV. It also means that we can apply an interactive edit
> > >>>>>>> without retrieving the entire conflicted revision tree; we need
> > >>>>>>> only retrieve and modify the single branch against which the edit
> > >>>>>>> is being applied. The downside is that we duplicate historical
> > >>>>>>> revision identifiers shared by multiple edit branches, but I think
> > >>>>>>> this is a worthwhile tradeoff.
> > >>>>>>>
> > >>>>>>> I would furthermore try to structure the keys so that it is
> > >>>>>>> possible to retrieve the “winning” revision in a single limit=1
> > >>>>>>> range query. Ideally I’d like to provide the following properties:
> > >>>>>>>
> > >>>>>>> 1) a document read does not need to retrieve the revision tree at
> > >>>>>>> all, just the winning revision identifier (which would be stored
> > >>>>>>> with the rest of the doc)
> > >>>>>>> 2) a document update only needs to read the edit branch of the
> > >>>>>>> revision tree against which the update is being applied, and it
> > >>>>>>> can read that branch immediately, knowing only the content of the
> > >>>>>>> edit that is being attempted (i.e., it does not need to read the
> > >>>>>>> current version of the document itself).
> > >>>>>>>
> > >>>>>>> So, I’d propose a separate subspace (maybe “_meta”?) for the
> > >>>>>>> revision trees, with keys and values that look like
> > >>>>>>>
> > >>>>>>> (“_meta”, DocID, IsDeleted, RevPosition, RevHash) = [ParentRev,
> > >>>>>>> GrandparentRev, …]
> > >>>>>>>
> > >>>>>>> The inclusion of IsDeleted, RevPosition, and RevHash in the key
> > >>>>>>> should be sufficient (with the right encoding) to create a range
> > >>>>>>> query that automatically selects the “winner” according to
> > >>>>>>> CouchDB’s arcane rules, which are something like:
> > >>>>>>>
> > >>>>>>> 1) deleted=false beats deleted=true
> > >>>>>>> 2) longer paths (i.e. higher RevPosition) beat shorter ones
> > >>>>>>> 3) RevHashes with larger binary values beat ones with smaller
> > >>>>>>> values
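A toy illustration of those three rules as key order (purely illustrative: it assumes the key stores a NotDeleted flag rather than IsDeleted, so that the winner sorts last and a reverse, limit=1 range read picks it):

```python
# Keys in a hypothetical "_meta" subspace: a NotDeleted flag (1 = live),
# then RevPosition, then RevHash. Tuple order then encodes the three rules.
meta_keys = sorted([
    ("_meta", "foo", 0, 7, b"\xff"),  # deleted: loses rule 1 despite RevPos 7
    ("_meta", "foo", 1, 2, b"\xee"),
    ("_meta", "foo", 1, 3, b"\x10"),
    ("_meta", "foo", 1, 3, b"\x9f"),  # same RevPos: larger hash wins rule 3
])

# Analogue of a reverse range read with limit=1 over ("_meta", "foo").
winner = meta_keys[-1]
```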
> > >>>>>>>
> > >>>>>>> ===========
> > >>>>>>>
> > >>>>>>> OK, that’s all on this topic from me for now. I think this is a
> > >>>> particularly exciting area where we start to see the dividends of
> > splitting
> > >>>> up data into multiple KV pairs in FoundationDB :) Cheers,
> > >>>>>>>
> > >>>>>>> Adam
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>> On Feb 4, 2019, at 2:41 PM, Robert Newson <rnew...@apache.org> wrote:
> > >>>>>>>>
> > >>>>>>>> This one is quite tightly coupled to the other thread on the data
> > >>>>>>>> model; should we start much conversation here before that one
> > >>>>>>>> gets closer to a solution?
> > >>>>>>>>
> > >>>>>>>> --
> > >>>>>>>> Robert Samuel Newson
> > >>>>>>>> rnew...@apache.org
> > >>>>>>>>
> > >>>>>>>> On Mon, 4 Feb 2019, at 19:25, Ilya Khlopotov wrote:
> > >>>>>>>>> This is a beginning of a discussion thread about storage of edit
> > >>>>>>>>> conflicts and everything which relates to revisions.
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>
> > >>>>
> > >>>> --
> > >>>> Professional Support for Apache CouchDB:
> > >>>> https://neighbourhood.ie/couchdb-support/
> > >>>>
> > >>>>
> > >>>
> >
> >
