Re: # [DISCUSS] : things we need to solve/decide : storage of edit conflicts

Jan Lehnardt Fri, 08 Feb 2019 07:11:12 -0800

Heya Adam, thanks for re-focussing this. All I meant to say: let’s get your 
initial proposal implemented with the constraints you mention and take it from 
there, not make it even simpler than that.


Best
Jan
—

> On 8. Feb 2019, at 04:35, Adam Kocoloski <[email protected]> wrote:
> 
> Bob, Garren, Jan - heard you loud and clear, K.I.S.S. I do think it’s a bit 
> “simplistic" to exclusively choose simplicity over performance and storage 
> density. We’re (re)building a database here, one that has some users with 
> pretty demanding performance and scalability requirements. And yes, we should 
> certainly be testing and measuring. Kyle and team are setting up 
> infrastructure in IBM land to help with that now, but I also believe we can 
> design the data model and architecture with a basic performance model of 
> FoundationDB in mind:
> 
> - reads cost 1ms
> - short range reads are the same cost as a single lookup
> - reads of independent parts of the keyspace can be parallelized for cheap
> - writes are zero-cost until commit time
> 
> We ought to be able to use these assumptions to drive some decisions about 
> data models ahead of any end-to-end performance test.
> 
> If there are specific elements of the edit conflicts management where you 
> think greater simplicity is warranted, let’s get those called out. Ilya noted 
> (correctly, in my opinion) that the term sharing stuff is one of those items. 
> It’s relatively complex, potentially a performance hit, and only saves on 
> storage density in the corner case of lots of edit conflicts. That’s a good 
> one to drop.
> 
> I’m relatively happy with the revision history data model at this point. 
> Hopefully folks find it easy to grok, and it’s efficient for both reads and 
> writes. It costs some extra storage for conflict revisions compared to the 
> current tree representation (up to 16K per edit branch, with default 
> _revs_limit) but knowing what we know about the performance death spiral for 
> wide revision trees today I’ll happily make a storage vs. performance 
> tradeoff here :) 
> 
> Setting the shared term approach aside, I’ve still been mulling over the key 
> structure for the actual document data:
> 
> -  I thought about trying to construct a special _conflicts subspace, but I 
> don’t like that approach because the choice of a “winning" revision can flip 
> back and forth very quickly with concurrent writers to different edit 
> branches. I think we really want to have a way for revisions to naturally 
> sort themselves so the winner is the first or last revision in a list.
> 
> - Assuming we’re using key paths of the form (docid, revision-ish, path, to, 
> field), the goal here is to find an efficient way to get the last key with 
> prefix “docid” (assuming winner sorts last), and then all the keys that share 
> the same (docid, revision-ish) prefix as that one. I see two possible 
> approaches so far, neither perfect:
> 
> Option 1: Execute a get_key() operation with a key selector that asks for the 
> last key less than “docid\xFF” (again assuming winner sorts last), and then 
> do a get_range_startswith() request setting the streaming mode to “want_all” 
> and the prefix to the docid plus whatever revision-ish we found from the 
> get_key() request. This is two roundtrips instead of one, but it always 
> retrieves exactly the right set of keys, and the second step is executed as 
> fast as possible.
> 
> Option 2: Jump straight to get_range_startswith() request using only “docid” 
> as the prefix, then cancel the iteration once we reach a revision not equal 
> to the first one we see. We might transfer too much data, or we might end up 
> doing multiple roundtrips if the default “iterator” streaming mode sends too 
> little data to start (I haven’t checked what the default iteration block is 
> there), but in the typical case of zero edit conflicts we have a good chance 
> of retrieving the full document in one roundtrip.
> 
> I don’t have a good sense of which option wins out here from a performance 
> perspective, but they’re both operating on the same data model so easy enough 
> to test the alternatives. The important bit is getting the revision-ish 
> things to sort correctly. I think we can do that by generating something like
> 
> revision-ish = NotDeleted/1bit : RevPos : RevHash
> 
> with some suitable order-preserving encoding on the RevPos integer.
> 
> Apologies for the long email. Happy for any comments, either here or over on 
> IRC. Cheers,
> 
> Adam
> 
>> On Feb 7, 2019, at 4:52 PM, Robert Newson <[email protected]> wrote:
>> 
>> I think we should choose simple. We can then see if performance is too low 
>> or storage overhead too high and then see what we can do about it.
>> 
>> B.
>> 
>> -- 
>> Robert Samuel Newson
>> [email protected]
>> 
>> On Thu, 7 Feb 2019, at 20:36, Ilya Khlopotov wrote:
>>> We cannot do simple thing if we want to support sharing of JSON terms. I 
>>> think if we want the simplest path we should move sharing out of the 
>>> scope. The problem with sharing is we need to know the location of 
>>> shared terms when we do write. This means that we have to read full 
>>> document on every write. There might be tricks to replace full document 
>>> read with some sort of hierarchical signature or sketch of a document. 
>>> However these tricks do not fall into simplest solution category. We 
>>> need to choose the design goals:
>>> - simple
>>> - performance
>>> - reduced storage overhead
>>> 
>>> best regards,
>>> iilyak
>>> 
>>> On 2019/02/07 12:45:34, Garren Smith <[email protected]> wrote: 
>>>> I’m also in favor of keeping it really simple and then testing and
>>>> measuring it.
>>>> 
>>>> What is the best way to measure that we have something that works? I’m not
>>>> sure just relying on our current tests will prove that? Should we define
>>>> and build some more complex situations e.g docs with lots of conflicts or
>>>> docs with wide revisions and make sure we can solve for those?
>>>> 
>>>> On Thu, Feb 7, 2019 at 12:33 PM Jan Lehnardt <[email protected]> wrote:
>>>> 
>>>>> I’m also very much in favour with starting with the simplest thing that
>>>>> can possibly work and doesn’t go against the advertised best practices of
>>>>> FoundationDB. Let’s get that going and get a feel for how it all works
>>>>> together, before trying to optimise things we can’t measure yet.
>>>>> 
>>>>> Best
>>>>> Jan
>>>>> —
>>>>> 
>>>>>> On 6. Feb 2019, at 16:58, Robert Samuel Newson <[email protected]>
>>>>> wrote:
>>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> With the Redwood storage engine under development and with prefix
>>>>> elision part of its design, I don’t think we should get too hung up on
>>>>> adding complications and indirections in the key space just yet. We 
>>>>> haven’t
>>>>> written a line of code or run any tests, this is premature optimisation.
>>>>>> 
>>>>>> I’d like to focus on the simplest solution that yields all required
>>>>> properties. We can embellish later (if warranted).
>>>>>> 
>>>>>> I am intrigued by all the ideas that might allow us cheaper inserts and
>>>>> updates than the current code where there are multiple edit branches in 
>>>>> the
>>>>> stored document.
>>>>>> 
>>>>>> B.
>>>>>> 
>>>>>>> On 6 Feb 2019, at 02:18, Ilya Khlopotov <[email protected]> wrote:
>>>>>>> 
>>>>>>> While reading Adam's proposal I came to realize that: we don't have to
>>>>> calculate winning revision at read time.
>>>>>>> Since FDB's transactions are atomic we can calculate it when we write.
>>>>> This means we can just write latest values into separate range. This makes
>>>>> lookup of latest version fast.
>>>>>>> Another realization is if we want to share values for some json paths
>>>>> we would have to introduce a level of indirection.
>>>>>>> Bellow is the data model inspired by Adam's idea to share json_paths.
>>>>> In this model the json_path is stored in the revision where it was first
>>>>> added (we call that revision an owner of a json_path). The values for
>>>>> json_path key can be scalar values, parts of scalar values or pointers to
>>>>> owner location.
>>>>>>> The below snippets are sketches of transactions.
>>>>>>> The transactions will include updates to other keys as needed
>>>>> (`external_size`, `by_seq` and so on).  The revision tree management is 
>>>>> not
>>>>> covered yet.
>>>>>>> The `rev -> vsn` indirection is not strictly required. It is added
>>>>> because it saves some space since `rev` is a long string and `vsn` is FDB
>>>>> versionstamp of fixed size.
>>>>>>> 
>>>>>>> - `{NS} / {docid} / _by_rev / {rev} = vsn`
>>>>>>> - `{NS} / {docid} / _used_by / {json_path} / {another_vsn} = NIL`
>>>>>>> - `{NS} / {docid} / _data / {json_path} = latest_value | part`
>>>>>>> - `{NS} / {docid} / {vsn} / _data / {json_path} = value | part |
>>>>> {another_vsn}`
>>>>>>> 
>>>>>>> ```
>>>>>>> write(txn, doc_id, prev_rev, json):
>>>>>>> txn.add_write_conflict_key("{NS} / {doc_id} / _rev")
>>>>>>> rev = generate_new_rev()
>>>>>>> txn["{NS} / {docid} / _by_rev / {rev}"] = vsn
>>>>>>> for every json_path in flattened json
>>>>>>>  - {NS} / {docid} / _used_by / {json_path} / {another_vsn} = NIL
>>>>>>>  if rev is HEAD:
>>>>>>>    # this range contains values for all json paths for the latest
>>>>> revision (read optimization)
>>>>>>>    - {NS} / {docid} / _data / {json_path} = latest_value | part
>>>>>>>  - {NS} / {docid} / {vsn} / _data / {json_path} = value | part |
>>>>> {another_vsn}
>>>>>>> txn["{NS} / {doc_id} / _rev"] = rev
>>>>>>> 
>>>>>>> get_current(txn, doc_id):
>>>>>>> # there is no sharing of json_paths in this range (read optimization)
>>>>>>> txn.get_range("{NS} / {docid} / _data / 0x00", "{NS} / {docid} / _data
>>>>> / 0xFF" )
>>>>>>> 
>>>>>>> get_revision(txn, doc_id, rev):
>>>>>>> vsn = txn["{NS} / {docid} / _by_rev / {rev}"]
>>>>>>> json_paths = txn.get_range("{NS} / {vsn} / {docid} / _data / 0x00",
>>>>> "{NS} / {vsn} / {docid} / _data / 0xFF" )
>>>>>>> for every json_path in json_paths:
>>>>>>> if value has type vsn:
>>>>>>>    another_vsn = value
>>>>>>>       value = txn["{NS} / {docid} / {another_vsn} / _data /
>>>>> {json_path}"]
>>>>>>> result[json_path] = value
>>>>>>> 
>>>>>>> delete_revision(txn, doc_id, rev):
>>>>>>> vsn = txn["{NS} / {docid} / _by_rev / {rev}"]
>>>>>>> json_paths = txn.get_range("{NS} / {vsn} / {docid} / _data / 0x00",
>>>>> "{NS} / {vsn} / {docid} / _data / 0xFF" )
>>>>>>> for every json_path in json_paths:
>>>>>>> if value has type vsn:
>>>>>>>   # remove reference to deleted revision from the owner
>>>>>>>    del txn[{NS} / {docid} / _used_by / {json_path} / {vsn}]
>>>>>>> # check if deleted revision of json_path is not used by anything else
>>>>>>> if txn.get_range("{NS} / {docid} / _used_by / {json_path} / {vsn}",
>>>>> limit=1) == []:
>>>>>>>    del txn["{NS} / {docid} / {vsn} / _data / {json_path}"]
>>>>>>> if vsn is HEAD:
>>>>>>>    copy range for winning revision into "{NS} / {docid} / _data /
>>>>> {json_path}"
>>>>>>> ```
>>>>>>> 
>>>>>>> best regards,
>>>>>>> iilyak
>>>>>>> 
>>>>>>> On 2019/02/04 23:22:09, Adam Kocoloski <[email protected]> wrote:
>>>>>>>> I think it’s fine to start a focused discussion here as it might help
>>>>> inform some of the broader debate over in that thread.
>>>>>>>> 
>>>>>>>> As a reminder, today CouchDB writes the entire body of each document
>>>>> revision on disk as a separate blob. Edit conflicts that have common 
>>>>> fields
>>>>> between them do not share any storage on disk. The revision tree is 
>>>>> encoded
>>>>> into a compact format and a copy of it is stored directly in both the 
>>>>> by_id
>>>>> tree and the by_seq tree. Each leaf entry in the revision tree contain a
>>>>> pointer to the position of the associated doc revision on disk.
>>>>>>>> 
>>>>>>>> As a further reminder, CouchDB 2.x clusters can generate edit conflict
>>>>> revisions just from multiple clients concurrently updating the same
>>>>> document in a single cluster. This won’t happen when FoundationDB is
>>>>> running under the hood, but users who deploy multiple CouchDB or PouchDB
>>>>> servers and replicate between them can of course still produce conflicts
>>>>> just like they could in CouchDB 1.x, so we need a solution.
>>>>>>>> 
>>>>>>>> Let’s consider the two sub-topics separately: 1) storage of edit
>>>>> conflict bodies and 2) revision trees
>>>>>>>> 
>>>>>>>> ## Edit Conflict Storage
>>>>>>>> 
>>>>>>>> The simplest possible solution would be to store each document
>>>>> revision separately, like we do today. We could store document bodies with
>>>>> (“docid”, “revid”) as the key prefix, and each transaction could clear the
>>>>> key range associated with the base revision against which the edit is 
>>>>> being
>>>>> attempted. This would work, but I think we can try to be a bit more clever
>>>>> and save on storage space given that we’re splitting JSON documents into
>>>>> multiple KV pairs.
>>>>>>>> 
>>>>>>>> One thought I’d had is to introduce a special enum Value which
>>>>> indicates that the subtree “beneath” the given Key is in conflict. For
>>>>> example, consider the documents
>>>>>>>> 
>>>>>>>> {
>>>>>>>> “_id”: “foo”,
>>>>>>>> “_rev”: “1-abc”,
>>>>>>>> “owner”: “alice”,
>>>>>>>> “active”: true
>>>>>>>> }
>>>>>>>> 
>>>>>>>> and
>>>>>>>> 
>>>>>>>> {
>>>>>>>> “_id”: “foo”,
>>>>>>>> “_rev”: “1-def”,
>>>>>>>> “owner”: “bob”,
>>>>>>>> “active”: true
>>>>>>>> }
>>>>>>>> 
>>>>>>>> We could represent these using the following set of KVs:
>>>>>>>> 
>>>>>>>> (“foo”, “active”) = true
>>>>>>>> (“foo”, “owner”) = kCONFLICT
>>>>>>>> (“foo”, “owner”, “1-abc”) = “alice”
>>>>>>>> (“foo”, “owner”, “1-def”) = “bob”
>>>>>>>> 
>>>>>>>> This approach also extends to conflicts where the two versions have
>>>>> different data types. Consider a more complicated example where bob 
>>>>> dropped
>>>>> the “active” field and changed the “owner” field to an object:
>>>>>>>> 
>>>>>>>> {
>>>>>>>> “_id”: “foo”,
>>>>>>>> “_rev”: “1-def”,
>>>>>>>> “owner”: {
>>>>>>>> “name”: “bob”,
>>>>>>>> “email”: “[email protected]"
>>>>>>>> }
>>>>>>>> }
>>>>>>>> 
>>>>>>>> Now the set of KVs for “foo” looks like this (note that a missing
>>>>> field needs to be handled explicitly):
>>>>>>>> 
>>>>>>>> (“foo”, “active”) = kCONFLICT
>>>>>>>> (“foo”, “active”, “1-abc”) = true
>>>>>>>> (“foo”, “active”, “1-def”) = kMISSING
>>>>>>>> (“foo”, “owner”) = kCONFLICT
>>>>>>>> (“foo”, “owner”, “1-abc”) = “alice”
>>>>>>>> (“foo”, “owner”, “1-def”, “name”) = “bob”
>>>>>>>> (“foo”, “owner”, “1-def”, “email”) = “[email protected]”
>>>>>>>> 
>>>>>>>> I like this approach for the common case where documents share most of
>>>>> their data in common but have a conflict in a very specific field or set 
>>>>> of
>>>>> fields.
>>>>>>>> 
>>>>>>>> I’ve encountered one important downside, though: an edit that
>>>>> replicates in and conflicts with the entire document can cause a bit of a
>>>>> data explosion. Consider a case where I have 10 conflicting versions of a
>>>>> 100KB document, but the conflicts are all related to a single scalar 
>>>>> value.
>>>>> Now I replicate in an empty document, and suddenly I have a kCONFLICT at
>>>>> the root. In this model I now need to list out every path of every one of
>>>>> the 10 existing revisions and I end up with a 1MB update. Yuck. That’s
>>>>> technically no worse in the end state than the “zero sharing” case above,
>>>>> but one could easily imagine overrunning the transaction size limit this
>>>>> way.
>>>>>>>> 
>>>>>>>> I suspect there’s a smart path out of this. Maybe the system detects a
>>>>> “default” value for each field and uses that instead of writing out the
>>>>> value for every revision in a conflicted subtree. Worth some discussion.
>>>>>>>> 
>>>>>>>> ## Revision Trees
>>>>>>>> 
>>>>>>>> In CouchDB we currently represent revisions as a hash history tree;
>>>>> each revision identifier is derived from the content of the revision
>>>>> including the revision identifier of its parent. Individual edit branches
>>>>> are bounded in *length* (I believe the default is 1000 entries), but the
>>>>> number of edit branches is technically unbounded.
>>>>>>>> 
>>>>>>>> The size limits in FoundationDB preclude us from storing the entire
>>>>> key tree as a single value; in pathological situations the tree could
>>>>> exceed 100KB. Rather, I think it would make sense to store each edit
>>>>> *branch* as a separate KV. We stem the branch long before it hits the 
>>>>> value
>>>>> size limit, and in the happy case of no edit conflicts this means we store
>>>>> the edit history metadata in a single KV. It also means that we can apply
>>>>> an interactive edit without retrieving the entire conflicted revision 
>>>>> tree;
>>>>> we need only retrieve and modify the single branch against which the edit
>>>>> is being applied. The downside is that we duplicate historical revision
>>>>> identifiers shared by multiple edit branches, but I think this is a
>>>>> worthwhile tradeoff.
>>>>>>>> 
>>>>>>>> I would furthermore try to structure the keys so that it is possible
>>>>> to retrieve the “winning” revision in a single limit=1 range query. 
>>>>> Ideally
>>>>> I’d like to proide the following properties:
>>>>>>>> 
>>>>>>>> 1) a document read does not need to retrieve the revision tree at all,
>>>>> just the winning revision identifier (which would be stored with the rest
>>>>> of the doc)
>>>>>>>> 2) a document update only needs to read the edit branch of the
>>>>> revision tree against which the update is being applied, and it can read
>>>>> that branch immediately knowing only the content of the edit that is being
>>>>> attempted (i.e., it does not need to read the current version of the
>>>>> document itself).
>>>>>>>> 
>>>>>>>> So, I’d propose a separate subspace (maybe “_meta”?) for the revision
>>>>> trees, with keys and values that look like
>>>>>>>> 
>>>>>>>> (“_meta”, DocID, IsDeleted, RevPosition, RevHash) = [ParentRev,
>>>>> GrandparentRev, …]
>>>>>>>> 
>>>>>>>> The inclusion of IsDeleted, RevPosition and RevHash in the key should
>>>>> be sufficient (with the right encoding) to create a range query that
>>>>> automatically selects the “winner” according to CouchDB’s arcane rules,
>>>>> which are something like
>>>>>>>> 
>>>>>>>> 1) deleted=false beats deleted=true
>>>>>>>> 2) longer paths (i.e. higher RevPosition) beat shorter ones
>>>>>>>> 3) RevHashes with larger binary values beat ones with smaller values
>>>>>>>> 
>>>>>>>> ===========
>>>>>>>> 
>>>>>>>> OK, that’s all on this topic from me for now. I think this is a
>>>>> particularly exciting area where we start to see the dividends of 
>>>>> splitting
>>>>> up data into multiple KV pairs in FoundationDB :) Cheers,
>>>>>>>> 
>>>>>>>> Adam
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Feb 4, 2019, at 2:41 PM, Robert Newson <[email protected]> wrote:
>>>>>>>>> 
>>>>>>>>> This one is quite tightly coupled to the other thread on data model,
>>>>> should we start much conversation here before that one gets closer to a
>>>>> solution?
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> Robert Samuel Newson
>>>>>>>>> [email protected]
>>>>>>>>> 
>>>>>>>>> On Mon, 4 Feb 2019, at 19:25, Ilya Khlopotov wrote:
>>>>>>>>>> This is a beginning of a discussion thread about storage of edit
>>>>>>>>>> conflicts and everything which relates to revisions.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>> 
>>>>> --
>>>>> Professional Support for Apache CouchDB:
>>>>> https://neighbourhood.ie/couchdb-support/
>>>>> 
>>>>> 
>>>> 
> 

-- 
Professional Support for Apache CouchDB:
https://neighbourhood.ie/couchdb-support/

Re: # [DISCUSS] : things we need to solve/decide : storage of edit conflicts

Reply via email to