Strongly agree that we very much don't want Erlang-isms pushed into fdb. Regardless of what we end up with, I'd like to see a very strong (de)?serialization layer with significant test coverage.
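To make that concrete, here's a rough sketch of the encode/decode boundary I have in mind (plain Python for readability, purely illustrative; the real layer would be Erlang over the FDB tuple encoding, and the storage-version marker below is just a made-up name). Documents get exploded into path tuples on write and reassembled on read, with an explicit version so the on-disk layout can change later without major surgery:

```python
# Purely illustrative (plain Python, not CouchDB code). Explodes a JSON
# document into (path-tuple, scalar) pairs as in Adam's proposal and
# reassembles it. DOC_STORAGE_VERSION and the "_storage_version" key are
# made-up names for the kind of version marker discussed below.

DOC_STORAGE_VERSION = 1

def explode(doc, prefix=()):
    """Yield (path, scalar_value) pairs for a nested JSON-like document."""
    if isinstance(doc, dict):
        for key, value in doc.items():
            yield from explode(value, prefix + (key,))
    elif isinstance(doc, list):
        for index, value in enumerate(doc):
            yield from explode(value, prefix + (index,))
    else:
        yield prefix, doc  # scalar leaf: string, number, bool, or None

def assemble(pairs):
    """Rebuild the nested document from exploded (path, value) pairs."""
    root = {}
    for path, value in pairs:
        node = root
        for part in path[:-1]:
            node = node.setdefault(part, {})
        node[path[-1]] = value

    def restore_lists(node):
        # JSON object keys are strings, so all-integer keys mean "array".
        if isinstance(node, dict):
            node = {k: restore_lists(v) for k, v in node.items()}
            if node and all(isinstance(k, int) for k in node):
                return [node[i] for i in sorted(node)]
        return node

    return restore_lists(root)

doc = {"owner": "bob", "mylist": [1, 3, 5],
       "mymap": {"blue": "#0000FF", "red": "#FF0000"}}

assert assemble(explode(doc)) == doc  # the round-trip property to test hard

kvs = [(("foo",) + path, value) for path, value in explode(doc)]
kvs.append((("foo", "_storage_version"), DOC_STORAGE_VERSION))
# kvs ~ [(("foo", "owner"), "bob"), (("foo", "mylist", 0), 1), ...]
```

Property-testing that round trip, in the spirit of couch_key_tree_prop_tests, is exactly the kind of coverage I'd want on whatever layer we end up with.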
On Tue, Feb 19, 2019 at 6:54 PM Adam Kocoloski <kocol...@apache.org> wrote: > > Yes, that sort of versioning has been omitted from the various concrete > proposals but we definitely want to have it. We’ve seen the alternative in > some of the Erlang records that we serialize to disk today and it ain’t > pretty. > > I can imagine that we’ll want to have the codebase laid out in a way that > allows us to upgrade to a smarter KV encoding over time without major > surgery, which I think is a good “layer of abstraction”. I would be nervous > if we started having abstract containers of data structures pushed down into > FDB itself :) > > Adam > > > On Feb 19, 2019, at 5:41 PM, Paul Davis <paul.joseph.da...@gmail.com> wrote: > > > > A simple doc storage version number would likely be enough for future us to > > do fancier things. > > > > On Tue, Feb 19, 2019 at 4:16 PM Benjamin Anderson <banjie...@apache.org> > > wrote: > > > >>> I don’t think adding a layer of abstraction is the right move just yet, > >> I think we should continue to find consensus on one answer to this question > >> > >> Agree that the theorycrafting stage is not optimal for making > >> abstraction decisions, but I suspect it would be worthwhile somewhere > >> between prototyping and releasing. Adam's proposal does seem to me the > >> most appealing approach on the surface, and I don't see anyone signing > >> up to do the work to deliver an alternative concurrently. > >> > >> -- > >> ba > >> > >> On Tue, Feb 19, 2019 at 1:43 PM Robert Samuel Newson <rnew...@apache.org> > >> wrote: > >>> > >>> Addendum: By “directory aliasing” I meant within a document (either the > >> actual Directory thing or something equivalent of our own making). The > >> directory aliasing for each database is a good way to reduce key size > >> without a significant cost. Though if Redwood lands in time, even this > >> would become an inutile obfuscation]. > >>> > >>>> On 19 Feb 2019, at 21:39, Robert Samuel Newson <rnew...@apache.org> > >> wrote: > >>>> > >>>> Interesting suggestion, obviously the details might get the wrong kind > >> of fun. > >>>> > >>>> Somewhere above I suggested this would be something we could change > >> over time and even use different approaches for different documents within > >> the same database. This is the long way of saying there are multiple ways > >> to do this each with advantages and none without disadvantages. > >>>> > >>>> I don’t think adding a layer of abstraction is the right move just > >> yet, I think we should continue to find consensus on one answer to this > >> question (and the related ones in other threads) for the first release. > >> It’s easy to say “we can change it later”, of course. We can, though it > >> would be a chunk of work in the context of something that already works, > >> I’ve rarely seen anyone sign up for that. > >>>> > >>>> I’m fine with the first proposal from Adam, where the keys are tuples > >> of key parts pointing at terminal values. To make it easier for the first > >> version, I would exclude optimisations like deduplication or the Directory > >> aliasing or the schema thing that I suggested and that Ilya incorporated a > >> variant of in a follow-up post. We’d accept that there are limits on the > >> sizes of documents, including the awkward-to-express one about property > >> depth. > >>>> > >>>> Stepping back, I’m not seeing any essential improvement over Adam’s > >> original proposal besides the few corrections and clarifications made by > >> various authors. 
Could we start an RFC based on Adam’s original proposal on > >> document body, revision tree and index storage? We could then have PR’s > >> against that for each additional optimisation (one person’s optimisation is > >> another person’s needless complication)? > >>>> > >>>> If I’ve missed some genuine advance on the original proposal in this > >> long thread, please call it out for me. > >>>> > >>>> B. > >>>> > >>>>> On 19 Feb 2019, at 21:15, Benjamin Anderson <banjie...@apache.org> > >> wrote: > >>>>> > >>>>> As is evident by the length of this thread, there's a pretty big > >>>>> design space to cover here, and it seems unlikely we'll have arrived > >>>>> at a "correct" solution even by the time this thing ships. Perhaps it > >>>>> would be worthwhile to treat the in-FDB representation of data as a > >>>>> first-class abstraction and support multiple representations > >>>>> simultaneously? > >>>>> > >>>>> Obviously there's no such thing as a zero-cost abstraction - and I've > >>>>> not thought very hard about how far up the stack the document > >>>>> representation would need to leak - but supporting different layouts > >>>>> (primarily, as Adam points out, on the document body itself) might > >>>>> prove interesting and useful. I'm sure there are folks interested in a > >>>>> column-shaped CouchDB, for example. > >>>>> > >>>>> -- > >>>>> b > >>>>> > >>>>> On Tue, Feb 19, 2019 at 11:39 AM Robert Newson <rnew...@apache.org> > >> wrote: > >>>>>> > >>>>>> Good points on revtree, I agree with you we should store that > >> intelligently to gain the benefits you mentioned. > >>>>>> > >>>>>> -- > >>>>>> Robert Samuel Newson > >>>>>> rnew...@apache.org > >>>>>> > >>>>>> On Tue, 19 Feb 2019, at 18:41, Adam Kocoloski wrote: > >>>>>>> I do not think we should store the revtree as a blob. The design > >> where > >>>>>>> each edit branch is its own KV should save on network IO and CPU > >> cycles > >>>>>>> for normal updates. We’ve performed too many heroics to keep > >>>>>>> couch_key_tree from stalling entire databases when trying to update > >> a > >>>>>>> single document with a wide revision tree, I would much prefer to > >> ignore > >>>>>>> other edit branches entirely when all we’re doing is extending one > >> of > >>>>>>> them. > >>>>>>> > >>>>>>> I also do not think we should store JSON documents as blobs, but > >> it’s a > >>>>>>> closer call. Some of my reasoning for preferring the exploded path > >>>>>>> design: > >>>>>>> > >>>>>>> - it lends itself nicely to sub-document operations, for which Jan > >>>>>>> crafted an RFC last year: > >> https://github.com/apache/couchdb/issues/1559 > >>>>>>> - it optimizes the creation of Mango indexes on existing databases > >> since > >>>>>>> we only need to retrieve the value(s) we want to index > >>>>>>> - it optimizes Mango queries that use field selectors > >>>>>>> - anyone who wanted to try their hand at GraphQL will find it very > >>>>>>> handy: https://github.com/apache/couchdb/issues/1499 > >>>>>>> - looking further ahead, it lets us play with smarter leaf value > >> types > >>>>>>> like Counters (yes I’m still on the CRDT bandwagon, sorry) > >>>>>>> > >>>>>>> A few comments on the thread: > >>>>>>> > >>>>>>>>>> * Most documents bodies are probably going to be smaller than > >> 100k. So in > >>>>>>>>>> the majority of case it would be one write / one read to update > >> and fetch > >>>>>>>>>> the document body. 
> >>>>>>> > >>>>>>> We should test, but I expect reading 50KB of data in a range query > >> is > >>>>>>> almost as efficient as reading a single 50 KB value. Similarly, > >> writes > >>>>>>> to a contiguous set of keys should be quite efficient. > >>>>>>> > >>>>>>> I am concerned about the overhead of the repeated field paths in the > >>>>>>> keys with the exploded path option in the absence of key prefix > >>>>>>> compression. That would be my main reason to acquiesce and throw > >> away > >>>>>>> all the document structure. > >>>>>>> > >>>>>>> Adam > >>>>>>> > >>>>>>>> On Feb 19, 2019, at 12:04 PM, Robert Newson <rnew...@apache.org> > >> wrote: > >>>>>>>> > >>>>>>>> I like the idea that we'd reuse the same pattern (but perhaps not > >> the same _code_) for doc bodies, revtree and attachments. > >>>>>>>> > >>>>>>>> I hope we still get to delete couch_key_tree.erl, though. > >>>>>>>> > >>>>>>>> -- > >>>>>>>> Robert Samuel Newson > >>>>>>>> rnew...@apache.org > >>>>>>>> > >>>>>>>> On Tue, 19 Feb 2019, at 17:03, Jan Lehnardt wrote: > >>>>>>>>> I like the idea from a “trying a simple thing first” perspective, > >> but > >>>>>>>>> Nick’s points below are especially convincing to with this for > >> now. > >>>>>>>>> > >>>>>>>>> Best > >>>>>>>>> Jan > >>>>>>>>> — > >>>>>>>>> > >>>>>>>>>> On 19. Feb 2019, at 17:53, Nick Vatamaniuc <vatam...@gmail.com> > >> wrote: > >>>>>>>>>> > >>>>>>>>>> Hi, > >>>>>>>>>> > >>>>>>>>>> Sorry for jumping in so late, I was following from the sidelines > >> mostly. A > >>>>>>>>>> lot of good discussion happening and am excited about the > >> possibilities > >>>>>>>>>> here. > >>>>>>>>>> > >>>>>>>>>> I do like the simpler "chunking" approach for a few reasons: > >>>>>>>>>> > >>>>>>>>>> * Most documents bodies are probably going to be smaller than > >> 100k. So in > >>>>>>>>>> the majority of case it would be one write / one read to update > >> and fetch > >>>>>>>>>> the document body. > >>>>>>>>>> > >>>>>>>>>> * We could reuse the chunking code for attachment handling and > >> possibly > >>>>>>>>>> revision key trees. So it's the general pattern of upload chunks > >> to some > >>>>>>>>>> prefix, and when finished flip an atomic toggle to make it > >> current. > >>>>>>>>>> > >>>>>>>>>> * Do the same thing with revision trees and we could re-use the > >> revision > >>>>>>>>>> tree manipulation logic. That is, the key tree in most cases > >> would be small > >>>>>>>>>> enough to fit in 100k but if they get huge, they'd get chunked. > >> This would > >>>>>>>>>> allow us to reuse all the battle tested couch_key_tree code > >> mostly as is. > >>>>>>>>>> We even have property tests for it > >>>>>>>>>> > >> https://github.com/apache/couchdb/blob/master/src/couch/test/couch_key_tree_prop_tests.erl > >>>>>>>>>> > >>>>>>>>>> * It removes the need to explain the max exploded path length > >> limitation to > >>>>>>>>>> customers. > >>>>>>>>>> > >>>>>>>>>> Cheers, > >>>>>>>>>> -Nick > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> On Tue, Feb 19, 2019 at 11:18 AM Robert Newson < > >> rnew...@apache.org> wrote: > >>>>>>>>>> > >>>>>>>>>>> Hi, > >>>>>>>>>>> > >>>>>>>>>>> An alternative storage model that we should seriously consider > >> is to > >>>>>>>>>>> follow our current approach in couch_file et al. Specifically, > >> that the > >>>>>>>>>>> document _body_ is stored as an uninterpreted binary value. 
> >> This would be > >>>>>>>>>>> much like the obvious plan for attachment storage; a key prefix > >> that > >>>>>>>>>>> identifies the database and document, with the final item of > >> that key tuple > >>>>>>>>>>> is an incrementing integer. Each of those keys has a binary > >> value of up to > >>>>>>>>>>> 100k. Fetching all values with that key prefix, in fdb's > >> natural ordering, > >>>>>>>>>>> will yield the full document body, which can be JSON decoded > >> for further > >>>>>>>>>>> processing. > >>>>>>>>>>> > >>>>>>>>>>> I like this idea, and I like Adam's original proposal to > >> explode documents > >>>>>>>>>>> into property paths. I have a slight preference for the > >> simplicity of the > >>>>>>>>>>> idea in the previous paragraph, not least because it's close to > >> what we do > >>>>>>>>>>> today. I also think it will be possible to migrate to > >> alternative storage > >>>>>>>>>>> models in future, and foundationdb's transaction supports means > >> we can do > >>>>>>>>>>> this migration seamlessly should we come to it. > >>>>>>>>>>> > >>>>>>>>>>> I'm very interested in knowing if anyone else is interested in > >> going this > >>>>>>>>>>> simple, or considers it a wasted opportunity relative to the > >> 'exploded' > >>>>>>>>>>> path. > >>>>>>>>>>> > >>>>>>>>>>> B. > >>>>>>>>>>> > >>>>>>>>>>> -- > >>>>>>>>>>> Robert Samuel Newson > >>>>>>>>>>> rnew...@apache.org > >>>>>>>>>>> > >>>>>>>>>>> On Mon, 4 Feb 2019, at 19:59, Robert Newson wrote: > >>>>>>>>>>>> I've been remiss here in not posting the data model ideas that > >> IBM > >>>>>>>>>>>> worked up while we were thinking about using FoundationDB so > >> I'm posting > >>>>>>>>>>>> it now. This is Adam' Kocoloski's original work, I am just > >> transcribing > >>>>>>>>>>>> it, and this is the context that the folks from the IBM side > >> came in > >>>>>>>>>>>> with, for full disclosure. > >>>>>>>>>>>> > >>>>>>>>>>>> Basics > >>>>>>>>>>>> > >>>>>>>>>>>> 1. All CouchDB databases are inside a Directory > >>>>>>>>>>>> 2. Each CouchDB database is a Directory within that Directory > >>>>>>>>>>>> 3. It's possible to list all subdirectories of a Directory, so > >>>>>>>>>>>> `_all_dbs` is the list of directories from 1. > >>>>>>>>>>>> 4. Each Directory representing a CouchdB database has several > >> Subspaces; > >>>>>>>>>>>> 4a. by_id/ doc subspace: actual document contents > >>>>>>>>>>>> 4b. by_seq/versionstamp subspace: for the _changes feed > >>>>>>>>>>>> 4c. index_definitions, indexes, ... > >>>>>>>>>>>> > >>>>>>>>>>>> JSON Mapping > >>>>>>>>>>>> > >>>>>>>>>>>> A hierarchical JSON object naturally maps to multiple KV pairs > >> in FDB: > >>>>>>>>>>>> > >>>>>>>>>>>> { > >>>>>>>>>>>> “_id”: “foo”, > >>>>>>>>>>>> “owner”: “bob”, > >>>>>>>>>>>> “mylist”: [1,3,5], > >>>>>>>>>>>> “mymap”: { > >>>>>>>>>>>> “blue”: “#0000FF”, > >>>>>>>>>>>> “red”: “#FF0000” > >>>>>>>>>>>> } > >>>>>>>>>>>> } > >>>>>>>>>>>> > >>>>>>>>>>>> maps to > >>>>>>>>>>>> > >>>>>>>>>>>> (“foo”, “owner”) = “bob” > >>>>>>>>>>>> (“foo”, “mylist”, 0) = 1 > >>>>>>>>>>>> (“foo”, “mylist”, 1) = 3 > >>>>>>>>>>>> (“foo”, “mylist”, 2) = 5 > >>>>>>>>>>>> (“foo”, “mymap”, “blue”) = “#0000FF” > >>>>>>>>>>>> (“foo”, “mymap”, “red”) = “#FF0000” > >>>>>>>>>>>> > >>>>>>>>>>>> NB: this means that the 100KB limit applies to individual > >> leafs in the > >>>>>>>>>>>> JSON object, not the entire doc > >>>>>>>>>>>> > >>>>>>>>>>>> Edit Conflicts > >>>>>>>>>>>> > >>>>>>>>>>>> We need to account for the presence of conflicts in various > >> levels of > >>>>>>>>>>>> the doc due to replication. 
> >>>>>>>>>>>> > >>>>>>>>>>>> Proposal is to create a special value indicating that the > >> subtree below > >>>>>>>>>>>> our current cursor position is in an unresolvable conflict. > >> Then add > >>>>>>>>>>>> additional KV pairs below to describe the conflicting entries. > >>>>>>>>>>>> > >>>>>>>>>>>> KV data model allows us to store these efficiently and minimize > >>>>>>>>>>>> duplication of data: > >>>>>>>>>>>> > >>>>>>>>>>>> A document with these two conflicts: > >>>>>>>>>>>> > >>>>>>>>>>>> { > >>>>>>>>>>>> “_id”: “foo”, > >>>>>>>>>>>> “_rev”: “1-abc”, > >>>>>>>>>>>> “owner”: “alice”, > >>>>>>>>>>>> “active”: true > >>>>>>>>>>>> } > >>>>>>>>>>>> { > >>>>>>>>>>>> “_id”: “foo”, > >>>>>>>>>>>> “_rev”: “1-def”, > >>>>>>>>>>>> “owner”: “bob”, > >>>>>>>>>>>> “active”: true > >>>>>>>>>>>> } > >>>>>>>>>>>> > >>>>>>>>>>>> could be stored thus: > >>>>>>>>>>>> > >>>>>>>>>>>> (“foo”, “active”) = true > >>>>>>>>>>>> (“foo”, “owner”) = kCONFLICT > >>>>>>>>>>>> (“foo”, “owner”, “1-abc”) = “alice” > >>>>>>>>>>>> (“foo”, “owner”, “1-def”) = “bob” > >>>>>>>>>>>> > >>>>>>>>>>>> So long as `kCONFLICT` is set at the top of the conflicting > >> subtree this > >>>>>>>>>>>> representation can handle conflicts of different data types as > >> well. > >>>>>>>>>>>> > >>>>>>>>>>>> Missing fields need to be handled explicitly: > >>>>>>>>>>>> > >>>>>>>>>>>> { > >>>>>>>>>>>> “_id”: “foo”, > >>>>>>>>>>>> “_rev”: “1-abc”, > >>>>>>>>>>>> “owner”: “alice”, > >>>>>>>>>>>> “active”: true > >>>>>>>>>>>> } > >>>>>>>>>>>> > >>>>>>>>>>>> { > >>>>>>>>>>>> “_id”: “foo”, > >>>>>>>>>>>> “_rev”: “1-def”, > >>>>>>>>>>>> “owner”: { > >>>>>>>>>>>> “name”: “bob”, > >>>>>>>>>>>> “email”: “ > >>>>>>>>>>>> b...@example.com > >>>>>>>>>>>> " > >>>>>>>>>>>> } > >>>>>>>>>>>> } > >>>>>>>>>>>> > >>>>>>>>>>>> could be stored thus: > >>>>>>>>>>>> > >>>>>>>>>>>> (“foo”, “active”) = kCONFLICT > >>>>>>>>>>>> (“foo”, “active”, “1-abc”) = true > >>>>>>>>>>>> (“foo”, “active”, “1-def”) = kMISSING > >>>>>>>>>>>> (“foo”, “owner”) = kCONFLICT > >>>>>>>>>>>> (“foo”, “owner”, “1-abc”) = “alice” > >>>>>>>>>>>> (“foo”, “owner”, “1-def”, “name”) = “bob” > >>>>>>>>>>>> (“foo”, “owner”, “1-def”, “email”) = ... 
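For illustration, a minimal sketch (hypothetical names, not an agreed encoding) of how a reader would collapse the kCONFLICT / kMISSING markers above into a single revision's view of the document:

```python
# Illustrative only: collapsing the kCONFLICT / kMISSING annotations above
# into one revision's view of the document. The sentinel objects, the kvs
# literal, and view_for_rev are made-up names, not an agreed encoding.

kCONFLICT = object()
kMISSING = object()

kvs = {
    ("foo", "active"): kCONFLICT,
    ("foo", "active", "1-abc"): True,
    ("foo", "active", "1-def"): kMISSING,
    ("foo", "owner"): kCONFLICT,
    ("foo", "owner", "1-abc"): "alice",
    ("foo", "owner", "1-def", "name"): "bob",
    ("foo", "owner", "1-def", "email"): "bob@example.com",  # sample value
}

def view_for_rev(kvs, rev):
    """Return {path: value} for one revision, resolving conflict markers."""
    out = {}
    for path, value in kvs.items():
        if value is kCONFLICT:
            continue  # structural marker, carries no data itself
        if rev in path:
            i = path.index(rev)
            path = path[:i] + path[i + 1:]  # drop the rev component
        elif any(kvs.get(path[:j]) is kCONFLICT for j in range(1, len(path))):
            continue  # lives under a conflict marker but belongs to another rev
        if value is kMISSING:
            continue  # field absent in this revision
        out[path[1:]] = value  # strip the doc id
    return out

assert view_for_rev(kvs, "1-abc") == {("active",): True, ("owner",): "alice"}
assert view_for_rev(kvs, "1-def") == {("owner", "name"): "bob",
                                      ("owner", "email"): "bob@example.com"}
```

A real implementation would presumably resolve the markers while streaming a range read over the document's key prefix rather than materializing a dict first.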
> >>>>>>>>>>>> > >>>>>>>>>>>> Revision Metadata > >>>>>>>>>>>> > >>>>>>>>>>>> * CouchDB uses a hash history for revisions > >>>>>>>>>>>> ** Each edit is identified by the hash of the content of the > >> edit > >>>>>>>>>>>> including the base revision against which it was applied > >>>>>>>>>>>> ** Individual edit branches are bounded in length but the > >> number of > >>>>>>>>>>>> branches is potentially unbounded > >>>>>>>>>>>> > >>>>>>>>>>>> * Size limits preclude us from storing the entire key tree as > >> a single > >>>>>>>>>>>> value; in pathological situations > >>>>>>>>>>>> the tree could exceed 100KB (each entry is > 16 bytes) > >>>>>>>>>>>> > >>>>>>>>>>>> * Store each edit branch as a separate KV including deleted > >> status in a > >>>>>>>>>>>> special subspace > >>>>>>>>>>>> > >>>>>>>>>>>> * Structure key representation so that “winning” revision can > >> be > >>>>>>>>>>>> automatically retrieved in a limit=1 > >>>>>>>>>>>> key range operation > >>>>>>>>>>>> > >>>>>>>>>>>> (“foo”, “_meta”, “deleted=false”, 1, “def”) = [] > >>>>>>>>>>>> (“foo”, “_meta”, “deleted=false”, 4, “bif”) = > >> [“3-baz”,”2-bar”,”1-foo”] > >>>>>>>>>>>> <-- winner > >>>>>>>>>>>> (“foo”, “_meta”, “deleted=true”, 3, “abc”) = [“2-bar”, “1-foo”] > >>>>>>>>>>>> > >>>>>>>>>>>> Changes Feed > >>>>>>>>>>>> > >>>>>>>>>>>> * FDB supports a concept called a versionstamp — a 10 byte, > >> unique, > >>>>>>>>>>>> monotonically (but not sequentially) increasing value for each > >> committed > >>>>>>>>>>>> transaction. The first 8 bytes are the committed version of the > >>>>>>>>>>>> database. The last 2 bytes are monotonic in the serialization > >> order for > >>>>>>>>>>>> transactions. > >>>>>>>>>>>> > >>>>>>>>>>>> * A transaction can specify a particular index into a key > >> where the > >>>>>>>>>>>> following 10 bytes will be overwritten by the versionstamp at > >> commit > >>>>>>>>>>>> time > >>>>>>>>>>>> > >>>>>>>>>>>> * A subspace keyed on versionstamp naturally yields a _changes > >> feed > >>>>>>>>>>>> > >>>>>>>>>>>> by_seq subspace > >>>>>>>>>>>> (“versionstamp1”) = (“foo”, “1-abc”) > >>>>>>>>>>>> (“versionstamp4”) = (“bar”, “4-def”) > >>>>>>>>>>>> > >>>>>>>>>>>> by_id subspace > >>>>>>>>>>>> (“bar”, “_vsn”) = “versionstamp4” > >>>>>>>>>>>> ... > >>>>>>>>>>>> (“foo”, “_vsn”) = “versionstamp1” > >>>>>>>>>>>> > >>>>>>>>>>>> JSON Indexes > >>>>>>>>>>>> > >>>>>>>>>>>> * “Mango” JSON indexes are defined by > >>>>>>>>>>>> ** a list of field names, each of which may be nested, > >>>>>>>>>>>> ** an optional partial_filter_selector which constrains the > >> set of docs > >>>>>>>>>>>> that contribute > >>>>>>>>>>>> ** an optional name defined by the ddoc field (the name is > >> auto- > >>>>>>>>>>>> generated if not supplied) > >>>>>>>>>>>> > >>>>>>>>>>>> * Store index definitions in a single subspace to aid query > >> planning > >>>>>>>>>>>> ** ((person,name), title, email) = (“name-title-email”, > >> “{“student”: > >>>>>>>>>>>> true}”) > >>>>>>>>>>>> ** Store the values for each index in a dedicated subspace, > >> adding the > >>>>>>>>>>>> document ID as the last element in the tuple > >>>>>>>>>>>> *** (“rosie revere”, “engineer”, “ro...@example.com", “foo”) > >> = null > >>>>>>>>>>>> > >>>>>>>>>>>> B. > >>>>>>>>>>>> > >>>>>>>>>>>> -- > >>>>>>>>>>>> Robert Samuel Newson > >>>>>>>>>>>> rnew...@apache.org > >>>>>>>>>>>> > >>>>>>>>>>>> On Mon, 4 Feb 2019, at 19:13, Ilya Khlopotov wrote: > >>>>>>>>>>>>> > >>>>>>>>>>>>> I want to fix previous mistakes. 
I did two mistakes in > >> previous > >>>>>>>>>>>>> calculations: > >>>>>>>>>>>>> - I used 1Kb as base size for calculating expansion factor > >> (although > >>>>>>>>>>> we > >>>>>>>>>>>>> don't know exact size of original document) > >>>>>>>>>>>>> - The expansion factor calculation included number of > >> revisions (it > >>>>>>>>>>>>> shouldn't) > >>>>>>>>>>>>> > >>>>>>>>>>>>> I'll focus on flattened JSON docs model > >>>>>>>>>>>>> > >>>>>>>>>>>>> The following formula is used in previous calculation. > >>>>>>>>>>>>> > >> storage_size_per_document=mapping_table_size*number_of_revisions + > >>>>>>>>>>>>> depth*number_of_paths*number_of_revisions + > >>>>>>>>>>>>> number_of_paths*value_size*number_of_revisions > >>>>>>>>>>>>> > >>>>>>>>>>>>> To clarify things a little bit I want to calculate space > >> requirement > >>>>>>>>>>> for > >>>>>>>>>>>>> single revision this time. > >>>>>>>>>>>>> > >> mapping_table_size=field_name_size*(field_name_length+4(integer > >>>>>>>>>>>>> size))=100 * (20 + 4(integer size)) = 2400 bytes > >>>>>>>>>>>>> > >> storage_size_per_document_per_revision_per_replica=mapping_table_size > >>>>>>>>>>> + > >>>>>>>>>>>>> depth*number_of_paths + value_size*number_of_paths = > >>>>>>>>>>>>> 2400bytes + 10*1000+1000*100=112400bytes~=110 Kb > >>>>>>>>>>>>> > >>>>>>>>>>>>> We definitely can reduce requirement for mapping table by > >> adopting > >>>>>>>>>>>>> rnewson's idea of a schema. > >>>>>>>>>>>>> > >>>>>>>>>>>>> On 2019/02/04 11:08:16, Ilya Khlopotov <iil...@apache.org> > >> wrote: > >>>>>>>>>>>>>> Hi Michael, > >>>>>>>>>>>>>> > >>>>>>>>>>>>>>> For example, hears a crazy thought: > >>>>>>>>>>>>>>> Map every distinct occurence of a key/value instance > >> through a > >>>>>>>>>>> crypto hash > >>>>>>>>>>>>>>> function to get a set of hashes. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> These can be be precomputed by Couch without any lookups in > >> FDB. > >>>>>>>>>>> These > >>>>>>>>>>>>>>> will be spread all over kingdom come in FDB and not lend > >>>>>>>>>>> themselves to > >>>>>>>>>>>>>>> range search well. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> So what you do is index them for frequency of occurring in > >> the > >>>>>>>>>>> same set. > >>>>>>>>>>>>>>> In essence, you 'bucket them' statistically, and that > >> bucket id > >>>>>>>>>>> becomes a > >>>>>>>>>>>>>>> key prefix. A crypto hash value can be copied into more > >> than one > >>>>>>>>>>> bucket. > >>>>>>>>>>>>>>> The {bucket_id}/{cryptohash} becomes a {val_id} > >>>>>>>>>>>>>> > >>>>>>>>>>>>>>> When writing a document, Couch submits the list/array of > >>>>>>>>>>> cryptohash values > >>>>>>>>>>>>>>> it computed to FDB and gets back the corresponding > >> {val_id} (the > >>>>>>>>>>> id with > >>>>>>>>>>>>>>> the bucket prefixed). This can get somewhat expensive if > >> there's > >>>>>>>>>>> always a > >>>>>>>>>>>>>>> lot of app local cache misses. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> A document's value is then a series of {val_id} arrays up > >> to 100k > >>>>>>>>>>> per > >>>>>>>>>>>>>>> segment. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> When retrieving a document, you get the val_ids, find the > >> distinct > >>>>>>>>>>> buckets > >>>>>>>>>>>>>>> and min/max entries for this doc, and then parallel query > >> each > >>>>>>>>>>> bucket while > >>>>>>>>>>>>>>> reconstructing the document. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Interesting idea. Let's try to think it through to see if we > >> can > >>>>>>>>>>> make it viable. > >>>>>>>>>>>>>> Let's go through hypothetical example. 
Input data for the > >> example: > >>>>>>>>>>>>>> - 1M of documents > >>>>>>>>>>>>>> - each document is around 10Kb > >>>>>>>>>>>>>> - each document consists of 1K of unique JSON paths > >>>>>>>>>>>>>> - each document has 100 unique JSON field names > >>>>>>>>>>>>>> - every scalar value is 100 bytes > >>>>>>>>>>>>>> - 10% of unique JSON paths for every document already stored > >> in > >>>>>>>>>>> database under different doc or different revision of the > >> current one > >>>>>>>>>>>>>> - we assume 3 independent copies for every key-value pair in > >> FDB > >>>>>>>>>>>>>> - our hash key size is 32 bytes > >>>>>>>>>>>>>> - let's assume we can determine if key is already on the > >> storage > >>>>>>>>>>> without doing query > >>>>>>>>>>>>>> - 1% of paths is in cache (unrealistic value, in real live > >> the > >>>>>>>>>>> percentage is lower) > >>>>>>>>>>>>>> - every JSON field name is 20 bytes > >>>>>>>>>>>>>> - every JSON path is 10 levels deep > >>>>>>>>>>>>>> - document key prefix length is 50 > >>>>>>>>>>>>>> - every document has 10 revisions > >>>>>>>>>>>>>> Let's estimate the storage requirements and size of data we > >> need to > >>>>>>>>>>> transmit. The calculations are not exact. > >>>>>>>>>>>>>> 1. storage_size_per_document (we cannot estimate exact > >> numbers since > >>>>>>>>>>> we don't know how FDB stores it) > >>>>>>>>>>>>>> - 10 * ((10Kb - (10Kb * 10%)) + (1K - (1K * 10%)) * 32 > >> bytes) = > >>>>>>>>>>> 38Kb * 10 * 3 = 1140 Kb (11x) > >>>>>>>>>>>>>> 2. number of independent keys to retrieve on document read > >>>>>>>>>>> (non-range queries) per document > >>>>>>>>>>>>>> - 1K - (1K * 1%) = 990 > >>>>>>>>>>>>>> 3. number of range queries: 0 > >>>>>>>>>>>>>> 4. data to transmit on read: (1K - (1K * 1%)) * (100 bytes + > >> 32 > >>>>>>>>>>> bytes) = 102 Kb (10x) > >>>>>>>>>>>>>> 5. read latency (we use 2ms per read based on numbers from > >>>>>>>>>>> https://apple.github.io/foundationdb/performance.html) > >>>>>>>>>>>>>> - sequential: 990*2ms = 1980ms > >>>>>>>>>>>>>> - range: 0 > >>>>>>>>>>>>>> Let's compare these numbers with initial proposal (flattened > >> JSON > >>>>>>>>>>> docs without global schema and without cache) > >>>>>>>>>>>>>> 1. storage_size_per_document > >>>>>>>>>>>>>> - mapping table size: 100 * (20 + 4(integer size)) = 2400 > >> bytes > >>>>>>>>>>>>>> - key size: (10 * (4 + 1(delimiter))) + 50 = 100 bytes > >>>>>>>>>>>>>> - storage_size_per_document: 2.4K*10 + 100*1K*10 + 1K*100*10 > >> = > >>>>>>>>>>> 2024K = 1976 Kb * 3 = 5930 Kb (59.3x) > >>>>>>>>>>>>>> 2. number of independent keys to retrieve: 0-2 (depending on > >> index > >>>>>>>>>>> structure) > >>>>>>>>>>>>>> 3. number of range queries: 1 (1001 of keys in result) > >>>>>>>>>>>>>> 4. data to transmit on read: 24K + 1000*100 + 1000*100 = > >> 23.6 Kb > >>>>>>>>>>> (2.4x) > >>>>>>>>>>>>>> 5. read latency (we use 2ms per read based on numbers from > >>>>>>>>>>> https://apple.github.io/foundationdb/performance.html and > >> estimate range > >>>>>>>>>>> read performance based on numbers from > >>>>>>>>>>> > >> https://apple.github.io/foundationdb/benchmarking.html#single-core-read-test > >>>>>>>>>>> ) > >>>>>>>>>>>>>> - range read performance: Given read performance is about > >> 305,000 > >>>>>>>>>>> reads/second and range performance 3,600,000 keys/second we > >> estimate range > >>>>>>>>>>> performance to be 11.8x compared to read performance. If read > >> performance > >>>>>>>>>>> is 2ms than range performance is 0.169ms (which is hard to > >> believe). 
> >>>>>>>>>>>>>> - sequential: 2 * 2 = 4ms > >>>>>>>>>>>>>> - range: 0.169 > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> It looks like we are dealing with a tradeoff: > >>>>>>>>>>>>>> - Map every distinct occurrence of a key/value instance > >> through a > >>>>>>>>>>> crypto hash: > >>>>>>>>>>>>>> - 5.39x more disk space efficient > >>>>>>>>>>>>>> - 474x slower > >>>>>>>>>>>>>> - flattened JSON model > >>>>>>>>>>>>>> - 5.39x less efficient in disk space > >>>>>>>>>>>>>> - 474x faster > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> In any case this unscientific exercise was very helpful. > >> Since it > >>>>>>>>>>> uncovered the high cost in terms of disk space. 59.3x of > >> original disk size > >>>>>>>>>>> is too much IMO. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Are the any ways we can make Michael's model more performant? > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Also I don't quite understand few aspects of the global hash > >> table > >>>>>>>>>>> proposal: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> 1. > - Map every distinct occurence of a key/value instance > >> through > >>>>>>>>>>> a crypto hash function to get a set of hashes. > >>>>>>>>>>>>>> I think we are talking only about scalar values here? I.e. > >>>>>>>>>>> `"#/foo.bar.baz": 123` > >>>>>>>>>>>>>> Since I don't know how we can make it work for all possible > >> JSON > >>>>>>>>>>> paths `{"foo": {"bar": {"size": 12, "baz": 123}}}": > >>>>>>>>>>>>>> - foo > >>>>>>>>>>>>>> - foo.bar > >>>>>>>>>>>>>> - foo.bar.baz > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> 2. how to delete documents > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Best regards, > >>>>>>>>>>>>>> ILYA > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> On 2019/01/30 23:33:22, Michael Fair < > >> mich...@daclubhouse.net> > >>>>>>>>>>> wrote: > >>>>>>>>>>>>>>> On Wed, Jan 30, 2019, 12:57 PM Adam Kocoloski < > >> kocol...@apache.org > >>>>>>>>>>> wrote: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Hi Michael, > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> The trivial fix is to use DOCID/REVISIONID as DOC_KEY. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Yes that’s definitely one way to address storage of edit > >>>>>>>>>>> conflicts. I > >>>>>>>>>>>>>>>> think there are other, more compact representations that > >> we can > >>>>>>>>>>> explore if > >>>>>>>>>>>>>>>> we have this “exploded” data model where each scalar value > >> maps > >>>>>>>>>>> to an > >>>>>>>>>>>>>>>> individual KV pair. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> I agree, as I mentioned on the original thread, I see a > >> scheme, > >>>>>>>>>>> that > >>>>>>>>>>>>>>> handles both conflicts and revisions, where you only have > >> to store > >>>>>>>>>>> the most > >>>>>>>>>>>>>>> recent change to a field. Like you suggested, multiple > >> revisions > >>>>>>>>>>> can share > >>>>>>>>>>>>>>> a key. Which in my mind's eye further begs the > >> conflicts/revisions > >>>>>>>>>>>>>>> discussion along with the working within the limits > >> discussion > >>>>>>>>>>> because it > >>>>>>>>>>>>>>> seems to me they are all intrinsically related as a > >> "feature". > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Saying 'We'll break documents up into roughly 80k > >> segments', then > >>>>>>>>>>> trying to > >>>>>>>>>>>>>>> overlay some kind of field sharing scheme for > >> revisions/conflicts > >>>>>>>>>>> doesn't > >>>>>>>>>>>>>>> seem like it will work. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> I probably should have left out the trivial fix proposal as > >> I > >>>>>>>>>>> don't think > >>>>>>>>>>>>>>> it's a feasible solution to actually use. 
> >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> The comment is more regarding that I do not see how this > >> thread > >>>>>>>>>>> can escape > >>>>>>>>>>>>>>> including how to store/retrieve conflicts/revisions. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> For instance, the 'doc as individual fields' proposal lends > >> itself > >>>>>>>>>>> to value > >>>>>>>>>>>>>>> sharing across mutiple documents (and I don't just mean > >> revisions > >>>>>>>>>>> of the > >>>>>>>>>>>>>>> same doc, I mean the same key/value instance could be > >> shared for > >>>>>>>>>>> every > >>>>>>>>>>>>>>> document). > >>>>>>>>>>>>>>> However that's not really relevant if we're not considering > >> the > >>>>>>>>>>> amount of > >>>>>>>>>>>>>>> shared information across documents in the storage scheme. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Simply storing documents in <100k segments (perhaps in some > >> kind of > >>>>>>>>>>>>>>> compressed binary representation) to deal with that FDB > >> limit > >>>>>>>>>>> seems fine. > >>>>>>>>>>>>>>> The only reason to consider doing something else is because > >> of its > >>>>>>>>>>> impact > >>>>>>>>>>>>>>> to indexing, searches, reduce functions, revisions, on-disk > >> size > >>>>>>>>>>> impact, > >>>>>>>>>>>>>>> etc. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> I'm assuming the process will flatten the key paths of the > >>>>>>>>>>> document into > >>>>>>>>>>>>>>>> an array and then request the value of each key as multiple > >>>>>>>>>>> parallel > >>>>>>>>>>>>>>>> queries against FDB at once > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Ah, I think this is not one of Ilya’s assumptions. He’s > >> trying > >>>>>>>>>>> to design a > >>>>>>>>>>>>>>>> model which allows the retrieval of a document with a > >> single > >>>>>>>>>>> range read, > >>>>>>>>>>>>>>>> which is a good goal in my opinion. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> I am not sure I agree. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Think of bitTorrent, a single range read should pull back > >> the > >>>>>>>>>>> structure of > >>>>>>>>>>>>>>> the document (the pieces to fetch), but not necessarily the > >> whole > >>>>>>>>>>> document. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> What if you already have a bunch of pieces in common with > >> other > >>>>>>>>>>> documents > >>>>>>>>>>>>>>> locally (a repeated header/footer/ or type for example); > >> and you > >>>>>>>>>>> only need > >>>>>>>>>>>>>>> to get a few pieces of data you don't already have? > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> The real goal to Couch I see is to treat your document set > >> like the > >>>>>>>>>>>>>>> collection of structured information that it is. In some > >> respects > >>>>>>>>>>> like an > >>>>>>>>>>>>>>> extension of your application's heap space for structured > >> objects > >>>>>>>>>>> and > >>>>>>>>>>>>>>> efficiently querying that collection to get back subsets of > >> the > >>>>>>>>>>> data. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Otherwise it seems more like a slightly upgraded file > >> system plus > >>>>>>>>>>> a fancy > >>>>>>>>>>>>>>> grep/find like feature... > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> The best way I see to unlock more features/power is to a > >> move > >>>>>>>>>>> towards a > >>>>>>>>>>>>>>> more granular and efficient way to store and retrieve the > >> scalar > >>>>>>>>>>> values... 
> >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> For example, hears a crazy thought: > >>>>>>>>>>>>>>> Map every distinct occurence of a key/value instance > >> through a > >>>>>>>>>>> crypto hash > >>>>>>>>>>>>>>> function to get a set of hashes. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> These can be be precomputed by Couch without any lookups in > >> FDB. > >>>>>>>>>>> These > >>>>>>>>>>>>>>> will be spread all over kingdom come in FDB and not lend > >>>>>>>>>>> themselves to > >>>>>>>>>>>>>>> range search well. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> So what you do is index them for frequency of occurring in > >> the > >>>>>>>>>>> same set. > >>>>>>>>>>>>>>> In essence, you 'bucket them' statistically, and that > >> bucket id > >>>>>>>>>>> becomes a > >>>>>>>>>>>>>>> key prefix. A crypto hash value can be copied into more > >> than one > >>>>>>>>>>> bucket. > >>>>>>>>>>>>>>> The {bucket_id}/{cryptohash} becomes a {val_id} > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> When writing a document, Couch submits the list/array of > >>>>>>>>>>> cryptohash values > >>>>>>>>>>>>>>> it computed to FDB and gets back the corresponding > >> {val_id} (the > >>>>>>>>>>> id with > >>>>>>>>>>>>>>> the bucket prefixed). This can get somewhat expensive if > >> there's > >>>>>>>>>>> always a > >>>>>>>>>>>>>>> lot of app local cache misses. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> A document's value is then a series of {val_id} arrays up > >> to 100k > >>>>>>>>>>> per > >>>>>>>>>>>>>>> segment. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> When retrieving a document, you get the val_ids, find the > >> distinct > >>>>>>>>>>> buckets > >>>>>>>>>>>>>>> and min/max entries for this doc, and then parallel query > >> each > >>>>>>>>>>> bucket while > >>>>>>>>>>>>>>> reconstructing the document. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> The values returned from the buckets query are the key/value > >>>>>>>>>>> strings > >>>>>>>>>>>>>>> required to reassemble this document. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> ---------- > >>>>>>>>>>>>>>> I put this forward primarily to hilite the idea that trying > >> to > >>>>>>>>>>> match the > >>>>>>>>>>>>>>> storage representation of documents in a straight forward > >> way to > >>>>>>>>>>> FDB keys > >>>>>>>>>>>>>>> to reduce query count might not be the most performance > >> oriented > >>>>>>>>>>> approach. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> I'd much prefer a storage approach that reduced data > >> duplication > >>>>>>>>>>> and > >>>>>>>>>>>>>>> enabled fast sub-document queries. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> This clearly falls in the realm of what people want the > >> 'use case' > >>>>>>>>>>> of Couch > >>>>>>>>>>>>>>> to be/become. By giving Couch more access to sub-document > >>>>>>>>>>> queries, I could > >>>>>>>>>>>>>>> eventually see queries as complicated as GraphQL submitted > >> to > >>>>>>>>>>> Couch and > >>>>>>>>>>>>>>> pulling back ad-hoc aggregated data across multiple > >> documents in a > >>>>>>>>>>> single > >>>>>>>>>>>>>>> application layer request. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Hehe - one way to look at the database of Couch documents > >> is that > >>>>>>>>>>> they are > >>>>>>>>>>>>>>> all conflict revisions of the single root empty document. > >> What I > >>>>>>>>>>> mean be > >>>>>>>>>>>>>>> this is consider thinking of the entire document store as > >> one > >>>>>>>>>>> giant DAG of > >>>>>>>>>>>>>>> key/value pairs. 
How even separate documents are still > >> typically > >>>>>>>>>>> related to > >>>>>>>>>>>>>>> each other. For most applications there is a tremendous > >> amount of > >>>>>>>>>>> data > >>>>>>>>>>>>>>> redundancy between docs and especially between revisions of > >> those > >>>>>>>>>>> docs... > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> And all this is a long way of saying "I think there could > >> be a lot > >>>>>>>>>>> of value > >>>>>>>>>>>>>>> in assuming documents are 'assembled' from multiple queries > >> to > >>>>>>>>>>> FDB, with > >>>>>>>>>>>>>>> local caching, instead of simply retrieved" > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Thanks, I hope I'm not the only outlier here thinking this > >> way!? > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Mike :-) > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>> > >>>>>>> > >>>> > >>> > >> > >