Would it be too much work to prototype both and check CRUD timings for each across a small variety of documents?
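Something like the sketch below might be enough to get comparable numbers. It assumes the Python fdb bindings and hides the model under test behind write_doc/read_doc callbacks (illustrative names, not a real API); the blob pair shown is the "uninterpreted binary" model, and the exploded-path model would slot in the same way:

    import json
    import time

    import fdb

    fdb.api_version(600)
    db = fdb.open()
    space = fdb.Subspace(('proto',))

    @fdb.transactional
    def write_blob(tr, doc_id, doc):
        # one opaque value per doc; chunking for >100KB bodies elided
        tr[space.pack((doc_id,))] = json.dumps(doc).encode()

    @fdb.transactional
    def read_blob(tr, doc_id):
        # force the lazy future to bytes; JSON decoding would go here too
        return bytes(tr[space.pack((doc_id,))])

    def bench(write_doc, read_doc, docs, n=1000):
        # C and U are both write_doc; D would be a tr.clear_range of the prefix
        t0 = time.monotonic()
        for i in range(n):
            write_doc(db, 'doc-%d' % i, docs[i % len(docs)])
        t1 = time.monotonic()
        for i in range(n):
            read_doc(db, 'doc-%d' % i)
        t2 = time.monotonic()
        print('write %.2f ms/op, read %.2f ms/op'
              % (1000 * (t1 - t0) / n, 1000 * (t2 - t1) / n))

Running bench() once per storage model over a few representative documents (small, large, deep, wide) would give the rough CRUD timings I mean.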
-Joan

----- Original Message -----
> From: "Paul Davis" <paul.joseph.da...@gmail.com>
> To: dev@couchdb.apache.org
> Sent: Tuesday, February 19, 2019 5:41:23 PM
> Subject: Re: [DISCUSS] : things we need to solve/decide : storing JSON documents
>
> A simple doc storage version number would likely be enough for future us
> to do fancier things.
>
> On Tue, Feb 19, 2019 at 4:16 PM Benjamin Anderson <banjie...@apache.org> wrote:
> >
> > > I don’t think adding a layer of abstraction is the right move just yet,
> > > I think we should continue to find consensus on one answer to this
> > > question
> >
> > Agree that the theorycrafting stage is not optimal for making
> > abstraction decisions, but I suspect it would be worthwhile somewhere
> > between prototyping and releasing. Adam's proposal does seem to me the
> > most appealing approach on the surface, and I don't see anyone signing
> > up to do the work to deliver an alternative concurrently.
> >
> > --
> > ba
> >
> > On Tue, Feb 19, 2019 at 1:43 PM Robert Samuel Newson <rnew...@apache.org> wrote:
> > >
> > > Addendum: By “directory aliasing” I meant within a document (either
> > > the actual Directory thing or something equivalent of our own making).
> > > The directory aliasing for each database is a good way to reduce key
> > > size without a significant cost. Though if Redwood lands in time, even
> > > this would become an inutile obfuscation.
> > >
> > > > On 19 Feb 2019, at 21:39, Robert Samuel Newson <rnew...@apache.org> wrote:
> > > >
> > > > Interesting suggestion; obviously the details might be the wrong
> > > > kind of fun.
> > > >
> > > > Somewhere above I suggested this would be something we could change
> > > > over time and even use different approaches for different documents
> > > > within the same database. This is the long way of saying there are
> > > > multiple ways to do this, each with advantages and none without
> > > > disadvantages.
> > > >
> > > > I don’t think adding a layer of abstraction is the right move just
> > > > yet; I think we should continue to find consensus on one answer to
> > > > this question (and the related ones in other threads) for the first
> > > > release. It’s easy to say “we can change it later”, of course. We
> > > > can, though it would be a chunk of work in the context of something
> > > > that already works; I’ve rarely seen anyone sign up for that.
> > > >
> > > > I’m fine with the first proposal from Adam, where the keys are
> > > > tuples of key parts pointing at terminal values. To make it easier
> > > > for the first version, I would exclude optimisations like
> > > > deduplication or the Directory aliasing or the schema thing that I
> > > > suggested and that Ilya incorporated a variant of in a follow-up
> > > > post. We’d accept that there are limits on the sizes of documents,
> > > > including the awkward-to-express one about property depth.
> > > >
> > > > Stepping back, I’m not seeing any essential improvement over Adam’s
> > > > original proposal besides the few corrections and clarifications
> > > > made by various authors. Could we start an RFC based on Adam’s
> > > > original proposal on document body, revision tree and index storage?
> > > > We could then have PRs against that for each additional optimisation
> > > > (one person’s optimisation is another person’s needless
> > > > complication)?
> > > >
> > > > If I’ve missed some genuine advance on the original proposal in
> > > > this long thread, please call it out for me.
> > > >
> > > > B.
> > > >
> > > >> On 19 Feb 2019, at 21:15, Benjamin Anderson <banjie...@apache.org> wrote:
> > > >>
> > > >> As is evident from the length of this thread, there's a pretty big
> > > >> design space to cover here, and it seems unlikely we'll have
> > > >> arrived at a "correct" solution even by the time this thing ships.
> > > >> Perhaps it would be worthwhile to treat the in-FDB representation
> > > >> of data as a first-class abstraction and support multiple
> > > >> representations simultaneously?
> > > >>
> > > >> Obviously there's no such thing as a zero-cost abstraction - and
> > > >> I've not thought very hard about how far up the stack the document
> > > >> representation would need to leak - but supporting different
> > > >> layouts (primarily, as Adam points out, on the document body
> > > >> itself) might prove interesting and useful. I'm sure there are
> > > >> folks interested in a column-shaped CouchDB, for example.
> > > >>
> > > >> --
> > > >> b
> > > >>
> > > >> On Tue, Feb 19, 2019 at 11:39 AM Robert Newson <rnew...@apache.org> wrote:
> > > >>>
> > > >>> Good points on revtree, I agree with you we should store that
> > > >>> intelligently to gain the benefits you mentioned.
> > > >>>
> > > >>> --
> > > >>> Robert Samuel Newson
> > > >>> rnew...@apache.org
> > > >>>
> > > >>> On Tue, 19 Feb 2019, at 18:41, Adam Kocoloski wrote:
> > > >>>> I do not think we should store the revtree as a blob. The design
> > > >>>> where each edit branch is its own KV should save on network IO
> > > >>>> and CPU cycles for normal updates. We’ve performed too many
> > > >>>> heroics to keep couch_key_tree from stalling entire databases
> > > >>>> when trying to update a single document with a wide revision
> > > >>>> tree; I would much prefer to ignore other edit branches entirely
> > > >>>> when all we’re doing is extending one of them.
> > > >>>>
> > > >>>> I also do not think we should store JSON documents as blobs, but
> > > >>>> it’s a closer call. Some of my reasoning for preferring the
> > > >>>> exploded path design:
> > > >>>>
> > > >>>> - it lends itself nicely to sub-document operations, for which
> > > >>>> Jan crafted an RFC last year:
> > > >>>> https://github.com/apache/couchdb/issues/1559
> > > >>>> - it optimizes the creation of Mango indexes on existing
> > > >>>> databases since we only need to retrieve the value(s) we want
> > > >>>> to index
> > > >>>> - it optimizes Mango queries that use field selectors
> > > >>>> - anyone who wanted to try their hand at GraphQL will find it
> > > >>>> very handy: https://github.com/apache/couchdb/issues/1499
> > > >>>> - looking further ahead, it lets us play with smarter leaf value
> > > >>>> types like Counters (yes I’m still on the CRDT bandwagon, sorry)
> > > >>>>
> > > >>>> A few comments on the thread:
> > > >>>>
> > > >>>>>>> * Most document bodies are probably going to be smaller than
> > > >>>>>>> 100k. So in the majority of cases it would be one write / one
> > > >>>>>>> read to update and fetch the document body.
> > > >>>>
> > > >>>> We should test, but I expect reading 50KB of data in a range
> > > >>>> query is almost as efficient as reading a single 50KB value.
> > > >>>> Similarly, writes to a contiguous set of keys should be quite
> > > >>>> efficient.
> > > >>>>
> > > >>>> I am concerned about the overhead of the repeated field paths in
> > > >>>> the keys with the exploded path option in the absence of key
> > > >>>> prefix compression. That would be my main reason to acquiesce
> > > >>>> and throw away all the document structure.
> > > >>>>
> > > >>>> Adam
> > > >>>>
> > > >>>>> On Feb 19, 2019, at 12:04 PM, Robert Newson <rnew...@apache.org> wrote:
> > > >>>>>
> > > >>>>> I like the idea that we'd reuse the same pattern (but perhaps
> > > >>>>> not the same _code_) for doc bodies, revtree and attachments.
> > > >>>>>
> > > >>>>> I hope we still get to delete couch_key_tree.erl, though.
> > > >>>>>
> > > >>>>> --
> > > >>>>> Robert Samuel Newson
> > > >>>>> rnew...@apache.org
> > > >>>>>
> > > >>>>> On Tue, 19 Feb 2019, at 17:03, Jan Lehnardt wrote:
> > > >>>>>> I like the idea from a “trying a simple thing first”
> > > >>>>>> perspective, but Nick’s points below are especially convincing
> > > >>>>>> reasons to go with this for now.
> > > >>>>>>
> > > >>>>>> Best
> > > >>>>>> Jan
> > > >>>>>> —
> > > >>>>>>
> > > >>>>>>> On 19. Feb 2019, at 17:53, Nick Vatamaniuc <vatam...@gmail.com> wrote:
> > > >>>>>>>
> > > >>>>>>> Hi,
> > > >>>>>>>
> > > >>>>>>> Sorry for jumping in so late, I was following from the
> > > >>>>>>> sidelines mostly. A lot of good discussion happening and I am
> > > >>>>>>> excited about the possibilities here.
> > > >>>>>>>
> > > >>>>>>> I do like the simpler "chunking" approach for a few reasons:
> > > >>>>>>>
> > > >>>>>>> * Most document bodies are probably going to be smaller than
> > > >>>>>>> 100k. So in the majority of cases it would be one write / one
> > > >>>>>>> read to update and fetch the document body.
> > > >>>>>>>
> > > >>>>>>> * We could reuse the chunking code for attachment handling and
> > > >>>>>>> possibly revision key trees. So it's the general pattern of
> > > >>>>>>> uploading chunks to some prefix, and when finished flipping an
> > > >>>>>>> atomic toggle to make it current (see the sketch after this
> > > >>>>>>> list).
> > > >>>>>>>
> > > >>>>>>> * Do the same thing with revision trees and we could re-use
> > > >>>>>>> the revision tree manipulation logic. That is, the key tree in
> > > >>>>>>> most cases would be small enough to fit in 100k, but if it
> > > >>>>>>> gets huge, it would get chunked. This would allow us to reuse
> > > >>>>>>> all the battle-tested couch_key_tree code mostly as is. We
> > > >>>>>>> even have property tests for it:
> > > >>>>>>> https://github.com/apache/couchdb/blob/master/src/couch/test/couch_key_tree_prop_tests.erl
> > > >>>>>>>
> > > >>>>>>> * It removes the need to explain the max exploded path length
> > > >>>>>>> limitation to customers.
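> > > >>>>>>>
> > > >>>>>>> To make the toggle idea concrete, a rough sketch with the
> > > >>>>>>> Python bindings (the subspace name and chunk size are
> > > >>>>>>> illustrative, and revs are plain bytes here):
> > > >>>>>>>
> > > >>>>>>>     import fdb
> > > >>>>>>>
> > > >>>>>>>     fdb.api_version(600)
> > > >>>>>>>     db = fdb.open()
> > > >>>>>>>     by_id = fdb.Subspace(('by_id',))
> > > >>>>>>>     CHUNK = 100000  # FDB's value size limit
> > > >>>>>>>
> > > >>>>>>>     @fdb.transactional
> > > >>>>>>>     def write_doc(tr, doc_id, rev, body):
> > > >>>>>>>         # upload the body in <=100k chunks under a (doc, rev) prefix...
> > > >>>>>>>         for i in range(0, len(body), CHUNK):
> > > >>>>>>>             tr[by_id.pack((doc_id, rev, i // CHUNK))] = body[i:i + CHUNK]
> > > >>>>>>>         # ...then flip the atomic toggle: point the doc at the new rev
> > > >>>>>>>         tr[by_id.pack((doc_id, 'current'))] = rev
> > > >>>>>>>
> > > >>>>>>>     @fdb.transactional
> > > >>>>>>>     def read_doc(tr, doc_id):
> > > >>>>>>>         rev = bytes(tr[by_id.pack((doc_id, 'current'))])
> > > >>>>>>>         r = by_id.range((doc_id, rev))
> > > >>>>>>>         return b''.join(v for _, v in tr[r])
> > > >>>>>>>
> > > >>>>>>> The same shape would cover attachments and a chunked
> > > >>>>>>> couch_key_tree blob; cleanup of superseded revs' chunks is
> > > >>>>>>> elided.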
> > > >>>>>>>
> > > >>>>>>> Cheers,
> > > >>>>>>> -Nick
> > > >>>>>>>
> > > >>>>>>> On Tue, Feb 19, 2019 at 11:18 AM Robert Newson <rnew...@apache.org> wrote:
> > > >>>>>>>
> > > >>>>>>>> Hi,
> > > >>>>>>>>
> > > >>>>>>>> An alternative storage model that we should seriously
> > > >>>>>>>> consider is to follow our current approach in couch_file et
> > > >>>>>>>> al. Specifically, that the document _body_ is stored as an
> > > >>>>>>>> uninterpreted binary value. This would be much like the
> > > >>>>>>>> obvious plan for attachment storage; a key prefix that
> > > >>>>>>>> identifies the database and document, with the final item of
> > > >>>>>>>> that key tuple being an incrementing integer. Each of those
> > > >>>>>>>> keys has a binary value of up to 100k. Fetching all values
> > > >>>>>>>> with that key prefix, in fdb's natural ordering, will yield
> > > >>>>>>>> the full document body, which can be JSON decoded for further
> > > >>>>>>>> processing.
> > > >>>>>>>>
> > > >>>>>>>> I like this idea, and I like Adam's original proposal to
> > > >>>>>>>> explode documents into property paths. I have a slight
> > > >>>>>>>> preference for the simplicity of the idea in the previous
> > > >>>>>>>> paragraph, not least because it's close to what we do today.
> > > >>>>>>>> I also think it will be possible to migrate to alternative
> > > >>>>>>>> storage models in future, and foundationdb's transaction
> > > >>>>>>>> support means we can do this migration seamlessly should we
> > > >>>>>>>> come to it.
> > > >>>>>>>>
> > > >>>>>>>> I'm very interested in knowing if anyone else is interested
> > > >>>>>>>> in going this simple, or considers it a wasted opportunity
> > > >>>>>>>> relative to the 'exploded' path.
> > > >>>>>>>>
> > > >>>>>>>> B.
> > > >>>>>>>>
> > > >>>>>>>> --
> > > >>>>>>>> Robert Samuel Newson
> > > >>>>>>>> rnew...@apache.org
> > > >>>>>>>>
> > > >>>>>>>> On Mon, 4 Feb 2019, at 19:59, Robert Newson wrote:
> > > >>>>>>>>> I've been remiss here in not posting the data model ideas
> > > >>>>>>>>> that IBM worked up while we were thinking about using
> > > >>>>>>>>> FoundationDB, so I'm posting it now. This is Adam
> > > >>>>>>>>> Kocoloski's original work, I am just transcribing it, and
> > > >>>>>>>>> this is the context that the folks from the IBM side came in
> > > >>>>>>>>> with, for full disclosure.
> > > >>>>>>>>>
> > > >>>>>>>>> Basics
> > > >>>>>>>>>
> > > >>>>>>>>> 1. All CouchDB databases are inside a Directory
> > > >>>>>>>>> 2. Each CouchDB database is a Directory within that Directory
> > > >>>>>>>>> 3. It's possible to list all subdirectories of a Directory,
> > > >>>>>>>>> so `_all_dbs` is the list of directories from 1.
> > > >>>>>>>>> 4. Each Directory representing a CouchDB database has
> > > >>>>>>>>> several Subspaces:
> > > >>>>>>>>> 4a. by_id/ doc subspace: actual document contents
> > > >>>>>>>>> 4b. by_seq/versionstamp subspace: for the _changes feed
> > > >>>>>>>>> 4c. index_definitions, indexes, ...
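> > > >>>>>>>>>
> > > >>>>>>>>> With the Python bindings' directory layer, that layout might
> > > >>>>>>>>> look roughly like this (a sketch; the names are illustrative):
> > > >>>>>>>>>
> > > >>>>>>>>>     import fdb
> > > >>>>>>>>>
> > > >>>>>>>>>     fdb.api_version(600)
> > > >>>>>>>>>     db = fdb.open()
> > > >>>>>>>>>
> > > >>>>>>>>>     # 1 & 2: a root Directory with one subdirectory per database
> > > >>>>>>>>>     couch = fdb.directory.create_or_open(db, ('couchdb',))
> > > >>>>>>>>>     mydb = couch.create_or_open(db, ('mydb',))
> > > >>>>>>>>>
> > > >>>>>>>>>     # 3: _all_dbs is just a directory listing
> > > >>>>>>>>>     all_dbs = couch.list(db)
> > > >>>>>>>>>
> > > >>>>>>>>>     # 4: subspaces hanging off each database's directory
> > > >>>>>>>>>     by_id = mydb['by_id']                    # 4a: document contents
> > > >>>>>>>>>     by_seq = mydb['by_seq']                  # 4b: _changes feed
> > > >>>>>>>>>     index_defs = mydb['index_definitions']   # 4c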
> > > >>>>>>>>>
> > > >>>>>>>>> JSON Mapping
> > > >>>>>>>>>
> > > >>>>>>>>> A hierarchical JSON object naturally maps to multiple KV
> > > >>>>>>>>> pairs in FDB:
> > > >>>>>>>>>
> > > >>>>>>>>> {
> > > >>>>>>>>>   “_id”: “foo”,
> > > >>>>>>>>>   “owner”: “bob”,
> > > >>>>>>>>>   “mylist”: [1,3,5],
> > > >>>>>>>>>   “mymap”: {
> > > >>>>>>>>>     “blue”: “#0000FF”,
> > > >>>>>>>>>     “red”: “#FF0000”
> > > >>>>>>>>>   }
> > > >>>>>>>>> }
> > > >>>>>>>>>
> > > >>>>>>>>> maps to
> > > >>>>>>>>>
> > > >>>>>>>>> (“foo”, “owner”) = “bob”
> > > >>>>>>>>> (“foo”, “mylist”, 0) = 1
> > > >>>>>>>>> (“foo”, “mylist”, 1) = 3
> > > >>>>>>>>> (“foo”, “mylist”, 2) = 5
> > > >>>>>>>>> (“foo”, “mymap”, “blue”) = “#0000FF”
> > > >>>>>>>>> (“foo”, “mymap”, “red”) = “#FF0000”
> > > >>>>>>>>>
> > > >>>>>>>>> NB: this means that the 100KB limit applies to individual
> > > >>>>>>>>> leaves in the JSON object, not the entire doc. (A code
> > > >>>>>>>>> sketch of this flattening follows the Edit Conflicts
> > > >>>>>>>>> examples below.)
> > > >>>>>>>>>
> > > >>>>>>>>> Edit Conflicts
> > > >>>>>>>>>
> > > >>>>>>>>> We need to account for the presence of conflicts in various
> > > >>>>>>>>> levels of the doc due to replication.
> > > >>>>>>>>>
> > > >>>>>>>>> The proposal is to create a special value indicating that
> > > >>>>>>>>> the subtree below our current cursor position is in an
> > > >>>>>>>>> unresolvable conflict, then add additional KV pairs below to
> > > >>>>>>>>> describe the conflicting entries.
> > > >>>>>>>>>
> > > >>>>>>>>> The KV data model allows us to store these efficiently and
> > > >>>>>>>>> minimize duplication of data.
> > > >>>>>>>>>
> > > >>>>>>>>> A document with these two conflicts:
> > > >>>>>>>>>
> > > >>>>>>>>> {
> > > >>>>>>>>>   “_id”: “foo”,
> > > >>>>>>>>>   “_rev”: “1-abc”,
> > > >>>>>>>>>   “owner”: “alice”,
> > > >>>>>>>>>   “active”: true
> > > >>>>>>>>> }
> > > >>>>>>>>> {
> > > >>>>>>>>>   “_id”: “foo”,
> > > >>>>>>>>>   “_rev”: “1-def”,
> > > >>>>>>>>>   “owner”: “bob”,
> > > >>>>>>>>>   “active”: true
> > > >>>>>>>>> }
> > > >>>>>>>>>
> > > >>>>>>>>> could be stored thus:
> > > >>>>>>>>>
> > > >>>>>>>>> (“foo”, “active”) = true
> > > >>>>>>>>> (“foo”, “owner”) = kCONFLICT
> > > >>>>>>>>> (“foo”, “owner”, “1-abc”) = “alice”
> > > >>>>>>>>> (“foo”, “owner”, “1-def”) = “bob”
> > > >>>>>>>>>
> > > >>>>>>>>> So long as `kCONFLICT` is set at the top of the conflicting
> > > >>>>>>>>> subtree, this representation can handle conflicts of
> > > >>>>>>>>> different data types as well.
> > > >>>>>>>>>
> > > >>>>>>>>> Missing fields need to be handled explicitly:
> > > >>>>>>>>>
> > > >>>>>>>>> {
> > > >>>>>>>>>   “_id”: “foo”,
> > > >>>>>>>>>   “_rev”: “1-abc”,
> > > >>>>>>>>>   “owner”: “alice”,
> > > >>>>>>>>>   “active”: true
> > > >>>>>>>>> }
> > > >>>>>>>>>
> > > >>>>>>>>> {
> > > >>>>>>>>>   “_id”: “foo”,
> > > >>>>>>>>>   “_rev”: “1-def”,
> > > >>>>>>>>>   “owner”: {
> > > >>>>>>>>>     “name”: “bob”,
> > > >>>>>>>>>     “email”: “b...@example.com”
> > > >>>>>>>>>   }
> > > >>>>>>>>> }
> > > >>>>>>>>>
> > > >>>>>>>>> could be stored thus:
> > > >>>>>>>>>
> > > >>>>>>>>> (“foo”, “active”) = kCONFLICT
> > > >>>>>>>>> (“foo”, “active”, “1-abc”) = true
> > > >>>>>>>>> (“foo”, “active”, “1-def”) = kMISSING
> > > >>>>>>>>> (“foo”, “owner”) = kCONFLICT
> > > >>>>>>>>> (“foo”, “owner”, “1-abc”) = “alice”
> > > >>>>>>>>> (“foo”, “owner”, “1-def”, “name”) = “bob”
> > > >>>>>>>>> (“foo”, “owner”, “1-def”, “email”) = ...
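> > > >>>>>>>>>
> > > >>>>>>>>> Mechanically, the flattening described under JSON Mapping
> > > >>>>>>>>> might look like the sketch below (Python; the kCONFLICT /
> > > >>>>>>>>> kMISSING markers, the _id/_rev bookkeeping, and empty
> > > >>>>>>>>> containers are all elided):
> > > >>>>>>>>>
> > > >>>>>>>>>     import fdb
> > > >>>>>>>>>     import fdb.tuple
> > > >>>>>>>>>
> > > >>>>>>>>>     fdb.api_version(600)
> > > >>>>>>>>>
> > > >>>>>>>>>     def explode(doc_id, node, path=()):
> > > >>>>>>>>>         # yield one ((doc_id, path...), scalar) pair per leaf
> > > >>>>>>>>>         if isinstance(node, dict):
> > > >>>>>>>>>             for k, v in node.items():
> > > >>>>>>>>>                 for kv in explode(doc_id, v, path + (k,)):
> > > >>>>>>>>>                     yield kv
> > > >>>>>>>>>         elif isinstance(node, list):
> > > >>>>>>>>>             for i, v in enumerate(node):
> > > >>>>>>>>>                 for kv in explode(doc_id, v, path + (i,)):
> > > >>>>>>>>>                     yield kv
> > > >>>>>>>>>         else:
> > > >>>>>>>>>             yield (doc_id,) + path, node
> > > >>>>>>>>>
> > > >>>>>>>>>     @fdb.transactional
> > > >>>>>>>>>     def write_exploded(tr, by_id, doc_id, doc):
> > > >>>>>>>>>         # clear-and-rewrite keeps the sketch simple; a real
> > > >>>>>>>>>         # update could touch only the keys that changed
> > > >>>>>>>>>         tr.clear_range_startswith(by_id.pack((doc_id,)))
> > > >>>>>>>>>         for key, value in explode(doc_id, doc):
> > > >>>>>>>>>             tr[by_id.pack(key)] = fdb.tuple.pack((value,))
> > > >>>>>>>>>
> > > >>>>>>>>> Reading the whole doc back is one range read over the
> > > >>>>>>>>> (doc_id,) prefix; reading a single field is a point read or
> > > >>>>>>>>> a narrower range.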
> > > >>>>>>>>>
> > > >>>>>>>>> Revision Metadata
> > > >>>>>>>>>
> > > >>>>>>>>> * CouchDB uses a hash history for revisions
> > > >>>>>>>>> ** Each edit is identified by the hash of the content of the
> > > >>>>>>>>> edit including the base revision against which it was applied
> > > >>>>>>>>> ** Individual edit branches are bounded in length but the
> > > >>>>>>>>> number of branches is potentially unbounded
> > > >>>>>>>>>
> > > >>>>>>>>> * Size limits preclude us from storing the entire key tree
> > > >>>>>>>>> as a single value; in pathological situations the tree could
> > > >>>>>>>>> exceed 100KB (each entry is > 16 bytes)
> > > >>>>>>>>>
> > > >>>>>>>>> * Store each edit branch as a separate KV including deleted
> > > >>>>>>>>> status in a special subspace
> > > >>>>>>>>>
> > > >>>>>>>>> * Structure key representation so that “winning” revision
> > > >>>>>>>>> can be automatically retrieved in a limit=1 key range
> > > >>>>>>>>> operation
> > > >>>>>>>>>
> > > >>>>>>>>> (“foo”, “_meta”, “deleted=false”, 1, “def”) = []
> > > >>>>>>>>> (“foo”, “_meta”, “deleted=false”, 4, “bif”) = [“3-baz”,”2-bar”,”1-foo”] <-- winner
> > > >>>>>>>>> (“foo”, “_meta”, “deleted=true”, 3, “abc”) = [“2-bar”, “1-foo”]
> > > >>>>>>>>>
> > > >>>>>>>>> Changes Feed
> > > >>>>>>>>>
> > > >>>>>>>>> * FDB supports a concept called a versionstamp — a 10 byte,
> > > >>>>>>>>> unique, monotonically (but not sequentially) increasing
> > > >>>>>>>>> value for each committed transaction. The first 8 bytes are
> > > >>>>>>>>> the committed version of the database. The last 2 bytes are
> > > >>>>>>>>> monotonic in the serialization order for transactions.
> > > >>>>>>>>>
> > > >>>>>>>>> * A transaction can specify a particular index into a key
> > > >>>>>>>>> where the following 10 bytes will be overwritten by the
> > > >>>>>>>>> versionstamp at commit time
> > > >>>>>>>>>
> > > >>>>>>>>> * A subspace keyed on versionstamp naturally yields a
> > > >>>>>>>>> _changes feed
> > > >>>>>>>>>
> > > >>>>>>>>> by_seq subspace
> > > >>>>>>>>> (“versionstamp1”) = (“foo”, “1-abc”)
> > > >>>>>>>>> (“versionstamp4”) = (“bar”, “4-def”)
> > > >>>>>>>>>
> > > >>>>>>>>> by_id subspace
> > > >>>>>>>>> (“bar”, “_vsn”) = “versionstamp4”
> > > >>>>>>>>> ...
> > > >>>>>>>>> (“foo”, “_vsn”) = “versionstamp1”
> > > >>>>>>>>>
> > > >>>>>>>>> JSON Indexes
> > > >>>>>>>>>
> > > >>>>>>>>> * “Mango” JSON indexes are defined by
> > > >>>>>>>>> ** a list of field names, each of which may be nested,
> > > >>>>>>>>> ** an optional partial_filter_selector which constrains the
> > > >>>>>>>>> set of docs that contribute
> > > >>>>>>>>> ** an optional name defined by the ddoc field (the name is
> > > >>>>>>>>> auto-generated if not supplied)
> > > >>>>>>>>>
> > > >>>>>>>>> * Store index definitions in a single subspace to aid query
> > > >>>>>>>>> planning
> > > >>>>>>>>> ** ((person,name), title, email) = (“name-title-email”,
> > > >>>>>>>>> “{“student”: true}”)
> > > >>>>>>>>> ** Store the values for each index in a dedicated subspace,
> > > >>>>>>>>> adding the document ID as the last element in the tuple
> > > >>>>>>>>> *** (“rosie revere”, “engineer”, “ro...@example.com”, “foo”) = null
> > > >>>>>>>>>
> > > >>>>>>>>> B.
> > > >>>>>>>>>
> > > >>>>>>>>> --
> > > >>>>>>>>> Robert Samuel Newson
> > > >>>>>>>>> rnew...@apache.org
> > > >>>>>>>>>
> > > >>>>>>>>> On Mon, 4 Feb 2019, at 19:13, Ilya Khlopotov wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>> I want to fix previous mistakes. I made two mistakes in the
> > > >>>>>>>>>> previous calculations:
> > > >>>>>>>>>> - I used 1Kb as the base size for calculating the expansion
> > > >>>>>>>>>> factor (although we don't know the exact size of the
> > > >>>>>>>>>> original document)
> > > >>>>>>>>>> - The expansion factor calculation included the number of
> > > >>>>>>>>>> revisions (it shouldn't)
> > > >>>>>>>>>>
> > > >>>>>>>>>> I'll focus on the flattened JSON docs model.
> > > >>>>>>>>>>
> > > >>>>>>>>>> The following formula was used in the previous calculation:
> > > >>>>>>>>>>
> > > >>>>>>>>>> storage_size_per_document =
> > > >>>>>>>>>>   mapping_table_size*number_of_revisions +
> > > >>>>>>>>>>   depth*number_of_paths*number_of_revisions +
> > > >>>>>>>>>>   number_of_paths*value_size*number_of_revisions
> > > >>>>>>>>>>
> > > >>>>>>>>>> To clarify things a little bit, I want to calculate the
> > > >>>>>>>>>> space requirement for a single revision this time.
> > > >>>>>>>>>>
> > > >>>>>>>>>> mapping_table_size =
> > > >>>>>>>>>>   number_of_field_names * (field_name_length + 4 (integer
> > > >>>>>>>>>>   size)) = 100 * (20 + 4) = 2400 bytes
> > > >>>>>>>>>>
> > > >>>>>>>>>> storage_size_per_document_per_revision_per_replica =
> > > >>>>>>>>>>   mapping_table_size + depth*number_of_paths +
> > > >>>>>>>>>>   value_size*number_of_paths =
> > > >>>>>>>>>>   2400 bytes + 10*1000 + 100*1000 = 112400 bytes ~= 110 Kb
> > > >>>>>>>>>>
> > > >>>>>>>>>> We definitely can reduce the requirement for the mapping
> > > >>>>>>>>>> table by adopting rnewson's idea of a schema.
> > > >>>>>>>>>>
> > > >>>>>>>>>> On 2019/02/04 11:08:16, Ilya Khlopotov <iil...@apache.org> wrote:
> > > >>>>>>>>>>> Hi Michael,
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>> For example, here's a crazy thought:
> > > >>>>>>>>>>>> Map every distinct occurrence of a key/value instance
> > > >>>>>>>>>>>> through a crypto hash function to get a set of hashes.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> These can be precomputed by Couch without any lookups in
> > > >>>>>>>>>>>> FDB. These will be spread all over kingdom come in FDB
> > > >>>>>>>>>>>> and not lend themselves to range search well.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> So what you do is index them for frequency of occurring
> > > >>>>>>>>>>>> in the same set. In essence, you 'bucket them'
> > > >>>>>>>>>>>> statistically, and that bucket id becomes a key prefix.
> > > >>>>>>>>>>>> A crypto hash value can be copied into more than one
> > > >>>>>>>>>>>> bucket. The {bucket_id}/{cryptohash} becomes a {val_id}
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> When writing a document, Couch submits the list/array of
> > > >>>>>>>>>>>> cryptohash values it computed to FDB and gets back the
> > > >>>>>>>>>>>> corresponding {val_id} (the id with the bucket prefixed).
> > > >>>>>>>>>>>> This can get somewhat expensive if there's always a lot
> > > >>>>>>>>>>>> of app local cache misses.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> A document's value is then a series of {val_id} arrays
> > > >>>>>>>>>>>> up to 100k per segment.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> When retrieving a document, you get the val_ids, find the
> > > >>>>>>>>>>>> distinct buckets and min/max entries for this doc, and
> > > >>>>>>>>>>>> then parallel query each bucket while reconstructing the
> > > >>>>>>>>>>>> document.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Interesting idea. Let's try to think it through to see if
> > > >>>>>>>>>>> we can make it viable.
> > > >>>>>>>>>>> Let's go through a hypothetical example. Input data for
> > > >>>>>>>>>>> the example:
> > > >>>>>>>>>>> - 1M of documents
> > > >>>>>>>>>>> - each document is around 10Kb
> > > >>>>>>>>>>> - each document consists of 1K of unique JSON paths
> > > >>>>>>>>>>> - each document has 100 unique JSON field names
> > > >>>>>>>>>>> - every scalar value is 100 bytes
> > > >>>>>>>>>>> - 10% of the unique JSON paths for every document are
> > > >>>>>>>>>>> already stored in the database under a different doc or a
> > > >>>>>>>>>>> different revision of the current one
> > > >>>>>>>>>>> - we assume 3 independent copies for every key-value pair
> > > >>>>>>>>>>> in FDB
> > > >>>>>>>>>>> - our hash key size is 32 bytes
> > > >>>>>>>>>>> - let's assume we can determine if a key is already on
> > > >>>>>>>>>>> the storage without doing a query
> > > >>>>>>>>>>> - 1% of paths is in cache (an unrealistic value; in real
> > > >>>>>>>>>>> life the percentage is lower)
> > > >>>>>>>>>>> - every JSON field name is 20 bytes
> > > >>>>>>>>>>> - every JSON path is 10 levels deep
> > > >>>>>>>>>>> - document key prefix length is 50
> > > >>>>>>>>>>> - every document has 10 revisions
> > > >>>>>>>>>>> Let's estimate the storage requirements and the size of
> > > >>>>>>>>>>> data we need to transmit. The calculations are not exact.
> > > >>>>>>>>>>> 1. storage_size_per_document (we cannot estimate exact
> > > >>>>>>>>>>> numbers since we don't know how FDB stores it)
> > > >>>>>>>>>>> - 10 * ((10Kb - (10Kb * 10%)) + (1K - (1K * 10%)) * 32
> > > >>>>>>>>>>> bytes) = 38Kb * 10 * 3 = 1140 Kb (11x)
> > > >>>>>>>>>>> 2. number of independent keys to retrieve on document
> > > >>>>>>>>>>> read (non-range queries) per document
> > > >>>>>>>>>>> - 1K - (1K * 1%) = 990
> > > >>>>>>>>>>> 3. number of range queries: 0
> > > >>>>>>>>>>> 4. data to transmit on read: (1K - (1K * 1%)) * (100
> > > >>>>>>>>>>> bytes + 32 bytes) = 102 Kb (10x)
> > > >>>>>>>>>>> 5. read latency (we use 2ms per read based on numbers
> > > >>>>>>>>>>> from https://apple.github.io/foundationdb/performance.html)
> > > >>>>>>>>>>> - sequential: 990*2ms = 1980ms
> > > >>>>>>>>>>> - range: 0
> > > >>>>>>>>>>> Let's compare these numbers with the initial proposal
> > > >>>>>>>>>>> (flattened JSON docs without a global schema and without
> > > >>>>>>>>>>> a cache):
> > > >>>>>>>>>>> 1. storage_size_per_document
> > > >>>>>>>>>>> - mapping table size: 100 * (20 + 4(integer size)) = 2400
> > > >>>>>>>>>>> bytes
> > > >>>>>>>>>>> - key size: (10 * (4 + 1(delimiter))) + 50 = 100 bytes
> > > >>>>>>>>>>> - storage_size_per_document: 2.4K*10 + 100*1K*10 +
> > > >>>>>>>>>>> 1K*100*10 = 2024K = 1976 Kb * 3 = 5930 Kb (59.3x)
> > > >>>>>>>>>>> 2. number of independent keys to retrieve: 0-2 (depending
> > > >>>>>>>>>>> on index structure)
> > > >>>>>>>>>>> 3. number of range queries: 1 (1001 keys in the result)
> > > >>>>>>>>>>> 4. data to transmit on read: 24K + 1000*100 + 1000*100 =
> > > >>>>>>>>>>> 23.6 Kb (2.4x)
> > > >>>>>>>>>>> 5. read latency (we use 2ms per read based on numbers
> > > >>>>>>>>>>> from https://apple.github.io/foundationdb/performance.html
> > > >>>>>>>>>>> and estimate range read performance based on numbers from
> > > >>>>>>>>>>> https://apple.github.io/foundationdb/benchmarking.html#single-core-read-test)
> > > >>>>>>>>>>> - range read performance: Given read performance is about
> > > >>>>>>>>>>> 305,000 reads/second and range performance is 3,600,000
> > > >>>>>>>>>>> keys/second, we estimate range performance to be 11.8x
> > > >>>>>>>>>>> read performance. If read performance is 2ms then range
> > > >>>>>>>>>>> performance is 0.169ms (which is hard to believe).
> > > >>>>>>>>>>> - sequential: 2 * 2 = 4ms
> > > >>>>>>>>>>> - range: 0.169
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> It looks like we are dealing with a tradeoff:
> > > >>>>>>>>>>> - Map every distinct occurrence of a key/value instance
> > > >>>>>>>>>>> through a crypto hash:
> > > >>>>>>>>>>>   - 5.39x more disk space efficient
> > > >>>>>>>>>>>   - 474x slower
> > > >>>>>>>>>>> - flattened JSON model:
> > > >>>>>>>>>>>   - 5.39x less efficient in disk space
> > > >>>>>>>>>>>   - 474x faster
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> In any case this unscientific exercise was very helpful,
> > > >>>>>>>>>>> since it uncovered the high cost in terms of disk space.
> > > >>>>>>>>>>> 59.3x of the original disk size is too much IMO.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Are there any ways we can make Michael's model more
> > > >>>>>>>>>>> performant?
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Also I don't quite understand a few aspects of the global
> > > >>>>>>>>>>> hash table proposal:
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> 1. > - Map every distinct occurrence of a key/value
> > > >>>>>>>>>>> instance through a crypto hash function to get a set of
> > > >>>>>>>>>>> hashes.
> > > >>>>>>>>>>> I think we are talking only about scalar values here? I.e.
> > > >>>>>>>> `"#/foo.bar.baz": 123` > > > >>>>>>>>>>> Since I don't know how we can make it work for all > > > >>>>>>>>>>> possible > > JSON > > > >>>>>>>> paths `{"foo": {"bar": {"size": 12, "baz": 123}}}": > > > >>>>>>>>>>> - foo > > > >>>>>>>>>>> - foo.bar > > > >>>>>>>>>>> - foo.bar.baz > > > >>>>>>>>>>> > > > >>>>>>>>>>> 2. how to delete documents > > > >>>>>>>>>>> > > > >>>>>>>>>>> Best regards, > > > >>>>>>>>>>> ILYA > > > >>>>>>>>>>> > > > >>>>>>>>>>> > > > >>>>>>>>>>> On 2019/01/30 23:33:22, Michael Fair < > > mich...@daclubhouse.net> > > > >>>>>>>> wrote: > > > >>>>>>>>>>>> On Wed, Jan 30, 2019, 12:57 PM Adam Kocoloski < > > kocol...@apache.org > > > >>>>>>>> wrote: > > > >>>>>>>>>>>> > > > >>>>>>>>>>>>> Hi Michael, > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>>> The trivial fix is to use DOCID/REVISIONID as > > > >>>>>>>>>>>>>> DOC_KEY. > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> Yes that’s definitely one way to address storage of > > > >>>>>>>>>>>>> edit > > > >>>>>>>> conflicts. I > > > >>>>>>>>>>>>> think there are other, more compact representations > > > >>>>>>>>>>>>> that > > we can > > > >>>>>>>> explore if > > > >>>>>>>>>>>>> we have this “exploded” data model where each > > > >>>>>>>>>>>>> scalar value > > maps > > > >>>>>>>> to an > > > >>>>>>>>>>>>> individual KV pair. > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> I agree, as I mentioned on the original thread, I > > > >>>>>>>>>>>> see a > > scheme, > > > >>>>>>>> that > > > >>>>>>>>>>>> handles both conflicts and revisions, where you only > > > >>>>>>>>>>>> have > > to store > > > >>>>>>>> the most > > > >>>>>>>>>>>> recent change to a field. Like you suggested, > > > >>>>>>>>>>>> multiple > > revisions > > > >>>>>>>> can share > > > >>>>>>>>>>>> a key. Which in my mind's eye further begs the > > conflicts/revisions > > > >>>>>>>>>>>> discussion along with the working within the limits > > discussion > > > >>>>>>>> because it > > > >>>>>>>>>>>> seems to me they are all intrinsically related as a > > "feature". > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> Saying 'We'll break documents up into roughly 80k > > segments', then > > > >>>>>>>> trying to > > > >>>>>>>>>>>> overlay some kind of field sharing scheme for > > revisions/conflicts > > > >>>>>>>> doesn't > > > >>>>>>>>>>>> seem like it will work. > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> I probably should have left out the trivial fix > > > >>>>>>>>>>>> proposal as > > I > > > >>>>>>>> don't think > > > >>>>>>>>>>>> it's a feasible solution to actually use. > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> The comment is more regarding that I do not see how > > > >>>>>>>>>>>> this > > thread > > > >>>>>>>> can escape > > > >>>>>>>>>>>> including how to store/retrieve conflicts/revisions. > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> For instance, the 'doc as individual fields' > > > >>>>>>>>>>>> proposal lends > > itself > > > >>>>>>>> to value > > > >>>>>>>>>>>> sharing across mutiple documents (and I don't just > > > >>>>>>>>>>>> mean > > revisions > > > >>>>>>>> of the > > > >>>>>>>>>>>> same doc, I mean the same key/value instance could > > > >>>>>>>>>>>> be > > shared for > > > >>>>>>>> every > > > >>>>>>>>>>>> document). > > > >>>>>>>>>>>> However that's not really relevant if we're not > > > >>>>>>>>>>>> considering > > the > > > >>>>>>>> amount of > > > >>>>>>>>>>>> shared information across documents in the storage > > > >>>>>>>>>>>> scheme. 
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Simply storing documents in <100k segments (perhaps in
> > > >>>>>>>>>>>> some kind of compressed binary representation) to deal
> > > >>>>>>>>>>>> with that FDB limit seems fine. The only reason to
> > > >>>>>>>>>>>> consider doing something else is because of its impact
> > > >>>>>>>>>>>> on indexing, searches, reduce functions, revisions,
> > > >>>>>>>>>>>> on-disk size, etc.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>>> I'm assuming the process will flatten the key paths of
> > > >>>>>>>>>>>>>> the document into an array and then request the value
> > > >>>>>>>>>>>>>> of each key as multiple parallel queries against FDB
> > > >>>>>>>>>>>>>> at once
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Ah, I think this is not one of Ilya’s assumptions. He’s
> > > >>>>>>>>>>>>> trying to design a model which allows the retrieval of
> > > >>>>>>>>>>>>> a document with a single range read, which is a good
> > > >>>>>>>>>>>>> goal in my opinion.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> I am not sure I agree.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Think of BitTorrent: a single range read should pull
> > > >>>>>>>>>>>> back the structure of the document (the pieces to
> > > >>>>>>>>>>>> fetch), but not necessarily the whole document.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> What if you already have a bunch of pieces in common
> > > >>>>>>>>>>>> with other documents locally (a repeated header, footer,
> > > >>>>>>>>>>>> or type, for example), and you only need to get a few
> > > >>>>>>>>>>>> pieces of data you don't already have?
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> The real goal I see for Couch is to treat your document
> > > >>>>>>>>>>>> set like the collection of structured information that
> > > >>>>>>>>>>>> it is: in some respects like an extension of your
> > > >>>>>>>>>>>> application's heap space for structured objects, with
> > > >>>>>>>>>>>> efficient queries over that collection to get back
> > > >>>>>>>>>>>> subsets of the data.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Otherwise it seems more like a slightly upgraded file
> > > >>>>>>>>>>>> system plus a fancy grep/find-like feature...
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> The best way I see to unlock more features/power is to
> > > >>>>>>>>>>>> move towards a more granular and efficient way to store
> > > >>>>>>>>>>>> and retrieve the scalar values...
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> For example, here's a crazy thought:
> > > >>>>>>>>>>>> Map every distinct occurrence of a key/value instance
> > > >>>>>>>>>>>> through a crypto hash function to get a set of hashes.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> These can be precomputed by Couch without any lookups in
> > > >>>>>>>>>>>> FDB. These will be spread all over kingdom come in FDB
> > > >>>>>>>>>>>> and not lend themselves to range search well.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> So what you do is index them for frequency of occurring
> > > >>>>>>>>>>>> in the same set. In essence, you 'bucket them'
> > > >>>>>>>>>>>> statistically, and that bucket id becomes a key prefix.
> > > >>>>>>>>>>>> A crypto hash value can be copied into more than one
> > > >>>>>>>>>>>> bucket. The {bucket_id}/{cryptohash} becomes a {val_id}
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> When writing a document, Couch submits the list/array of
> > > >>>>>>>>>>>> cryptohash values it computed to FDB and gets back the
> > > >>>>>>>>>>>> corresponding {val_id} (the id with the bucket prefixed).
> > > >>>>>>>>>>>> This can get somewhat expensive if there's always a lot
> > > >>>>>>>>>>>> of app local cache misses.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> A document's value is then a series of {val_id} arrays
> > > >>>>>>>>>>>> up to 100k per segment.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> When retrieving a document, you get the val_ids, find
> > > >>>>>>>>>>>> the distinct buckets and min/max entries for this doc,
> > > >>>>>>>>>>>> and then parallel query each bucket while reconstructing
> > > >>>>>>>>>>>> the document.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> The values returned from the buckets query are the
> > > >>>>>>>>>>>> key/value strings required to reassemble this document.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> ----------
> > > >>>>>>>>>>>> I put this forward primarily to highlight the idea that
> > > >>>>>>>>>>>> trying to match the storage representation of documents
> > > >>>>>>>>>>>> in a straightforward way to FDB keys to reduce query
> > > >>>>>>>>>>>> count might not be the most performance-oriented
> > > >>>>>>>>>>>> approach.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> I'd much prefer a storage approach that reduced data
> > > >>>>>>>>>>>> duplication and enabled fast sub-document queries.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> This clearly falls in the realm of what people want the
> > > >>>>>>>>>>>> 'use case' of Couch to be/become. By giving Couch more
> > > >>>>>>>>>>>> access to sub-document queries, I could eventually see
> > > >>>>>>>>>>>> queries as complicated as GraphQL submitted to Couch,
> > > >>>>>>>>>>>> pulling back ad-hoc aggregated data across multiple
> > > >>>>>>>>>>>> documents in a single application-layer request.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Hehe - one way to look at the database of Couch
> > > >>>>>>>>>>>> documents is that they are all conflict revisions of the
> > > >>>>>>>>>>>> single root empty document. What I mean by this is:
> > > >>>>>>>>>>>> consider thinking of the entire document store as one
> > > >>>>>>>>>>>> giant DAG of key/value pairs, where even separate
> > > >>>>>>>>>>>> documents are still typically related to each other.
> > > >>>>>>>>>>>> For most applications there is a tremendous amount of
> > > >>>>>>>>>>>> data redundancy between docs and especially between
> > > >>>>>>>>>>>> revisions of those docs...
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> And all this is a long way of saying "I think there
> > > >>>>>>>>>>>> could be a lot of value in assuming documents are
> > > >>>>>>>>>>>> 'assembled' from multiple queries to FDB, with local
> > > >>>>>>>>>>>> caching, instead of simply retrieved"
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Thanks, I hope I'm not the only outlier here thinking
> > > >>>>>>>>>>>> this way!?
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Mike :-)
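> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> P.S. A rough sketch of the write side, to make the
> > > >>>>>>>>>>>> bucket idea concrete (Python bindings; sha256, the
> > > >>>>>>>>>>>> subspace names, and bucket_of are placeholders for the
> > > >>>>>>>>>>>> statistical bucketing above, and segmenting the val_id
> > > >>>>>>>>>>>> list into 100k chunks is elided):
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>     import hashlib
> > > >>>>>>>>>>>>     import json
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>     import fdb
> > > >>>>>>>>>>>>     import fdb.tuple
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>     fdb.api_version(600)
> > > >>>>>>>>>>>>     db = fdb.open()
> > > >>>>>>>>>>>>     vals = fdb.Subspace(('vals',))  # (bucket, hash) -> key/value string
> > > >>>>>>>>>>>>     docs = fdb.Subspace(('docs',))  # doc_id -> packed val_id list
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>     def val_id(path, value, bucket_of):
> > > >>>>>>>>>>>>         # one crypto hash per distinct key/value instance
> > > >>>>>>>>>>>>         h = hashlib.sha256(json.dumps([path, value]).encode()).digest()
> > > >>>>>>>>>>>>         return (bucket_of(h), h)
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>     @fdb.transactional
> > > >>>>>>>>>>>>     def write_doc(tr, doc_id, pairs, bucket_of):
> > > >>>>>>>>>>>>         ids = []
> > > >>>>>>>>>>>>         for path, value in pairs:
> > > >>>>>>>>>>>>             vid = val_id(path, value, bucket_of)
> > > >>>>>>>>>>>>             # shared by every doc containing this key/value
> > > >>>>>>>>>>>>             tr[vals.pack(vid)] = json.dumps([path, value]).encode()
> > > >>>>>>>>>>>>             ids.append(vid)
> > > >>>>>>>>>>>>         tr[docs.pack((doc_id,))] = fdb.tuple.pack(tuple(ids))
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Reads would fetch the doc's val_id list, group it by
> > > >>>>>>>>>>>> bucket, and issue one parallel range query per bucket
> > > >>>>>>>>>>>> spanning that doc's min/max hashes, as described above.
> > > >>>>>>>>>>>> Deletes would need refcounting or GC on the shared
> > > >>>>>>>>>>>> values.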