Strongly agree that we very much don't want Erlang-isms pushed into fdb. Regardless of what we end up with, I'd like to see a very strong (de)?serialization layer with significant test coverage.
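To make that concrete, here's a rough sketch of the encode/decode boundary I have in mind (plain Python for readability, purely illustrative; the real layer would be Erlang over the FDB tuple encoding, and the storage-version marker below is just a made-up name). Documents get exploded into path tuples on write and reassembled on read, with an explicit version so the on-disk layout can change later without major surgery:

```python
# Purely illustrative (plain Python, not CouchDB code). Explodes a JSON
# document into (path-tuple, scalar) pairs as in Adam's proposal and
# reassembles it. DOC_STORAGE_VERSION and the "_storage_version" key are
# made-up names for the kind of version marker discussed below.

DOC_STORAGE_VERSION = 1

def explode(doc, prefix=()):
    """Yield (path, scalar_value) pairs for a nested JSON-like document."""
    if isinstance(doc, dict):
        for key, value in doc.items():
            yield from explode(value, prefix + (key,))
    elif isinstance(doc, list):
        for index, value in enumerate(doc):
            yield from explode(value, prefix + (index,))
    else:
        yield prefix, doc  # scalar leaf: string, number, bool, or None

def assemble(pairs):
    """Rebuild the nested document from exploded (path, value) pairs."""
    root = {}
    for path, value in pairs:
        node = root
        for part in path[:-1]:
            node = node.setdefault(part, {})
        node[path[-1]] = value

    def restore_lists(node):
        # JSON object keys are strings, so all-integer keys mean "array".
        if isinstance(node, dict):
            node = {k: restore_lists(v) for k, v in node.items()}
            if node and all(isinstance(k, int) for k in node):
                return [node[i] for i in sorted(node)]
        return node

    return restore_lists(root)

doc = {"owner": "bob", "mylist": [1, 3, 5],
       "mymap": {"blue": "#0000FF", "red": "#FF0000"}}

assert assemble(explode(doc)) == doc  # the round-trip property to test hard

kvs = [(("foo",) + path, value) for path, value in explode(doc)]
kvs.append((("foo", "_storage_version"), DOC_STORAGE_VERSION))
# kvs ~ [(("foo", "owner"), "bob"), (("foo", "mylist", 0), 1), ...]
```

Property-testing that round trip, in the spirit of couch_key_tree_prop_tests, is exactly the kind of coverage I'd want on whatever layer we end up with.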
On Tue, Feb 19, 2019 at 6:54 PM Adam Kocoloski <kocol...@apache.org> wrote: > > Yes, that sort of versioning has been omitted from the various concrete > proposals but we definitely want to have it. We’ve seen the alternative in > some of the Erlang records that we serialize to disk today and it ain’t > pretty. > > I can imagine that we’ll want to have the codebase laid out in a way that > allows us to upgrade to a smarter KV encoding over time without major > surgery, which I think is a good “layer of abstraction”. I would be nervous > if we started having abstract containers of data structures pushed down into > FDB itself :) > > Adam > > > On Feb 19, 2019, at 5:41 PM, Paul Davis <paul.joseph.da...@gmail.com> wrote: > > > > A simple doc storage version number would likely be enough for future us to > > do fancier things. > > > > On Tue, Feb 19, 2019 at 4:16 PM Benjamin Anderson <banjie...@apache.org> > > wrote: > > > >>> I don’t think adding a layer of abstraction is the right move just yet, > >> I think we should continue to find consensus on one answer to this question > >> > >> Agree that the theorycrafting stage is not optimal for making > >> abstraction decisions, but I suspect it would be worthwhile somewhere > >> between prototyping and releasing. Adam's proposal does seem to me the > >> most appealing approach on the surface, and I don't see anyone signing > >> up to do the work to deliver an alternative concurrently. > >> > >> -- > >> ba > >> > >> On Tue, Feb 19, 2019 at 1:43 PM Robert Samuel Newson <rnew...@apache.org> > >> wrote: > >>> > >>> Addendum: By “directory aliasing” I meant within a document (either the > >> actual Directory thing or something equivalent of our own making). The > >> directory aliasing for each database is a good way to reduce key size > >> without a significant cost. Though if Redwood lands in time, even this > >> would become an inutile obfuscation]. > >>> > >>>> On 19 Feb 2019, at 21:39, Robert Samuel Newson <rnew...@apache.org> > >> wrote: > >>>> > >>>> Interesting suggestion, obviously the details might get the wrong kind > >> of fun. > >>>> > >>>> Somewhere above I suggested this would be something we could change > >> over time and even use different approaches for different documents within > >> the same database. This is the long way of saying there are multiple ways > >> to do this each with advantages and none without disadvantages. > >>>> > >>>> I don’t think adding a layer of abstraction is the right move just > >> yet, I think we should continue to find consensus on one answer to this > >> question (and the related ones in other threads) for the first release. > >> It’s easy to say “we can change it later”, of course. We can, though it > >> would be a chunk of work in the context of something that already works, > >> I’ve rarely seen anyone sign up for that. > >>>> > >>>> I’m fine with the first proposal from Adam, where the keys are tuples > >> of key parts pointing at terminal values. To make it easier for the first > >> version, I would exclude optimisations like deduplication or the Directory > >> aliasing or the schema thing that I suggested and that Ilya incorporated a > >> variant of in a follow-up post. We’d accept that there are limits on the > >> sizes of documents, including the awkward-to-express one about property > >> depth. > >>>> > >>>> Stepping back, I’m not seeing any essential improvement over Adam’s > >> original proposal besides the few corrections and clarifications made by > >> various authors. 
Could we start an RFC based on Adam’s original proposal on > >> document body, revision tree and index storage? We could then have PR’s > >> against that for each additional optimisation (one person’s optimisation is > >> another person’s needless complication)? > >>>> > >>>> If I’ve missed some genuine advance on the original proposal in this > >> long thread, please call it out for me. > >>>> > >>>> B. > >>>> > >>>>> On 19 Feb 2019, at 21:15, Benjamin Anderson <banjie...@apache.org> > >> wrote: > >>>>> > >>>>> As is evident by the length of this thread, there's a pretty big > >>>>> design space to cover here, and it seems unlikely we'll have arrived > >>>>> at a "correct" solution even by the time this thing ships. Perhaps it > >>>>> would be worthwhile to treat the in-FDB representation of data as a > >>>>> first-class abstraction and support multiple representations > >>>>> simultaneously? > >>>>> > >>>>> Obviously there's no such thing as a zero-cost abstraction - and I've > >>>>> not thought very hard about how far up the stack the document > >>>>> representation would need to leak - but supporting different layouts > >>>>> (primarily, as Adam points out, on the document body itself) might > >>>>> prove interesting and useful. I'm sure there are folks interested in a > >>>>> column-shaped CouchDB, for example. > >>>>> > >>>>> -- > >>>>> b > >>>>> > >>>>> On Tue, Feb 19, 2019 at 11:39 AM Robert Newson <rnew...@apache.org> > >> wrote: > >>>>>> > >>>>>> Good points on revtree, I agree with you we should store that > >> intelligently to gain the benefits you mentioned. > >>>>>> > >>>>>> -- > >>>>>> Robert Samuel Newson > >>>>>> rnew...@apache.org > >>>>>> > >>>>>> On Tue, 19 Feb 2019, at 18:41, Adam Kocoloski wrote: > >>>>>>> I do not think we should store the revtree as a blob. The design > >> where > >>>>>>> each edit branch is its own KV should save on network IO and CPU > >> cycles > >>>>>>> for normal updates. We’ve performed too many heroics to keep > >>>>>>> couch_key_tree from stalling entire databases when trying to update > >> a > >>>>>>> single document with a wide revision tree, I would much prefer to > >> ignore > >>>>>>> other edit branches entirely when all we’re doing is extending one > >> of > >>>>>>> them. > >>>>>>> > >>>>>>> I also do not think we should store JSON documents as blobs, but > >> it’s a > >>>>>>> closer call. Some of my reasoning for preferring the exploded path > >>>>>>> design: > >>>>>>> > >>>>>>> - it lends itself nicely to sub-document operations, for which Jan > >>>>>>> crafted an RFC last year: > >> https://github.com/apache/couchdb/issues/1559 > >>>>>>> - it optimizes the creation of Mango indexes on existing databases > >> since > >>>>>>> we only need to retrieve the value(s) we want to index > >>>>>>> - it optimizes Mango queries that use field selectors > >>>>>>> - anyone who wanted to try their hand at GraphQL will find it very > >>>>>>> handy: https://github.com/apache/couchdb/issues/1499 > >>>>>>> - looking further ahead, it lets us play with smarter leaf value > >> types > >>>>>>> like Counters (yes I’m still on the CRDT bandwagon, sorry) > >>>>>>> > >>>>>>> A few comments on the thread: > >>>>>>> > >>>>>>>>>> * Most documents bodies are probably going to be smaller than > >> 100k. So in > >>>>>>>>>> the majority of case it would be one write / one read to update > >> and fetch > >>>>>>>>>> the document body. 
> >>>>>>> > >>>>>>> We should test, but I expect reading 50KB of data in a range query > >> is > >>>>>>> almost as efficient as reading a single 50 KB value. Similarly, > >> writes > >>>>>>> to a contiguous set of keys should be quite efficient. > >>>>>>> > >>>>>>> I am concerned about the overhead of the repeated field paths in the > >>>>>>> keys with the exploded path option in the absence of key prefix > >>>>>>> compression. That would be my main reason to acquiesce and throw > >> away > >>>>>>> all the document structure. > >>>>>>> > >>>>>>> Adam > >>>>>>> > >>>>>>>> On Feb 19, 2019, at 12:04 PM, Robert Newson <rnew...@apache.org> > >> wrote: > >>>>>>>> > >>>>>>>> I like the idea that we'd reuse the same pattern (but perhaps not > >> the same _code_) for doc bodies, revtree and attachments. > >>>>>>>> > >>>>>>>> I hope we still get to delete couch_key_tree.erl, though. > >>>>>>>> > >>>>>>>> -- > >>>>>>>> Robert Samuel Newson > >>>>>>>> rnew...@apache.org > >>>>>>>> > >>>>>>>> On Tue, 19 Feb 2019, at 17:03, Jan Lehnardt wrote: > >>>>>>>>> I like the idea from a “trying a simple thing first” perspective, > >> but > >>>>>>>>> Nick’s points below are especially convincing to with this for > >> now. > >>>>>>>>> > >>>>>>>>> Best > >>>>>>>>> Jan > >>>>>>>>> — > >>>>>>>>> > >>>>>>>>>> On 19. Feb 2019, at 17:53, Nick Vatamaniuc <vatam...@gmail.com> > >> wrote: > >>>>>>>>>> > >>>>>>>>>> Hi, > >>>>>>>>>> > >>>>>>>>>> Sorry for jumping in so late, I was following from the sidelines > >> mostly. A > >>>>>>>>>> lot of good discussion happening and am excited about the > >> possibilities > >>>>>>>>>> here. > >>>>>>>>>> > >>>>>>>>>> I do like the simpler "chunking" approach for a few reasons: > >>>>>>>>>> > >>>>>>>>>> * Most documents bodies are probably going to be smaller than > >> 100k. So in > >>>>>>>>>> the majority of case it would be one write / one read to update > >> and fetch > >>>>>>>>>> the document body. > >>>>>>>>>> > >>>>>>>>>> * We could reuse the chunking code for attachment handling and > >> possibly > >>>>>>>>>> revision key trees. So it's the general pattern of upload chunks > >> to some > >>>>>>>>>> prefix, and when finished flip an atomic toggle to make it > >> current. > >>>>>>>>>> > >>>>>>>>>> * Do the same thing with revision trees and we could re-use the > >> revision > >>>>>>>>>> tree manipulation logic. That is, the key tree in most cases > >> would be small > >>>>>>>>>> enough to fit in 100k but if they get huge, they'd get chunked. > >> This would > >>>>>>>>>> allow us to reuse all the battle tested couch_key_tree code > >> mostly as is. > >>>>>>>>>> We even have property tests for it > >>>>>>>>>> > >> https://github.com/apache/couchdb/blob/master/src/couch/test/couch_key_tree_prop_tests.erl > >>>>>>>>>> > >>>>>>>>>> * It removes the need to explain the max exploded path length > >> limitation to > >>>>>>>>>> customers. > >>>>>>>>>> > >>>>>>>>>> Cheers, > >>>>>>>>>> -Nick > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> On Tue, Feb 19, 2019 at 11:18 AM Robert Newson < > >> rnew...@apache.org> wrote: > >>>>>>>>>> > >>>>>>>>>>> Hi, > >>>>>>>>>>> > >>>>>>>>>>> An alternative storage model that we should seriously consider > >> is to > >>>>>>>>>>> follow our current approach in couch_file et al. Specifically, > >> that the > >>>>>>>>>>> document _body_ is stored as an uninterpreted binary value. 
> >> This would be > >>>>>>>>>>> much like the obvious plan for attachment storage; a key prefix > >> that > >>>>>>>>>>> identifies the database and document, with the final item of > >> that key tuple > >>>>>>>>>>> is an incrementing integer. Each of those keys has a binary > >> value of up to > >>>>>>>>>>> 100k. Fetching all values with that key prefix, in fdb's > >> natural ordering, > >>>>>>>>>>> will yield the full document body, which can be JSON decoded > >> for further > >>>>>>>>>>> processing. > >>>>>>>>>>> > >>>>>>>>>>> I like this idea, and I like Adam's original proposal to > >> explode documents > >>>>>>>>>>> into property paths. I have a slight preference for the > >> simplicity of the > >>>>>>>>>>> idea in the previous paragraph, not least because it's close to > >> what we do > >>>>>>>>>>> today. I also think it will be possible to migrate to > >> alternative storage > >>>>>>>>>>> models in future, and foundationdb's transaction supports means > >> we can do > >>>>>>>>>>> this migration seamlessly should we come to it. > >>>>>>>>>>> > >>>>>>>>>>> I'm very interested in knowing if anyone else is interested in > >> going this > >>>>>>>>>>> simple, or considers it a wasted opportunity relative to the > >> 'exploded' > >>>>>>>>>>> path. > >>>>>>>>>>> > >>>>>>>>>>> B. > >>>>>>>>>>> > >>>>>>>>>>> -- > >>>>>>>>>>> Robert Samuel Newson > >>>>>>>>>>> rnew...@apache.org > >>>>>>>>>>> > >>>>>>>>>>> On Mon, 4 Feb 2019, at 19:59, Robert Newson wrote: > >>>>>>>>>>>> I've been remiss here in not posting the data model ideas that > >> IBM > >>>>>>>>>>>> worked up while we were thinking about using FoundationDB so > >> I'm posting > >>>>>>>>>>>> it now. This is Adam' Kocoloski's original work, I am just > >> transcribing > >>>>>>>>>>>> it, and this is the context that the folks from the IBM side > >> came in > >>>>>>>>>>>> with, for full disclosure. > >>>>>>>>>>>> > >>>>>>>>>>>> Basics > >>>>>>>>>>>> > >>>>>>>>>>>> 1. All CouchDB databases are inside a Directory > >>>>>>>>>>>> 2. Each CouchDB database is a Directory within that Directory > >>>>>>>>>>>> 3. It's possible to list all subdirectories of a Directory, so > >>>>>>>>>>>> `_all_dbs` is the list of directories from 1. > >>>>>>>>>>>> 4. Each Directory representing a CouchdB database has several > >> Subspaces; > >>>>>>>>>>>> 4a. by_id/ doc subspace: actual document contents > >>>>>>>>>>>> 4b. by_seq/versionstamp subspace: for the _changes feed > >>>>>>>>>>>> 4c. index_definitions, indexes, ... > >>>>>>>>>>>> > >>>>>>>>>>>> JSON Mapping > >>>>>>>>>>>> > >>>>>>>>>>>> A hierarchical JSON object naturally maps to multiple KV pairs > >> in FDB: > >>>>>>>>>>>> > >>>>>>>>>>>> { > >>>>>>>>>>>> “_id”: “foo”, > >>>>>>>>>>>> “owner”: “bob”, > >>>>>>>>>>>> “mylist”: [1,3,5], > >>>>>>>>>>>> “mymap”: { > >>>>>>>>>>>> “blue”: “#0000FF”, > >>>>>>>>>>>> “red”: “#FF0000” > >>>>>>>>>>>> } > >>>>>>>>>>>> } > >>>>>>>>>>>> > >>>>>>>>>>>> maps to > >>>>>>>>>>>> > >>>>>>>>>>>> (“foo”, “owner”) = “bob” > >>>>>>>>>>>> (“foo”, “mylist”, 0) = 1 > >>>>>>>>>>>> (“foo”, “mylist”, 1) = 3 > >>>>>>>>>>>> (“foo”, “mylist”, 2) = 5 > >>>>>>>>>>>> (“foo”, “mymap”, “blue”) = “#0000FF” > >>>>>>>>>>>> (“foo”, “mymap”, “red”) = “#FF0000” > >>>>>>>>>>>> > >>>>>>>>>>>> NB: this means that the 100KB limit applies to individual > >> leafs in the > >>>>>>>>>>>> JSON object, not the entire doc > >>>>>>>>>>>> > >>>>>>>>>>>> Edit Conflicts > >>>>>>>>>>>> > >>>>>>>>>>>> We need to account for the presence of conflicts in various > >> levels of > >>>>>>>>>>>> the doc due to replication. 
> >>>>>>>>>>>> > >>>>>>>>>>>> Proposal is to create a special value indicating that the > >> subtree below > >>>>>>>>>>>> our current cursor position is in an unresolvable conflict. > >> Then add > >>>>>>>>>>>> additional KV pairs below to describe the conflicting entries. > >>>>>>>>>>>> > >>>>>>>>>>>> KV data model allows us to store these efficiently and minimize > >>>>>>>>>>>> duplication of data: > >>>>>>>>>>>> > >>>>>>>>>>>> A document with these two conflicts: > >>>>>>>>>>>> > >>>>>>>>>>>> { > >>>>>>>>>>>> “_id”: “foo”, > >>>>>>>>>>>> “_rev”: “1-abc”, > >>>>>>>>>>>> “owner”: “alice”, > >>>>>>>>>>>> “active”: true > >>>>>>>>>>>> } > >>>>>>>>>>>> { > >>>>>>>>>>>> “_id”: “foo”, > >>>>>>>>>>>> “_rev”: “1-def”, > >>>>>>>>>>>> “owner”: “bob”, > >>>>>>>>>>>> “active”: true > >>>>>>>>>>>> } > >>>>>>>>>>>> > >>>>>>>>>>>> could be stored thus: > >>>>>>>>>>>> > >>>>>>>>>>>> (“foo”, “active”) = true > >>>>>>>>>>>> (“foo”, “owner”) = kCONFLICT > >>>>>>>>>>>> (“foo”, “owner”, “1-abc”) = “alice” > >>>>>>>>>>>> (“foo”, “owner”, “1-def”) = “bob” > >>>>>>>>>>>> > >>>>>>>>>>>> So long as `kCONFLICT` is set at the top of the conflicting > >> subtree this > >>>>>>>>>>>> representation can handle conflicts of different data types as > >> well. > >>>>>>>>>>>> > >>>>>>>>>>>> Missing fields need to be handled explicitly: > >>>>>>>>>>>> > >>>>>>>>>>>> { > >>>>>>>>>>>> “_id”: “foo”, > >>>>>>>>>>>> “_rev”: “1-abc”, > >>>>>>>>>>>> “owner”: “alice”, > >>>>>>>>>>>> “active”: true > >>>>>>>>>>>> } > >>>>>>>>>>>> > >>>>>>>>>>>> { > >>>>>>>>>>>> “_id”: “foo”, > >>>>>>>>>>>> “_rev”: “1-def”, > >>>>>>>>>>>> “owner”: { > >>>>>>>>>>>> “name”: “bob”, > >>>>>>>>>>>> “email”: “ > >>>>>>>>>>>> b...@example.com > >>>>>>>>>>>> " > >>>>>>>>>>>> } > >>>>>>>>>>>> } > >>>>>>>>>>>> > >>>>>>>>>>>> could be stored thus: > >>>>>>>>>>>> > >>>>>>>>>>>> (“foo”, “active”) = kCONFLICT > >>>>>>>>>>>> (“foo”, “active”, “1-abc”) = true > >>>>>>>>>>>> (“foo”, “active”, “1-def”) = kMISSING > >>>>>>>>>>>> (“foo”, “owner”) = kCONFLICT > >>>>>>>>>>>> (“foo”, “owner”, “1-abc”) = “alice” > >>>>>>>>>>>> (“foo”, “owner”, “1-def”, “name”) = “bob” > >>>>>>>>>>>> (“foo”, “owner”, “1-def”, “email”) = ... 
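For illustration, a minimal sketch (hypothetical names, not an agreed encoding) of how a reader would collapse the kCONFLICT / kMISSING markers above into a single revision's view of the document:

```python
# Illustrative only: collapsing the kCONFLICT / kMISSING annotations above
# into one revision's view of the document. The sentinel objects, the kvs
# literal, and view_for_rev are made-up names, not an agreed encoding.

kCONFLICT = object()
kMISSING = object()

kvs = {
    ("foo", "active"): kCONFLICT,
    ("foo", "active", "1-abc"): True,
    ("foo", "active", "1-def"): kMISSING,
    ("foo", "owner"): kCONFLICT,
    ("foo", "owner", "1-abc"): "alice",
    ("foo", "owner", "1-def", "name"): "bob",
    ("foo", "owner", "1-def", "email"): "bob@example.com",  # sample value
}

def view_for_rev(kvs, rev):
    """Return {path: value} for one revision, resolving conflict markers."""
    out = {}
    for path, value in kvs.items():
        if value is kCONFLICT:
            continue  # structural marker, carries no data itself
        if rev in path:
            i = path.index(rev)
            path = path[:i] + path[i + 1:]  # drop the rev component
        elif any(kvs.get(path[:j]) is kCONFLICT for j in range(1, len(path))):
            continue  # lives under a conflict marker but belongs to another rev
        if value is kMISSING:
            continue  # field absent in this revision
        out[path[1:]] = value  # strip the doc id
    return out

assert view_for_rev(kvs, "1-abc") == {("active",): True, ("owner",): "alice"}
assert view_for_rev(kvs, "1-def") == {("owner", "name"): "bob",
                                      ("owner", "email"): "bob@example.com"}
```

A real implementation would presumably resolve the markers while streaming a range read over the document's key prefix rather than materializing a dict first.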
> >>>>>>>>>>>> > >>>>>>>>>>>> Revision Metadata > >>>>>>>>>>>> > >>>>>>>>>>>> * CouchDB uses a hash history for revisions > >>>>>>>>>>>> ** Each edit is identified by the hash of the content of the > >> edit > >>>>>>>>>>>> including the base revision against which it was applied > >>>>>>>>>>>> ** Individual edit branches are bounded in length but the > >> number of > >>>>>>>>>>>> branches is potentially unbounded > >>>>>>>>>>>> > >>>>>>>>>>>> * Size limits preclude us from storing the entire key tree as > >> a single > >>>>>>>>>>>> value; in pathological situations > >>>>>>>>>>>> the tree could exceed 100KB (each entry is > 16 bytes) > >>>>>>>>>>>> > >>>>>>>>>>>> * Store each edit branch as a separate KV including deleted > >> status in a > >>>>>>>>>>>> special subspace > >>>>>>>>>>>> > >>>>>>>>>>>> * Structure key representation so that “winning” revision can > >> be > >>>>>>>>>>>> automatically retrieved in a limit=1 > >>>>>>>>>>>> key range operation > >>>>>>>>>>>> > >>>>>>>>>>>> (“foo”, “_meta”, “deleted=false”, 1, “def”) = [] > >>>>>>>>>>>> (“foo”, “_meta”, “deleted=false”, 4, “bif”) = > >> [“3-baz”,”2-bar”,”1-foo”] > >>>>>>>>>>>> <-- winner > >>>>>>>>>>>> (“foo”, “_meta”, “deleted=true”, 3, “abc”) = [“2-bar”, “1-foo”] > >>>>>>>>>>>> > >>>>>>>>>>>> Changes Feed > >>>>>>>>>>>> > >>>>>>>>>>>> * FDB supports a concept called a versionstamp — a 10 byte, > >> unique, > >>>>>>>>>>>> monotonically (but not sequentially) increasing value for each > >> committed > >>>>>>>>>>>> transaction. The first 8 bytes are the committed version of the > >>>>>>>>>>>> database. The last 2 bytes are monotonic in the serialization > >> order for > >>>>>>>>>>>> transactions. > >>>>>>>>>>>> > >>>>>>>>>>>> * A transaction can specify a particular index into a key > >> where the > >>>>>>>>>>>> following 10 bytes will be overwritten by the versionstamp at > >> commit > >>>>>>>>>>>> time > >>>>>>>>>>>> > >>>>>>>>>>>> * A subspace keyed on versionstamp naturally yields a _changes > >> feed > >>>>>>>>>>>> > >>>>>>>>>>>> by_seq subspace > >>>>>>>>>>>> (“versionstamp1”) = (“foo”, “1-abc”) > >>>>>>>>>>>> (“versionstamp4”) = (“bar”, “4-def”) > >>>>>>>>>>>> > >>>>>>>>>>>> by_id subspace > >>>>>>>>>>>> (“bar”, “_vsn”) = “versionstamp4” > >>>>>>>>>>>> ... > >>>>>>>>>>>> (“foo”, “_vsn”) = “versionstamp1” > >>>>>>>>>>>> > >>>>>>>>>>>> JSON Indexes > >>>>>>>>>>>> > >>>>>>>>>>>> * “Mango” JSON indexes are defined by > >>>>>>>>>>>> ** a list of field names, each of which may be nested, > >>>>>>>>>>>> ** an optional partial_filter_selector which constrains the > >> set of docs > >>>>>>>>>>>> that contribute > >>>>>>>>>>>> ** an optional name defined by the ddoc field (the name is > >> auto- > >>>>>>>>>>>> generated if not supplied) > >>>>>>>>>>>> > >>>>>>>>>>>> * Store index definitions in a single subspace to aid query > >> planning > >>>>>>>>>>>> ** ((person,name), title, email) = (“name-title-email”, > >> “{“student”: > >>>>>>>>>>>> true}”) > >>>>>>>>>>>> ** Store the values for each index in a dedicated subspace, > >> adding the > >>>>>>>>>>>> document ID as the last element in the tuple > >>>>>>>>>>>> *** (“rosie revere”, “engineer”, “ro...@example.com", “foo”) > >> = null > >>>>>>>>>>>> > >>>>>>>>>>>> B. > >>>>>>>>>>>> > >>>>>>>>>>>> -- > >>>>>>>>>>>> Robert Samuel Newson > >>>>>>>>>>>> rnew...@apache.org > >>>>>>>>>>>> > >>>>>>>>>>>> On Mon, 4 Feb 2019, at 19:13, Ilya Khlopotov wrote: > >>>>>>>>>>>>> > >>>>>>>>>>>>> I want to fix previous mistakes. 
I did two mistakes in > >> previous > >>>>>>>>>>>>> calculations: > >>>>>>>>>>>>> - I used 1Kb as base size for calculating expansion factor > >> (although > >>>>>>>>>>> we > >>>>>>>>>>>>> don't know exact size of original document) > >>>>>>>>>>>>> - The expansion factor calculation included number of > >> revisions (it > >>>>>>>>>>>>> shouldn't) > >>>>>>>>>>>>> > >>>>>>>>>>>>> I'll focus on flattened JSON docs model > >>>>>>>>>>>>> > >>>>>>>>>>>>> The following formula is used in previous calculation. > >>>>>>>>>>>>> > >> storage_size_per_document=mapping_table_size*number_of_revisions + > >>>>>>>>>>>>> depth*number_of_paths*number_of_revisions + > >>>>>>>>>>>>> number_of_paths*value_size*number_of_revisions > >>>>>>>>>>>>> > >>>>>>>>>>>>> To clarify things a little bit I want to calculate space > >> requirement > >>>>>>>>>>> for > >>>>>>>>>>>>> single revision this time. > >>>>>>>>>>>>> > >> mapping_table_size=field_name_size*(field_name_length+4(integer > >>>>>>>>>>>>> size))=100 * (20 + 4(integer size)) = 2400 bytes > >>>>>>>>>>>>> > >> storage_size_per_document_per_revision_per_replica=mapping_table_size > >>>>>>>>>>> + > >>>>>>>>>>>>> depth*number_of_paths + value_size*number_of_paths = > >>>>>>>>>>>>> 2400bytes + 10*1000+1000*100=112400bytes~=110 Kb > >>>>>>>>>>>>> > >>>>>>>>>>>>> We definitely can reduce requirement for mapping table by > >> adopting > >>>>>>>>>>>>> rnewson's idea of a schema. > >>>>>>>>>>>>> > >>>>>>>>>>>>> On 2019/02/04 11:08:16, Ilya Khlopotov <iil...@apache.org> > >> wrote: > >>>>>>>>>>>>>> Hi Michael, > >>>>>>>>>>>>>> > >>>>>>>>>>>>>>> For example, hears a crazy thought: > >>>>>>>>>>>>>>> Map every distinct occurence of a key/value instance > >> through a > >>>>>>>>>>> crypto hash > >>>>>>>>>>>>>>> function to get a set of hashes. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> These can be be precomputed by Couch without any lookups in > >> FDB. > >>>>>>>>>>> These > >>>>>>>>>>>>>>> will be spread all over kingdom come in FDB and not lend > >>>>>>>>>>> themselves to > >>>>>>>>>>>>>>> range search well. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> So what you do is index them for frequency of occurring in > >> the > >>>>>>>>>>> same set. > >>>>>>>>>>>>>>> In essence, you 'bucket them' statistically, and that > >> bucket id > >>>>>>>>>>> becomes a > >>>>>>>>>>>>>>> key prefix. A crypto hash value can be copied into more > >> than one > >>>>>>>>>>> bucket. > >>>>>>>>>>>>>>> The {bucket_id}/{cryptohash} becomes a {val_id} > >>>>>>>>>>>>>> > >>>>>>>>>>>>>>> When writing a document, Couch submits the list/array of > >>>>>>>>>>> cryptohash values > >>>>>>>>>>>>>>> it computed to FDB and gets back the corresponding > >> {val_id} (the > >>>>>>>>>>> id with > >>>>>>>>>>>>>>> the bucket prefixed). This can get somewhat expensive if > >> there's > >>>>>>>>>>> always a > >>>>>>>>>>>>>>> lot of app local cache misses. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> A document's value is then a series of {val_id} arrays up > >> to 100k > >>>>>>>>>>> per > >>>>>>>>>>>>>>> segment. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> When retrieving a document, you get the val_ids, find the > >> distinct > >>>>>>>>>>> buckets > >>>>>>>>>>>>>>> and min/max entries for this doc, and then parallel query > >> each > >>>>>>>>>>> bucket while > >>>>>>>>>>>>>>> reconstructing the document. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Interesting idea. Let's try to think it through to see if we > >> can > >>>>>>>>>>> make it viable. > >>>>>>>>>>>>>> Let's go through hypothetical example. 
Input data for the > >> example: > >>>>>>>>>>>>>> - 1M of documents > >>>>>>>>>>>>>> - each document is around 10Kb > >>>>>>>>>>>>>> - each document consists of 1K of unique JSON paths > >>>>>>>>>>>>>> - each document has 100 unique JSON field names > >>>>>>>>>>>>>> - every scalar value is 100 bytes > >>>>>>>>>>>>>> - 10% of unique JSON paths for every document already stored > >> in > >>>>>>>>>>> database under different doc or different revision of the > >> current one > >>>>>>>>>>>>>> - we assume 3 independent copies for every key-value pair in > >> FDB > >>>>>>>>>>>>>> - our hash key size is 32 bytes > >>>>>>>>>>>>>> - let's assume we can determine if key is already on the > >> storage > >>>>>>>>>>> without doing query > >>>>>>>>>>>>>> - 1% of paths is in cache (unrealistic value, in real live > >> the > >>>>>>>>>>> percentage is lower) > >>>>>>>>>>>>>> - every JSON field name is 20 bytes > >>>>>>>>>>>>>> - every JSON path is 10 levels deep > >>>>>>>>>>>>>> - document key prefix length is 50 > >>>>>>>>>>>>>> - every document has 10 revisions > >>>>>>>>>>>>>> Let's estimate the storage requirements and size of data we > >> need to > >>>>>>>>>>> transmit. The calculations are not exact. > >>>>>>>>>>>>>> 1. storage_size_per_document (we cannot estimate exact > >> numbers since > >>>>>>>>>>> we don't know how FDB stores it) > >>>>>>>>>>>>>> - 10 * ((10Kb - (10Kb * 10%)) + (1K - (1K * 10%)) * 32 > >> bytes) = > >>>>>>>>>>> 38Kb * 10 * 3 = 1140 Kb (11x) > >>>>>>>>>>>>>> 2. number of independent keys to retrieve on document read > >>>>>>>>>>> (non-range queries) per document > >>>>>>>>>>>>>> - 1K - (1K * 1%) = 990 > >>>>>>>>>>>>>> 3. number of range queries: 0 > >>>>>>>>>>>>>> 4. data to transmit on read: (1K - (1K * 1%)) * (100 bytes + > >> 32 > >>>>>>>>>>> bytes) = 102 Kb (10x) > >>>>>>>>>>>>>> 5. read latency (we use 2ms per read based on numbers from > >>>>>>>>>>> https://apple.github.io/foundationdb/performance.html) > >>>>>>>>>>>>>> - sequential: 990*2ms = 1980ms > >>>>>>>>>>>>>> - range: 0 > >>>>>>>>>>>>>> Let's compare these numbers with initial proposal (flattened > >> JSON > >>>>>>>>>>> docs without global schema and without cache) > >>>>>>>>>>>>>> 1. storage_size_per_document > >>>>>>>>>>>>>> - mapping table size: 100 * (20 + 4(integer size)) = 2400 > >> bytes > >>>>>>>>>>>>>> - key size: (10 * (4 + 1(delimiter))) + 50 = 100 bytes > >>>>>>>>>>>>>> - storage_size_per_document: 2.4K*10 + 100*1K*10 + 1K*100*10 > >> = > >>>>>>>>>>> 2024K = 1976 Kb * 3 = 5930 Kb (59.3x) > >>>>>>>>>>>>>> 2. number of independent keys to retrieve: 0-2 (depending on > >> index > >>>>>>>>>>> structure) > >>>>>>>>>>>>>> 3. number of range queries: 1 (1001 of keys in result) > >>>>>>>>>>>>>> 4. data to transmit on read: 24K + 1000*100 + 1000*100 = > >> 23.6 Kb > >>>>>>>>>>> (2.4x) > >>>>>>>>>>>>>> 5. read latency (we use 2ms per read based on numbers from > >>>>>>>>>>> https://apple.github.io/foundationdb/performance.html and > >> estimate range > >>>>>>>>>>> read performance based on numbers from > >>>>>>>>>>> > >> https://apple.github.io/foundationdb/benchmarking.html#single-core-read-test > >>>>>>>>>>> ) > >>>>>>>>>>>>>> - range read performance: Given read performance is about > >> 305,000 > >>>>>>>>>>> reads/second and range performance 3,600,000 keys/second we > >> estimate range > >>>>>>>>>>> performance to be 11.8x compared to read performance. If read > >> performance > >>>>>>>>>>> is 2ms than range performance is 0.169ms (which is hard to > >> believe). 
> >>>>>>>>>>>>>> - sequential: 2 * 2 = 4ms > >>>>>>>>>>>>>> - range: 0.169 > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> It looks like we are dealing with a tradeoff: > >>>>>>>>>>>>>> - Map every distinct occurrence of a key/value instance > >> through a > >>>>>>>>>>> crypto hash: > >>>>>>>>>>>>>> - 5.39x more disk space efficient > >>>>>>>>>>>>>> - 474x slower > >>>>>>>>>>>>>> - flattened JSON model > >>>>>>>>>>>>>> - 5.39x less efficient in disk space > >>>>>>>>>>>>>> - 474x faster > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> In any case this unscientific exercise was very helpful. > >> Since it > >>>>>>>>>>> uncovered the high cost in terms of disk space. 59.3x of > >> original disk size > >>>>>>>>>>> is too much IMO. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Are the any ways we can make Michael's model more performant? > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Also I don't quite understand few aspects of the global hash > >> table > >>>>>>>>>>> proposal: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> 1. > - Map every distinct occurence of a key/value instance > >> through > >>>>>>>>>>> a crypto hash function to get a set of hashes. > >>>>>>>>>>>>>> I think we are talking only about scalar values here? I.e. > >>>>>>>>>>> `"#/foo.bar.baz": 123` > >>>>>>>>>>>>>> Since I don't know how we can make it work for all possible > >> JSON > >>>>>>>>>>> paths `{"foo": {"bar": {"size": 12, "baz": 123}}}": > >>>>>>>>>>>>>> - foo > >>>>>>>>>>>>>> - foo.bar > >>>>>>>>>>>>>> - foo.bar.baz > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> 2. how to delete documents > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Best regards, > >>>>>>>>>>>>>> ILYA > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> On 2019/01/30 23:33:22, Michael Fair < > >> mich...@daclubhouse.net> > >>>>>>>>>>> wrote: > >>>>>>>>>>>>>>> On Wed, Jan 30, 2019, 12:57 PM Adam Kocoloski < > >> kocol...@apache.org > >>>>>>>>>>> wrote: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Hi Michael, > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> The trivial fix is to use DOCID/REVISIONID as DOC_KEY. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Yes that’s definitely one way to address storage of edit > >>>>>>>>>>> conflicts. I > >>>>>>>>>>>>>>>> think there are other, more compact representations that > >> we can > >>>>>>>>>>> explore if > >>>>>>>>>>>>>>>> we have this “exploded” data model where each scalar value > >> maps > >>>>>>>>>>> to an > >>>>>>>>>>>>>>>> individual KV pair. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> I agree, as I mentioned on the original thread, I see a > >> scheme, > >>>>>>>>>>> that > >>>>>>>>>>>>>>> handles both conflicts and revisions, where you only have > >> to store > >>>>>>>>>>> the most > >>>>>>>>>>>>>>> recent change to a field. Like you suggested, multiple > >> revisions > >>>>>>>>>>> can share > >>>>>>>>>>>>>>> a key. Which in my mind's eye further begs the > >> conflicts/revisions > >>>>>>>>>>>>>>> discussion along with the working within the limits > >> discussion > >>>>>>>>>>> because it > >>>>>>>>>>>>>>> seems to me they are all intrinsically related as a > >> "feature". > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Saying 'We'll break documents up into roughly 80k > >> segments', then > >>>>>>>>>>> trying to > >>>>>>>>>>>>>>> overlay some kind of field sharing scheme for > >> revisions/conflicts > >>>>>>>>>>> doesn't > >>>>>>>>>>>>>>> seem like it will work. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> I probably should have left out the trivial fix proposal as > >> I > >>>>>>>>>>> don't think > >>>>>>>>>>>>>>> it's a feasible solution to actually use. 
> >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> The comment is more regarding that I do not see how this > >> thread > >>>>>>>>>>> can escape > >>>>>>>>>>>>>>> including how to store/retrieve conflicts/revisions. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> For instance, the 'doc as individual fields' proposal lends > >> itself > >>>>>>>>>>> to value > >>>>>>>>>>>>>>> sharing across mutiple documents (and I don't just mean > >> revisions > >>>>>>>>>>> of the > >>>>>>>>>>>>>>> same doc, I mean the same key/value instance could be > >> shared for > >>>>>>>>>>> every > >>>>>>>>>>>>>>> document). > >>>>>>>>>>>>>>> However that's not really relevant if we're not considering > >> the > >>>>>>>>>>> amount of > >>>>>>>>>>>>>>> shared information across documents in the storage scheme. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Simply storing documents in <100k segments (perhaps in some > >> kind of > >>>>>>>>>>>>>>> compressed binary representation) to deal with that FDB > >> limit > >>>>>>>>>>> seems fine. > >>>>>>>>>>>>>>> The only reason to consider doing something else is because > >> of its > >>>>>>>>>>> impact > >>>>>>>>>>>>>>> to indexing, searches, reduce functions, revisions, on-disk > >> size > >>>>>>>>>>> impact, > >>>>>>>>>>>>>>> etc. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> I'm assuming the process will flatten the key paths of the > >>>>>>>>>>> document into > >>>>>>>>>>>>>>>> an array and then request the value of each key as multiple > >>>>>>>>>>> parallel > >>>>>>>>>>>>>>>> queries against FDB at once > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Ah, I think this is not one of Ilya’s assumptions. He’s > >> trying > >>>>>>>>>>> to design a > >>>>>>>>>>>>>>>> model which allows the retrieval of a document with a > >> single > >>>>>>>>>>> range read, > >>>>>>>>>>>>>>>> which is a good goal in my opinion. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> I am not sure I agree. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Think of bitTorrent, a single range read should pull back > >> the > >>>>>>>>>>> structure of > >>>>>>>>>>>>>>> the document (the pieces to fetch), but not necessarily the > >> whole > >>>>>>>>>>> document. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> What if you already have a bunch of pieces in common with > >> other > >>>>>>>>>>> documents > >>>>>>>>>>>>>>> locally (a repeated header/footer/ or type for example); > >> and you > >>>>>>>>>>> only need > >>>>>>>>>>>>>>> to get a few pieces of data you don't already have? > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> The real goal to Couch I see is to treat your document set > >> like the > >>>>>>>>>>>>>>> collection of structured information that it is. In some > >> respects > >>>>>>>>>>> like an > >>>>>>>>>>>>>>> extension of your application's heap space for structured > >> objects > >>>>>>>>>>> and > >>>>>>>>>>>>>>> efficiently querying that collection to get back subsets of > >> the > >>>>>>>>>>> data. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Otherwise it seems more like a slightly upgraded file > >> system plus > >>>>>>>>>>> a fancy > >>>>>>>>>>>>>>> grep/find like feature... > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> The best way I see to unlock more features/power is to a > >> move > >>>>>>>>>>> towards a > >>>>>>>>>>>>>>> more granular and efficient way to store and retrieve the > >> scalar > >>>>>>>>>>> values... 
> >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> For example, hears a crazy thought: > >>>>>>>>>>>>>>> Map every distinct occurence of a key/value instance > >> through a > >>>>>>>>>>> crypto hash > >>>>>>>>>>>>>>> function to get a set of hashes. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> These can be be precomputed by Couch without any lookups in > >> FDB. > >>>>>>>>>>> These > >>>>>>>>>>>>>>> will be spread all over kingdom come in FDB and not lend > >>>>>>>>>>> themselves to > >>>>>>>>>>>>>>> range search well. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> So what you do is index them for frequency of occurring in > >> the > >>>>>>>>>>> same set. > >>>>>>>>>>>>>>> In essence, you 'bucket them' statistically, and that > >> bucket id > >>>>>>>>>>> becomes a > >>>>>>>>>>>>>>> key prefix. A crypto hash value can be copied into more > >> than one > >>>>>>>>>>> bucket. > >>>>>>>>>>>>>>> The {bucket_id}/{cryptohash} becomes a {val_id} > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> When writing a document, Couch submits the list/array of > >>>>>>>>>>> cryptohash values > >>>>>>>>>>>>>>> it computed to FDB and gets back the corresponding > >> {val_id} (the > >>>>>>>>>>> id with > >>>>>>>>>>>>>>> the bucket prefixed). This can get somewhat expensive if > >> there's > >>>>>>>>>>> always a > >>>>>>>>>>>>>>> lot of app local cache misses. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> A document's value is then a series of {val_id} arrays up > >> to 100k > >>>>>>>>>>> per > >>>>>>>>>>>>>>> segment. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> When retrieving a document, you get the val_ids, find the > >> distinct > >>>>>>>>>>> buckets > >>>>>>>>>>>>>>> and min/max entries for this doc, and then parallel query > >> each > >>>>>>>>>>> bucket while > >>>>>>>>>>>>>>> reconstructing the document. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> The values returned from the buckets query are the key/value > >>>>>>>>>>> strings > >>>>>>>>>>>>>>> required to reassemble this document. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> ---------- > >>>>>>>>>>>>>>> I put this forward primarily to hilite the idea that trying > >> to > >>>>>>>>>>> match the > >>>>>>>>>>>>>>> storage representation of documents in a straight forward > >> way to > >>>>>>>>>>> FDB keys > >>>>>>>>>>>>>>> to reduce query count might not be the most performance > >> oriented > >>>>>>>>>>> approach. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> I'd much prefer a storage approach that reduced data > >> duplication > >>>>>>>>>>> and > >>>>>>>>>>>>>>> enabled fast sub-document queries. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> This clearly falls in the realm of what people want the > >> 'use case' > >>>>>>>>>>> of Couch > >>>>>>>>>>>>>>> to be/become. By giving Couch more access to sub-document > >>>>>>>>>>> queries, I could > >>>>>>>>>>>>>>> eventually see queries as complicated as GraphQL submitted > >> to > >>>>>>>>>>> Couch and > >>>>>>>>>>>>>>> pulling back ad-hoc aggregated data across multiple > >> documents in a > >>>>>>>>>>> single > >>>>>>>>>>>>>>> application layer request. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Hehe - one way to look at the database of Couch documents > >> is that > >>>>>>>>>>> they are > >>>>>>>>>>>>>>> all conflict revisions of the single root empty document. > >> What I > >>>>>>>>>>> mean be > >>>>>>>>>>>>>>> this is consider thinking of the entire document store as > >> one > >>>>>>>>>>> giant DAG of > >>>>>>>>>>>>>>> key/value pairs. 
How even separate documents are still > >> typically > >>>>>>>>>>> related to > >>>>>>>>>>>>>>> each other. For most applications there is a tremendous > >> amount of > >>>>>>>>>>> data > >>>>>>>>>>>>>>> redundancy between docs and especially between revisions of > >> those > >>>>>>>>>>> docs... > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> And all this is a long way of saying "I think there could > >> be a lot > >>>>>>>>>>> of value > >>>>>>>>>>>>>>> in assuming documents are 'assembled' from multiple queries > >> to > >>>>>>>>>>> FDB, with > >>>>>>>>>>>>>>> local caching, instead of simply retrieved" > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Thanks, I hope I'm not the only outlier here thinking this > >> way!? > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Mike :-) > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>> > >>>>>>> > >>>> > >>> > >> > >