Heya Bob, this sounds like a good path forward that allows us to validate the remaining pieces of this.
Just two side notes: - do we expect that the map-keys-in-fdb approach would sort differently from map-keys-in-ebtree? (I assume no) - I further assume that we’ll find a nicer UX for “make sure my views are encrypted” than “make sure your view has a reduce fun”, no details or drafts necessary at this point. Best Jan — > On 26. Jul 2020, at 14:13, Robert Samuel Newson <rnew...@apache.org> wrote: > > I'd like to find consensus on a path forward and, with that in mind, I have a > suggestion. > > Before that, I think it's useful to lay out the points of contention, which > I've ranked in I hope an uncontroversial order; > > The principal concern with using ebtree for map as well as reduce is that it > precludes the use of foundationdb's ordering, on which much performance and > scalability likely hinges. I agree with this on principle (that is, its > demonstrably true) though none of us yet know the extent to which it matters > in practice, or whether any gap could be closed with optimization. > > The secondary concern is the proposed use of ebtree for reduce-only where it > receives only the reduce values from a document. I understand garren's stated > desire to avoid duplication with the map side but I don't think it works out. > Users of the reduce function are using it to get the reductions _across_ > documents. It is rare (that is, I don't recall seeing an example by a > cloudant customer) to emit multiple rows in a map function for the same key. > In the cases where it doesn't, the value inserted into ebtree is the same as > the map value anyway. Where it does happen, we're still duplicating the key > and _a_ document level value. It seems a very modest saving of space for a > more complicated design (two separate levels of reduce calculated in two > separate modules). > > The tertiary concern is that the map-only indexes place user data outside of > aegis (native database encryption) as, necessarily, the order of keys is > important and a straightforward encryption there would not preserve it. > > Looking at all the options available I don't think there is a single path > that addresses everything. There are also significant unknowns around > practical performance for all of this. > > So, I suggest that the map-only code (specifically, the code we will use for > an index definition that does not specify a reduce function) uses the > existing couch_views approach, where the fdb keys are suffixed with the > emitted key from the map function, and the fdb values are the emitted values. > This vertical index should support the fastest ?key= and startkey/endkey= > queries that fdb is capable of. There might be simplifications we can make to > this code if we accept that it only needs to handle map-only indexes. > > For index definitions that _do_ specify a reduce function, I suggest this is > done entirely with ebtree.We would call ebtree:insert(Db, Tree, Key, Value) > where Key and Value are the emitted key and value from the map function. > ebtree:open is handed a reduce_fun which handles the reduce and rereduce for > the users reduction and, like couchdb 1/2/3, we also mix in a system > reduction that calculates the view size (paul's diff above appears to do just > this portion and looks broadly right). > > This avoids the duplication that garren's idea of storing the reduces in > ebtree was trying to avoid, but does so in all cases. This approach allows us > to compare the performance of map queries against map-only and map-reduce > indexes, it allows users to opt in or out of encrypting their emitted view > keys, and it doesn't prevent anyone from choosing to do both (You can define > a view with map function A and another view with map function A and reduce > function B, the former will be the couch_views vertical index, the latter an > ebtree "horizontal" index). > > As we learn more over time we could enhance one type of index or the other > and reach a point where we eliminate one, or the two could coexist > indefinitely. > >> On 24 Jul 2020, at 20:00, Paul Davis <paul.joseph.da...@gmail.com> wrote: >> >> FWIW, a first pass at views entirely on ebtree turned out to be fairly >> straightforward. Almost surprisingly simple in some cases. >> >> https://github.com/apache/couchdb/compare/prototype/fdb-layer...prototype/fdb-layer-ebtree-views >> >> Its currently passing all tests in `couch_views_map_test.erl` but is >> failing on other suites. I only did a quick skim on the failures but >> they all look superficial around some APIs I changed. >> >> I haven't added the APIs to query reduce functions via HTTP but the >> reduce functions are being executed to calculate row counts and KV >> sizes. Adding the builtin reduce functions and extending those to user >> defined reduce functions should be straightforward. >> >> On Fri, Jul 24, 2020 at 9:39 AM Robert Newson <rnew...@apache.org> wrote: >>> >>> >>> I’m happy to restrict my PR comments to the actual diff, yes. So I’m not +1 >>> yet. >>> >>> I fixed the spurious conflicts at >>> https://github.com/apache/couchdb/pull/3033. >>> >>> -- >>> Robert Samuel Newson >>> rnew...@apache.org >>> >>> On Fri, 24 Jul 2020, at 14:59, Garren Smith wrote: >>>> Ok so just to confirm, we keep my PR as-is with ebtree only for reduce. We >>>> can get that ready to merge into fdb master. We can then use that to battle >>>> test ebtree and then look at using it for the map side as well. At that >>>> point we would combine the reduce and map index into a ebtree index. Are >>>> you happy with that? >>>> >>>> Cheers >>>> Garren >>>> >>>> On Fri, Jul 24, 2020 at 3:48 PM Robert Newson <rnew...@apache.org> wrote: >>>> >>>>> Hi, >>>>> >>>>> It’s not as unknown as you think but certainly we need empirical data to >>>>> guide us on the reduce side. I’m also fine with continuing with the >>>>> map-only code as it stands today until such time as we demonstrate ebtree >>>>> meets or exceeds our needs (and I freely accept the possibility that it >>>>> might not). >>>>> >>>>> I think the principal enhancement to ebtree that would address most >>>>> concerns is if it could store the leaf entries vertically (as they are >>>>> currently). I have some thoughts which I’ll try to realise as working >>>>> code. >>>>> >>>>> I’ve confirmed that I do create spurious conflicts and will have a PR up >>>>> today to fix that. >>>>> >>>>> -- >>>>> Robert Samuel Newson >>>>> rnew...@apache.org >>>>> >>>>> On Fri, 24 Jul 2020, at 12:43, Garren Smith wrote: >>>>>> Hi Bob, >>>>>> >>>>>> Thanks for that explanation, that is really helpful and it is good we >>>>> have >>>>>> some options but it also does highlight a lot of unknowns. Whereas our >>>>>> current map indexes are really simple and we know its behaviour. There >>>>> are >>>>>> no real unknowns. Adding ebtree here could make map indexes a fair bit >>>>> more >>>>>> complicated since we don't know the effect of managing different node >>>>>> sizes, concurrent doc updates, and querying performance. >>>>>> >>>>>> Could we do a compromise here, could we look at using ebtree only for >>>>>> reduces now? That means that we can have reduce indexes working quite >>>>> soon >>>>>> on CouchDB on FDB. At the same time, we can work on ebtree and run >>>>>> performance tests on ebtree for map indexes. Some interesting tests we >>>>> can >>>>>> do is see if a user emits KV's near the limits (8KB for keys and 50KB for >>>>>> values) how does ebtree handle that? How it handles being updated in the >>>>>> doc update transaction. And general query performance. What do you think? >>>>>> >>>>>> Cheers >>>>>> Garren >>>>>> >>>>>> >>>>>> On Fri, Jul 24, 2020 at 10:06 AM Robert Samuel Newson < >>>>> rnew...@apache.org> >>>>>> wrote: >>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> A short preface at this stage is needed I think: >>>>>>> >>>>>>> My goal with ebtree was to implement a complete and correct b+tree that >>>>>>> also calculated and maintained inner reductions. I consciously chose >>>>> not to >>>>>>> go further than that before presenting it to the group for wider debate >>>>>>> (indeed, the very debate we're having in this thread). >>>>>>> >>>>>>> -- >>>>>>> >>>>>>> Ebtree is not single writer, at least not inherently. Two updates to >>>>> the >>>>>>> same ebtree should both succeed as long as they don't modify the same >>>>> nodes. >>>>>>> >>>>>>> Modifying the same node is likely where there's a reduce function, >>>>> though >>>>>>> only if the reduction value actually changes, as that percolates up the >>>>>>> tree. Where the reduce value does not change and when neither >>>>> transaction >>>>>>> causes a split, rebalance, or merge that affects the other, they should >>>>>>> both commit without conflict. There is a rich history of optimizations >>>>> in >>>>>>> this space specifically around btrees. ebtree might cause spurious >>>>>>> conflicts today, I will investigate and propose fixes if so (e.g, I >>>>> think I >>>>>>> probably call erlfdb:set on nodes that have not changed). >>>>>>> >>>>>>> I certainly envisaged updating ebtree within the same txn as a doc >>>>> update, >>>>>>> which is why the first argument to all the public functions can be >>>>> either >>>>>>> an erlfdb Db or open Tx. >>>>>>> >>>>>>> Parallelising the initial build is more difficult with ebtree than >>>>>>> parallelising the existing couch_views map-only code, though it's worth >>>>>>> noting that ebtree:insert benefits a great deal from batching (multiple >>>>>>> calls to :insert in the same transaction). For an offline build (i.e, >>>>> an >>>>>>> index that the client cannot see until the entire build is complete) >>>>> the >>>>>>> batch size can be maximised. That is still a serial process in that >>>>> there >>>>>>> is only one transaction at a time updating the ebtree. I can't say >>>>> offhand >>>>>>> how fast that is in practice, but it is clearly less powerful than a >>>>> fully >>>>>>> parallelisable approach could be. >>>>>>> >>>>>>> Any parallel build would require a way to divide the database into >>>>>>> non-overlapping subsets of emitted keys. This is easy and natural if >>>>> the >>>>>>> fdb key is the emitted key, which is the case for the couch_views >>>>> map-only >>>>>>> code. For ebtree it might be enough to simply grab a large chunk of >>>>>>> documents, perform the map transform, and then issues multiple >>>>> transactions >>>>>>> on subsets of those. >>>>>>> >>>>>>> Another common technique for btrees is bulk loading (more or less >>>>>>> literally constructing the btree nodes directly from the source, as >>>>> long as >>>>>>> you can sort it), which might be an option as well. >>>>>>> >>>>>>> Parallelising a build _with_ a reduce function seems hard however we do >>>>>>> it. The non-ebtree approach is parallelisable by virtue of paring down >>>>> the >>>>>>> reduce functionality itself (only whole key groups, only those >>>>> functions >>>>>>> that fdb has atomic operations for). >>>>>>> >>>>>>> I will first of all verify the multi-writer nature of ebtree as it >>>>> stands >>>>>>> today and make a PR which fixes any spurious conflicts, and then ponder >>>>>>> further how true parallel builds might be possible. >>>>>>> >>>>>>> >>>>>>>> On 24 Jul 2020, at 07:30, Garren Smith <gar...@apache.org> wrote: >>>>>>>> >>>>>>>> We haven't spoken much about updates with ebtree. >>>>>>>> From my understanding of ebtree it can only do single writer, is that >>>>>>>> correct?. If that is true it means we would not be able to >>>>> parallelize >>>>>>> the >>>>>>>> background building of views to speed up view builds. >>>>>>>> The other thing that would mean is that we cannot use it for mango >>>>> where >>>>>>> we >>>>>>>> update the view in the doc transaction. >>>>>>>> >>>>>>>> Cheers >>>>>>>> Garren >>>>>>>> >>>>>>>> On Fri, Jul 24, 2020 at 2:35 AM Kyle Snavely <kjsnav...@gmail.com> >>>>>>> wrote: >>>>>>>> >>>>>>>>> When in doubt, throw us a build at Cloudant with ebtree maps and >>>>> we'll >>>>>>> see >>>>>>>>> if it comes close to the crazy fast KV map queries. >>>>>>>>> >>>>>>>>> Kyle >>>>>>>>> >>>>>>>>> On Thu, Jul 23, 2020, 2:17 PM Robert Samuel Newson < >>>>> rnew...@apache.org> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> I (perhaps obviously) don't agree that I'm tying myself to old >>>>> CouchDB >>>>>>> or >>>>>>>>>> failing to embrace FDB. FDB is a toolkit and does not, to my mind, >>>>>>> force >>>>>>>>> us >>>>>>>>>> down any particular path. >>>>>>>>>> >>>>>>>>>> I haven't sat down to modify couch_views in the manner I've >>>>> suggested >>>>>>>>>> (where ebtree is used as designed; being fed the output of the >>>>> emit() >>>>>>>>> calls >>>>>>>>>> and calculating reductions as it does so) but I think it's a >>>>> worthwhile >>>>>>>>>> exercise. I'd be surprised if performance of map-only traversals >>>>> would >>>>>>> be >>>>>>>>>> disappointing but who knows? I also expect it would allow for >>>>>>> significant >>>>>>>>>> simplification of the code, which is one of the higher virtues. >>>>>>>>>> >>>>>>>>>> Adam, can you describe in a little more detail how you picture >>>>> "b+tree >>>>>>> is >>>>>>>>>> only used for incremental aggregations,"? It's implied in your >>>>> reply >>>>>>> that >>>>>>>>>> it would preserve the "interesting property" of keeping user data >>>>> out >>>>>>> of >>>>>>>>>> FDB Keys (for casual readers: the new native database encryption, >>>>>>> called >>>>>>>>>> "aegis", only encrypts the FDB value. It can't encrypt the key as >>>>> this >>>>>>>>>> would change the order of keys, which the current code depends on). >>>>>>> Did I >>>>>>>>>> misread you? >>>>>>>>>> >>>>>>>>>> B. >>>>>>>>>> >>>>>>>>>>> On 23 Jul 2020, at 20:11, Adam Kocoloski <kocol...@apache.org> >>>>> wrote: >>>>>>>>>>> >>>>>>>>>>> OK thanks for the clarification. As I said I wasn’t all that >>>>> confident >>>>>>>>> I >>>>>>>>>> understood the design :) >>>>>>>>>>> >>>>>>>>>>> I like the idea that the b+tree is only used for incremental >>>>>>>>>> aggregations, rather than storing the entire materialized view, >>>>> for the >>>>>>>>>> same reasons that Garren stated. >>>>>>>>>>> >>>>>>>>>>> An index maintained entirely in ebtree has the interesting >>>>> property >>>>>>>>> that >>>>>>>>>> it does not leak any user data into FDB Keys, which could be >>>>> attractive >>>>>>>>> for >>>>>>>>>> security reasons. >>>>>>>>>>> >>>>>>>>>>> Adam >>>>>>>>>>> >>>>>>>>>>>> On Jul 23, 2020, at 1:54 PM, Garren Smith <gar...@apache.org> >>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>> On Thu, Jul 23, 2020 at 6:55 PM Paul Davis < >>>>>>>>> paul.joseph.da...@gmail.com >>>>>>>>>>> >>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>>>> I would like to keep ebtree to use just for the reduce index. >>>>>>>>>>>>> >>>>>>>>>>>>> Could you expand on your reasoning here, Garren? I haven't done >>>>> any >>>>>>>>>>>>> experiments on my own to understand if I'm missing something >>>>>>>>>>>>> important. My initial reaction is to not diverge too far from >>>>> the >>>>>>>>>>>>> previous shape of the implementation since we have a decent >>>>> idea of >>>>>>>>>>>>> how that behaves already but perhaps you've seen or measured >>>>>>>>> something >>>>>>>>>>>>> I'm not thinking of? >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> I think this must have been a misunderstanding on my part. I >>>>> always >>>>>>>>>> thought >>>>>>>>>>>> of using ebtree to solve reduce and wasn't planning to use it >>>>> for the >>>>>>>>>> map >>>>>>>>>>>> index. >>>>>>>>>>>> I don't like the idea that we have ordered distributed key/value >>>>>>> store >>>>>>>>>> and >>>>>>>>>>>> then implementing a b-tree on top of that for indexing. >>>>> Especially >>>>>>>>>> since we >>>>>>>>>>>> know that the current map index is fast, >>>>>>>>>>>> easy to use, and follows recommended practices in the FDB >>>>> community >>>>>>> on >>>>>>>>>> how >>>>>>>>>>>> to do indexing. ebtree makes sense for reduce and is a means to >>>>> an >>>>>>> end >>>>>>>>>> to >>>>>>>>>>>> give us CouchDB's reduce api, which is heavily reliant on a >>>>> b-tree, >>>>>>>>> with >>>>>>>>>>>> CouchDB on FDB. This feels like a step backward and I worry we >>>>> are >>>>>>>>> tying >>>>>>>>>>>> ourselves heavily to old CouchDB instead of using the fact that >>>>> we >>>>>>>>>> moving >>>>>>>>>>>> to FDB which then allows us to design new api's and >>>>> functionality. >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>> >>>> >