Re: [DISCUSS] Reduce on FDB take 3

Jan Lehnardt Sun, 26 Jul 2020 05:28:22 -0700

Heya Bob,

this sounds like a good path forward that allows us to validate the remaining 
pieces of this.


Just two side notes:

- do we expect that the map-keys-in-fdb approach would sort differently from 
map-keys-in-ebtree? (I assume no)
- I further assume that we’ll find a nicer UX for “make sure my views are 
encrypted” than “make sure your view has a reduce fun”, no details or drafts 
necessary at this point.

Best
Jan
—

> On 26. Jul 2020, at 14:13, Robert Samuel Newson <rnew...@apache.org> wrote:
> 
> I'd like to find consensus on a path forward and, with that in mind, I have a 
> suggestion.
> 
> Before that, I think it's useful to lay out the points of contention, which 
> I've ranked in I hope an uncontroversial order;
> 
> The principal concern with using ebtree for map as well as reduce is that it 
> precludes the use of foundationdb's ordering, on which much performance and 
> scalability likely hinges. I agree with this on principle (that is, its 
> demonstrably true) though none of us yet know the extent to which it matters 
> in practice, or whether any gap could be closed with optimization.
> 
> The secondary concern is the proposed use of ebtree for reduce-only where it 
> receives only the reduce values from a document. I understand garren's stated 
> desire to avoid duplication with the map side but I don't think it works out. 
> Users of the reduce function are using it to get the reductions _across_ 
> documents. It is rare (that is, I don't recall seeing an example by a 
> cloudant customer) to emit multiple rows in a map function for the same key. 
> In the cases where it doesn't, the value inserted into ebtree is the same as 
> the map value anyway. Where it does happen, we're still duplicating the key 
> and _a_ document level value. It seems a very modest saving of space for a 
> more complicated design (two separate levels of reduce calculated in two 
> separate modules).
> 
> The tertiary concern is that the map-only indexes place user data outside of 
> aegis (native database encryption) as, necessarily, the order of keys is 
> important and a straightforward encryption there would not preserve it.
> 
> Looking at all the options available I don't think there is a single path 
> that addresses everything. There are also significant unknowns around 
> practical performance for all of this.
> 
> So, I suggest that the map-only code (specifically, the code we will use for 
> an index definition that does not specify a reduce function) uses the 
> existing couch_views approach, where the fdb keys are suffixed with the 
> emitted key from the map function, and the fdb values are the emitted values. 
> This vertical index should support the fastest ?key= and startkey/endkey= 
> queries that fdb is capable of. There might be simplifications we can make to 
> this code if we accept that it only needs to handle map-only indexes.
> 
> For index definitions that _do_ specify a reduce function, I suggest this is 
> done entirely with ebtree.We would call ebtree:insert(Db, Tree, Key, Value) 
> where Key and Value are the emitted key and value from the map function. 
> ebtree:open is handed a reduce_fun which handles the reduce and rereduce for 
> the users reduction and, like couchdb 1/2/3, we also mix in a system 
> reduction that calculates the view size (paul's diff above appears to do just 
> this portion and looks broadly right).
> 
> This avoids the duplication that garren's idea of storing the reduces in 
> ebtree was trying to avoid, but does so in all cases. This approach allows us 
> to compare the performance of map queries against map-only and map-reduce 
> indexes, it allows users to opt in or out of encrypting their emitted view 
> keys, and it doesn't prevent anyone from choosing to do both (You can define 
> a view with map function A and another view with map function A and reduce 
> function B, the former will be the couch_views vertical index, the latter an 
> ebtree "horizontal" index).
> 
> As we learn more over time we could enhance one type of index or the other 
> and reach a point where we eliminate one, or the two could coexist 
> indefinitely.
> 
>> On 24 Jul 2020, at 20:00, Paul Davis <paul.joseph.da...@gmail.com> wrote:
>> 
>> FWIW, a first pass at views entirely on ebtree turned out to be fairly
>> straightforward. Almost surprisingly simple in some cases.
>> 
>> https://github.com/apache/couchdb/compare/prototype/fdb-layer...prototype/fdb-layer-ebtree-views
>> 
>> Its currently passing all tests in `couch_views_map_test.erl` but is
>> failing on other suites. I only did a quick skim on the failures but
>> they all look superficial around some APIs I changed.
>> 
>> I haven't added the APIs to query reduce functions via HTTP but the
>> reduce functions are being executed to calculate row counts and KV
>> sizes. Adding the builtin reduce functions and extending those to user
>> defined reduce functions should be straightforward.
>> 
>> On Fri, Jul 24, 2020 at 9:39 AM Robert Newson <rnew...@apache.org> wrote:
>>> 
>>> 
>>> I’m happy to restrict my PR comments to the actual diff, yes. So I’m not +1 
>>> yet.
>>> 
>>> I fixed the spurious conflicts at 
>>> https://github.com/apache/couchdb/pull/3033.
>>> 
>>> --
>>> Robert Samuel Newson
>>> rnew...@apache.org
>>> 
>>> On Fri, 24 Jul 2020, at 14:59, Garren Smith wrote:
>>>> Ok so just to confirm, we keep my PR as-is with ebtree only for reduce. We
>>>> can get that ready to merge into fdb master. We can then use that to battle
>>>> test ebtree and then look at using it for the map side as well. At that
>>>> point we would combine the reduce and map index into a ebtree index. Are
>>>> you happy with that?
>>>> 
>>>> Cheers
>>>> Garren
>>>> 
>>>> On Fri, Jul 24, 2020 at 3:48 PM Robert Newson <rnew...@apache.org> wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> It’s not as unknown as you think but certainly we need empirical data to
>>>>> guide us on the reduce side. I’m also fine with continuing with the
>>>>> map-only code as it stands today until such time as we demonstrate ebtree
>>>>> meets or exceeds our needs (and I freely accept the possibility that it
>>>>> might not).
>>>>> 
>>>>> I think the principal enhancement to ebtree that would address most
>>>>> concerns is if it could store the leaf entries vertically (as they are
>>>>> currently). I have some thoughts which I’ll try to realise as working 
>>>>> code.
>>>>> 
>>>>> I’ve confirmed that I do create spurious conflicts and will have a PR up
>>>>> today to fix that.
>>>>> 
>>>>> --
>>>>> Robert Samuel Newson
>>>>> rnew...@apache.org
>>>>> 
>>>>> On Fri, 24 Jul 2020, at 12:43, Garren Smith wrote:
>>>>>> Hi Bob,
>>>>>> 
>>>>>> Thanks for that explanation, that is really helpful and it is good we
>>>>> have
>>>>>> some options but it also does highlight a lot of unknowns. Whereas our
>>>>>> current map indexes are really simple and we know its behaviour. There
>>>>> are
>>>>>> no real unknowns. Adding ebtree here could make map indexes a fair bit
>>>>> more
>>>>>> complicated since we don't know the effect of managing different node
>>>>>> sizes, concurrent doc updates, and querying performance.
>>>>>> 
>>>>>> Could we do a compromise here, could we look at using ebtree only for
>>>>>> reduces now? That means that we can have reduce indexes working quite
>>>>> soon
>>>>>> on CouchDB on FDB. At the same time, we can work on ebtree and run
>>>>>> performance tests on ebtree for map indexes. Some interesting tests we
>>>>> can
>>>>>> do is see if a user emits KV's near the limits (8KB for keys and 50KB for
>>>>>> values) how does ebtree handle that? How it handles being updated in the
>>>>>> doc update transaction. And general query performance. What do you think?
>>>>>> 
>>>>>> Cheers
>>>>>> Garren
>>>>>> 
>>>>>> 
>>>>>> On Fri, Jul 24, 2020 at 10:06 AM Robert Samuel Newson <
>>>>> rnew...@apache.org>
>>>>>> wrote:
>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> A short preface at this stage is needed I think:
>>>>>>> 
>>>>>>> My goal with ebtree was to implement a complete and correct b+tree that
>>>>>>> also calculated and maintained inner reductions. I consciously chose
>>>>> not to
>>>>>>> go further than that before presenting it to the group for wider debate
>>>>>>> (indeed, the very debate we're having in this thread).
>>>>>>> 
>>>>>>> --
>>>>>>> 
>>>>>>> Ebtree is not single writer, at least not inherently. Two updates to
>>>>> the
>>>>>>> same ebtree should both succeed as long as they don't modify the same
>>>>> nodes.
>>>>>>> 
>>>>>>> Modifying the same node is likely where there's a reduce function,
>>>>> though
>>>>>>> only if the reduction value actually changes, as that percolates up the
>>>>>>> tree. Where the reduce value does not change and when neither
>>>>> transaction
>>>>>>> causes a split, rebalance, or merge that affects the other, they should
>>>>>>> both commit without conflict. There is a rich history of optimizations
>>>>> in
>>>>>>> this space specifically around btrees. ebtree might cause spurious
>>>>>>> conflicts today, I will investigate and propose fixes if so (e.g, I
>>>>> think I
>>>>>>> probably call erlfdb:set on nodes that have not changed).
>>>>>>> 
>>>>>>> I certainly envisaged updating ebtree within the same txn as a doc
>>>>> update,
>>>>>>> which is why the first argument to all the public functions can be
>>>>> either
>>>>>>> an erlfdb Db or open Tx.
>>>>>>> 
>>>>>>> Parallelising the initial build is more difficult with ebtree than
>>>>>>> parallelising the existing couch_views map-only code, though it's worth
>>>>>>> noting that ebtree:insert benefits a great deal from batching (multiple
>>>>>>> calls to :insert in the same transaction). For an offline build (i.e,
>>>>> an
>>>>>>> index that the client cannot see until the entire build is complete)
>>>>> the
>>>>>>> batch size can be maximised. That is still a serial process in that
>>>>> there
>>>>>>> is only one transaction at a time updating the ebtree. I can't say
>>>>> offhand
>>>>>>> how fast that is in practice, but it is clearly less powerful than a
>>>>> fully
>>>>>>> parallelisable approach could be.
>>>>>>> 
>>>>>>> Any parallel build would require a way to divide the database into
>>>>>>> non-overlapping subsets of emitted keys. This is easy and natural if
>>>>> the
>>>>>>> fdb key is the emitted key, which is the case for the couch_views
>>>>> map-only
>>>>>>> code. For ebtree it might be enough to simply grab a large chunk of
>>>>>>> documents, perform the map transform, and then issues multiple
>>>>> transactions
>>>>>>> on subsets of those.
>>>>>>> 
>>>>>>> Another common technique for btrees is bulk loading (more or less
>>>>>>> literally constructing the btree nodes directly from the source, as
>>>>> long as
>>>>>>> you can sort it), which might be an option as well.
>>>>>>> 
>>>>>>> Parallelising a build _with_ a reduce function seems hard however we do
>>>>>>> it. The non-ebtree approach is parallelisable by virtue of paring down
>>>>> the
>>>>>>> reduce functionality itself (only whole key groups, only those
>>>>> functions
>>>>>>> that fdb has atomic operations for).
>>>>>>> 
>>>>>>> I will first of all verify the multi-writer nature of ebtree as it
>>>>> stands
>>>>>>> today and make a PR which fixes any spurious conflicts, and then ponder
>>>>>>> further  how true parallel builds might be possible.
>>>>>>> 
>>>>>>> 
>>>>>>>> On 24 Jul 2020, at 07:30, Garren Smith <gar...@apache.org> wrote:
>>>>>>>> 
>>>>>>>> We haven't spoken much about updates with ebtree.
>>>>>>>> From my understanding of ebtree it can only do single writer, is that
>>>>>>>> correct?. If that is true it means we would not be able to
>>>>> parallelize
>>>>>>> the
>>>>>>>> background building of views to speed up view builds.
>>>>>>>> The other thing that would mean is that we cannot use it for mango
>>>>> where
>>>>>>> we
>>>>>>>> update the view in the doc transaction.
>>>>>>>> 
>>>>>>>> Cheers
>>>>>>>> Garren
>>>>>>>> 
>>>>>>>> On Fri, Jul 24, 2020 at 2:35 AM Kyle Snavely <kjsnav...@gmail.com>
>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> When in doubt, throw us a build at Cloudant with ebtree maps and
>>>>> we'll
>>>>>>> see
>>>>>>>>> if it comes close to the crazy fast KV map queries.
>>>>>>>>> 
>>>>>>>>> Kyle
>>>>>>>>> 
>>>>>>>>> On Thu, Jul 23, 2020, 2:17 PM Robert Samuel Newson <
>>>>> rnew...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> I (perhaps obviously) don't agree that I'm tying myself to old
>>>>> CouchDB
>>>>>>> or
>>>>>>>>>> failing to embrace FDB. FDB is a toolkit and does not, to my mind,
>>>>>>> force
>>>>>>>>> us
>>>>>>>>>> down any particular path.
>>>>>>>>>> 
>>>>>>>>>> I haven't sat down to modify couch_views in the manner I've
>>>>> suggested
>>>>>>>>>> (where ebtree is used as designed; being fed the output of the
>>>>> emit()
>>>>>>>>> calls
>>>>>>>>>> and calculating reductions as it does so) but I think it's a
>>>>> worthwhile
>>>>>>>>>> exercise. I'd be surprised if performance of map-only traversals
>>>>> would
>>>>>>> be
>>>>>>>>>> disappointing but who knows? I also expect it would allow for
>>>>>>> significant
>>>>>>>>>> simplification of the code, which is one of the higher virtues.
>>>>>>>>>> 
>>>>>>>>>> Adam, can you describe in a little more detail how you picture
>>>>> "b+tree
>>>>>>> is
>>>>>>>>>> only used for incremental aggregations,"? It's implied in your
>>>>> reply
>>>>>>> that
>>>>>>>>>> it would preserve the "interesting property" of keeping user data
>>>>> out
>>>>>>> of
>>>>>>>>>> FDB Keys (for casual readers: the new native database encryption,
>>>>>>> called
>>>>>>>>>> "aegis", only encrypts the FDB value. It can't encrypt the key as
>>>>> this
>>>>>>>>>> would change the order of keys, which the current code depends on).
>>>>>>> Did I
>>>>>>>>>> misread you?
>>>>>>>>>> 
>>>>>>>>>> B.
>>>>>>>>>> 
>>>>>>>>>>> On 23 Jul 2020, at 20:11, Adam Kocoloski <kocol...@apache.org>
>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> OK thanks for the clarification. As I said I wasn’t all that
>>>>> confident
>>>>>>>>> I
>>>>>>>>>> understood the design :)
>>>>>>>>>>> 
>>>>>>>>>>> I like the idea that the b+tree is only used for incremental
>>>>>>>>>> aggregations, rather than storing the entire materialized view,
>>>>> for the
>>>>>>>>>> same reasons that Garren stated.
>>>>>>>>>>> 
>>>>>>>>>>> An index maintained entirely in ebtree has the interesting
>>>>> property
>>>>>>>>> that
>>>>>>>>>> it does not leak any user data into FDB Keys, which could be
>>>>> attractive
>>>>>>>>> for
>>>>>>>>>> security reasons.
>>>>>>>>>>> 
>>>>>>>>>>> Adam
>>>>>>>>>>> 
>>>>>>>>>>>> On Jul 23, 2020, at 1:54 PM, Garren Smith <gar...@apache.org>
>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> On Thu, Jul 23, 2020 at 6:55 PM Paul Davis <
>>>>>>>>> paul.joseph.da...@gmail.com
>>>>>>>>>>> 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>>> I would like to keep ebtree to use just for the reduce index.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Could you expand on your reasoning here, Garren? I haven't done
>>>>> any
>>>>>>>>>>>>> experiments on my own to understand if I'm missing something
>>>>>>>>>>>>> important. My initial reaction is to not diverge too far from
>>>>> the
>>>>>>>>>>>>> previous shape of the implementation since we have a decent
>>>>> idea of
>>>>>>>>>>>>> how that behaves already but perhaps you've seen or measured
>>>>>>>>> something
>>>>>>>>>>>>> I'm not thinking of?
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> I think this must have been a misunderstanding on my part. I
>>>>> always
>>>>>>>>>> thought
>>>>>>>>>>>> of using ebtree to solve reduce and wasn't planning to use it
>>>>> for the
>>>>>>>>>> map
>>>>>>>>>>>> index.
>>>>>>>>>>>> I don't like the idea that we have ordered distributed key/value
>>>>>>> store
>>>>>>>>>> and
>>>>>>>>>>>> then implementing a b-tree on top of that for indexing.
>>>>> Especially
>>>>>>>>>> since we
>>>>>>>>>>>> know that the current map index is fast,
>>>>>>>>>>>> easy to use, and follows recommended practices in the FDB
>>>>> community
>>>>>>> on
>>>>>>>>>> how
>>>>>>>>>>>> to do indexing. ebtree makes sense for reduce and is a means to
>>>>> an
>>>>>>> end
>>>>>>>>>> to
>>>>>>>>>>>> give us CouchDB's reduce api, which is heavily reliant on a
>>>>> b-tree,
>>>>>>>>> with
>>>>>>>>>>>> CouchDB on FDB. This feels like a step backward and I worry we
>>>>> are
>>>>>>>>> tying
>>>>>>>>>>>> ourselves heavily to old CouchDB instead of using the fact that
>>>>> we
>>>>>>>>>> moving
>>>>>>>>>>>> to FDB which then allows us to design new api's and
>>>>> functionality.
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>

Re: [DISCUSS] Reduce on FDB take 3

Reply via email to