Re: [DISCUSS] Rebase CouchDB on top of FoundationDB

Robert Samuel Newson Thu, 24 Jan 2019 02:21:34 -0800

Hi,

Thank you for the in-depth response, that’s exactly what the PMC is looking for.


You have understood the nature and magnitude of the change correctly when you 
suggest we could “just” write a new CouchDB Layer on top of FoundationDB and 
achieve a similar effect. However, in my opinion, the nature of software and 
software development argues against doing it that way. In 
2.0 we introduced an abstraction between the HTTP processing layer and the 
lower plumbing of b-trees and file I/O with the “fabric” application. This was 
essential to introduce clustering but it was a significant architectural 
improvement in its own right. By reimplementing below that line we can be more 
confident that we have preserved all the necessary parts of the CouchDB API and 
experience. Additionally, separate applications like the replicator and job 
scheduler can remain as they are. A lot of the existing code will remain as-is, 
or have minor changes or cleanup (the “local” mode for replication, unreachable 
since 2.0, can finally be excised, for example).
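
To make "reimplementing below that line" a little more concrete, here is a 
minimal sketch in Python (not Erlang, and not actual CouchDB code). Every name 
in it is an assumption made for illustration: DocStore stands in for the fabric 
boundary, LocalFileStore for the current b-tree and file plumbing, and 
OrderedKVStore for an ordered key-value store such as FoundationDB. It only 
shows the shape of the idea, that code above the line does not change when the 
storage below it is swapped.

```python
from abc import ABC, abstractmethod
from bisect import insort


class DocStore(ABC):
    """Hypothetical stand-in for the fabric boundary; callers see only this."""

    @abstractmethod
    def open_doc(self, db, doc_id):
        ...

    @abstractmethod
    def update_doc(self, db, doc_id, body):
        ...


class LocalFileStore(DocStore):
    """Illustrative stand-in for the existing plumbing below the line."""

    def __init__(self):
        self._dbs = {}  # db name -> {doc_id: body}

    def open_doc(self, db, doc_id):
        return self._dbs.get(db, {}).get(doc_id)

    def update_doc(self, db, doc_id, body):
        self._dbs.setdefault(db, {})[doc_id] = body


class OrderedKVStore(DocStore):
    """Illustrative stand-in for an ordered key-value store.

    Keys are (db, doc_id) tuples kept in sorted order; this is an
    assumption for the sketch, not a proposed data model.
    """

    def __init__(self):
        self._keys = []    # sorted list of (db, doc_id)
        self._values = {}  # (db, doc_id) -> body

    def open_doc(self, db, doc_id):
        return self._values.get((db, doc_id))

    def update_doc(self, db, doc_id, body):
        key = (db, doc_id)
        if key not in self._values:
            insort(self._keys, key)  # keep keys ordered on first insert
        self._values[key] = body


def handle_put_then_get(store, db, doc_id, body):
    """Stand-in for the HTTP layer: it never learns which backend is used."""
    store.update_doc(db, doc_id, body)
    return store.open_doc(db, doc_id)
```

Calling handle_put_then_get with either store gives the same result, which is 
the property we would be relying on to keep the API and the applications above 
the line unchanged.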

To your other point, I remember the difficulty I first had when looking at 
CouchDB. It’s in Erlang, which I’d not used before, and there is a lot of 
subtle and tricky code at the lower tiers (see couch_key_tree.erl or 
couch_btree.erl). By using FoundationDB for that instead I hope we _increase_ 
the comprehensibility of CouchDB, as what remains will be its essential nature 
and not the important but ancillary plumbing below. The increased public 
development activity on CouchDB, the size of the ambition here, and some 
cross-pollinating interest from those who know or are interested in 
FoundationDB should, I hope, bring more active developers of all levels of 
experience and interest to our project.

B.

> On 23 Jan 2019, at 23:27, Michael Fair <mich...@daclubhouse.net> wrote:
> 
> As someone who isn't as directly involved in the release-to-release
> development, would a move like this make it easier or harder for new/casual
> community members to get up to speed/understand what's going on?
> 
> As projects grow and mature, the introductory learning curve tends to get
> steeper, making it harder for people who didn't "grow up with the project"
> to grok the project as a whole thing.  Not complaining, just identifying.
> 
> Is this proposal suggesting something more akin to a storage layer
> separation (making it somewhat easier to identify the separate component
> layers and experiment with different backends) or more like a storage
> technology change (where any experimenter would first have to understand
> how FDB semantics are different from File I/O semantics)?
> 
> All in all it sounds like a promising proposal.
> My first thought was something like "Hmm, is this different than simply
> adding a 'Couch Replication Protocol' module to FoundationDB? Probably, or
> they wouldn't be proposing it this way"
> 
> Followed quickly by, "Okay, looks like I'll likely need to start learning
> FoundationDB now too if I really want to understand CouchDB's
> capabilities.  I've not really heard much/looked at it before..."
> 
> I don't think a new learning curve should dissuade people from adopting it,
> but as I haven't looked at the educational materials available, I can't
> speak to the level of "ownership" the general community would be able to
> keep.
> 
> My experience is, generally speaking, people simply avoid aspects of a
> project they don't feel competent in.  Leaving that work to those with
> stronger opinions/convictions/interest. And that the easier it is to
> independently "get up to speed" on that aspect of the project (reading a
> blog(s)/watching a video(s)/tracing code) the more likely an interested
> party is to contribute there.
> 
> It'd be great to find out that a consequence of this move makes it easier
> for interested people, still unfamiliar with CouchDB's internals, to get
> more involved because there were some great and easily accessible teaching
> materials...
> 
> This concept obviously isn't unique to this FDB proposal; nor is it
> advocating for or against; I guess it's just expressing a hope that the
> impact is made to also help those who would like to get started
> contributing to CouchDB in meaningful ways instead of them getting a new
> and more complicated third party tech dependency to go learn as well.
> 
> Mike
> 
> PS While I assume there's likely very clear answers, does this differ
> significantly from the idea of giving FoundationDB a Couch compatible web
> API interface?  Like instead of making FoundationDB "the storage backend"
> for Couch, why not add a Couch compatible web interface front end to
> FoundationDB?  Is there a lot of useful Couch code in between those two
> things?
> 
> 
> On Wed, Jan 23, 2019 at 12:20 PM Joan Touzet <woh...@apache.org> wrote:
> 
>> Hi everyone,
>> 
>> As Jan mentions, the PMC has had a couple of weeks to prepare on this.
>> 
>> As a non-IBMer (though an ex-IBM-er and ex-Cloudant-er), I've had my
>> Apache PMC hat on the entire time, considering all of the things
>> that Jan mentions and more. My primary concern has been ensuring that,
>> should this go forward, it happens in the project's best
>> interest.
>> 
>> During the analysis process I came up with 8 serious topics that we
>> need to sort out:
>> 
>> * RFC process - how major changes are proposed/designed/accepted,
>>                see new GitHub template for a preview on this
>> 
>> * Bylaws review - namely, should we insist on +1s from outside
>>                  your company for big things? Plus RFC/deprecations.
>> 
>> * Roadmap - we have a roadmap from ~24 months ago that represented
>>            our goals for CouchDB 2.x and 3.x. What happens to it?
>>            https://s.apache.org/couch2xroadmap
>> 
>> * Onboarding - better mentoring in The Apache Way and The CouchDB
>>               Way for new members (from IBM and elsewhere)
>> 
>> * (Re-)Branding - how do we differentiate between "CouchDB Classic"
>>                  and "New CouchDB" in a succinct and clear way?
>> 
>> * FoundationDB - all the non-technical aspects. Review of _their_
>>                 project governance, cross-project pollination, us
>>                 learning the core and pros/cons, identifying who
>>                 will actually learn that code base, and operational
>>                 considerations. Also: keeping this knowledge public
>>                 and not just "inside IBM's dev/ops teams".
>> 
>> * Proj. Mgmt. - Obviously IBM will have a PM involved. We should too.
>>                Reviewing process/procedure and ensuring a smooth
>>                collaboration is critical. IBM doesn't get to just
>>                throw code over the wall at us. Similarly, should we
>>                choose to work on proposed features, or stuff from
>>                the roadmap, we need to be able to cooperate. No
>>                cookie licking allowed![*]
>> 
>> * Tech deep dives - this will actually be many, many threads I expect,
>>                    including everyone's favourite on release mgmt :P
>> 
>> New threads will be started on these topics by PMC members over the
>> coming days (but not all at once, so everyone has time to reflect and
>> respond.)
>> 
>> My initial take on the proposal: it's GOOD that we're finally
>> addressing some of the problems that 2.x brought to the table, and if
>> this is the best way to do so, then so be it. I want to know more
>> about the technical details, and I want to see a more formal RFC before
>> voting on it, though.
>> 
>> -Joan 'And now for something completely different...' Touzet
>> 
>> [*] http://communitymgt.wikia.com/wiki/Cookie_Licking
>> 
>> 
>> ----- Original Message -----
>>> From: "Jan Lehnardt" <j...@apache.org>
>>> To: "CouchDB Developers" <dev@couchdb.apache.org>
>>> Sent: Wednesday, January 23, 2019 8:33:30 AM
>>> Subject: Re: [DISCUSS] Rebase CouchDB on top of FoundationDB
>>> 
>>> Hi Bob,
>>> 
>>> this is all very exciting!
>>> 
>>> First up, full disclosure, the CouchDB PMC has had about two weeks to
>>> think about this already, so if any of the following doesn’t sound
>>> like a knee-jerk reaction, that’s why.
>>> 
>>> I’m personally tentatively optimistic about this proposal and I’m
>>> willing to work through all open questions from governance,
>>> contribution management to the technical bits to see if we as the
>>> CouchDB project arrive at a point where we are comfortable going
>>> down this path.
>>> 
>>> The PMC has already identified a set of discussion areas for this
>>> dev@ mailing list to go through before any definite decision can be
>>> made. Separate emails for those discussions are going to be posted
>>> on this list shortly, so I won’t go into further detail here.
>>> 
>>> If anyone sees a need for discussion beyond the threads that will
>>> appear here, please speak up at your earliest convenience. This
>>> proposal would mean a big step for our project, and we must make
>>> sure to hear all voices.
>>> 
>>> Once we’ve gone through all this, the resulting answers to all the
>>> open questions coming up will end up in a consensus finding process
>>> on this mailing list, which will signify the final project decision.
>>> 
>>> * * *
>>> 
>>> That said, I’d like to highlight one of these topics: IBM/Cloudant’s
>>> contributions going forward.
>>> 
>>> Looking at how 2.0 came to be, the contributions were mostly taken on
>>> good faith (and legal review), and from the trust Cloudant built up
>>> operating a large number of large instances of clusters of what
>>> would eventually become CouchDB 2.0. It has clearly paid off for
>>> CouchDB and our current level of success wouldn’t have been possible
>>> without IBM/Cloudant.
>>> 
>>> However, some of the ways we work with the IBM team leave things to
>>> be desired. Specifically, the Apache CouchDB community is frequently
>>> not involved in design discussions around new features. Those happen
>>> inside IBM and we “only” get a PR that then goes through the regular
>>> review process. Again, this has served us well, but we can do even
>>> better, so I’d like to take the opportunity of this larger proposal
>>> to suggest we actually do better. As promised, a more detailed
>>> thread about this is going to come up, and it’ll be the right place
>>> to go through the minutiae of this.
>>> 
>>> With this structural change, I believe we are in a great position to
>>> work through the details of this proposal and the subsequent design
>>> and engineering steps.
>>> 
>>> * * *
>>> 
>>> Finally, I want to reiterate Bob’s point: while this proposal is
>>> largely driven by IBM, IBM has no power to unilaterally force the
>>> CouchDB project to accept this proposal and they have already
>>> signalled and worked towards making this a mutually beneficial
>>> endeavour. The CouchDB project has different objectives from IBM and
>>> it is up to us to come up with a proposal that satisfies all of our
>>> objectives as well as IBM’s, should this motion pass.
>>> 
>>> Best
>>> Jan
>>> —
>>> 
>>> 
>>>> On 23. Jan 2019, at 11:00, Robert Samuel Newson
>>>> <rnew...@apache.org> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> CouchDB 2.0 introduced clustering; the ability to scale a single
>>>> database across multiple nodes, increasing both the maximum size
>>>> of a database and adding native fault-tolerance. This welcome and
>>>> considerable step forward was not without its trade-offs. In the
>>>> years since 2.0 was released, users frequently encounter the
>>>> following issues as a direct consequence of the 2.0 clustering
>>>> approach:
>>>> 
>>>> 1. Conflict revisions can be created on normal concurrent updates
>>>> issued to a single database, since each replica of a database
>>>> shard independently chooses whether to accept a given update, and
>>>> all replicas will eventually propagate updates that any one of
>>>> them has chosen to accept.
>>>> 2. Secondary indexes ("views") do not scale the same way as
>>>> document lookups, as they are sharded by doc id, not emitted view
>>>> key (thus forcing a consultation of all shard ranges for each
>>>> query).
>>>> 3. The changes feed is no longer totally ordered and, worse, could
>>>> replay earlier changes in the event of a node failure (even a
>>>> temporary one).
>>>> 
>>>> The idea is to use FoundationDB as the new CouchDB foundational
>>>> layer, letting it take care of data storage and placement. An
>>>> introduction to FoundationDB would take up too much space here so
>>>> I will summarise it as a highly scalable ordered key-value store
>>>> with transactional semantics that provides strong consistency and
>>>> scales from a single node to many. It is licensed under the ASLv2 but is
>>>> not an Apache project.
>>>> 
>>>> By using FoundationDB we can solve all three of the problems listed
>>>> above and deliver semantics much closer to CouchDB 1.x's behaviour
>>>> while improving upon the scalability advantages that 2.0
>>>> introduced. The essential character of CouchDB would be preserved
>>>> (MVCC for documents, replication between CouchDB databases) but
>>>> the underlying plumbing would change significantly. In addition,
>>>> this new foundation will allow us to add long wished-for features
>>>> more easily. For example, multi-document transactions become
>>>> possible, as does efficient field-level reading and writing. A
>>>> further thought is the ability to update views transactionally
>>>> with the database update.
>>>> 
>>>> For those familiar with the CouchDB 2.0 architecture, the proposal
>>>> is, in effect, to change all the functions in fabric.erl so that
>>>> they work against a (possibly remote) FoundationDB cluster instead
>>>> of the current implementation of calling into the original CouchDB
>>>> 1.x code (couch_btree, couch_file, etc).
>>>> 
>>>> This is a large change and, for full disclosure, the IBM Cloudant
>>>> team are proposing it. We have done our due diligence in
>>>> investigating FoundationDB as well as detailed investigation into
>>>> how CouchDB semantics would be built on top of FoundationDB. Any
>>>> and all decisions on that must take place here on the CouchDB
>>>> developer mailing list, of course, but we are confident that this
>>>> is feasible.
>>>> During those investigations we have identified a small number of
>>>> CouchDB features that we do not yet see a way to do on
>>>> FoundationDB, the main one being custom (JavaScript) reduces. This
>>>> is a direct consequence of no longer rolling our own persistence
>>>> layer (couch_btree and friends) and would likely apply to any
>>>> alternative technology.
>>>> 
>>>> I think this would be a great advance for CouchDB, preserving what
>>>> makes CouchDB special but taking advantage of the superbly
>>>> engineered FoundationDB software at the bottom of the stack.
>>>> 
>>>> Regards,
>>>> Robert Newson
>>> 
>>> --
>>> Professional Support for Apache CouchDB:
>>> https://neighbourhood.ie/couchdb-support/
>>> 
>>> 
>>
