Re: [VOTE] Merge BigCouch

Alexander Shorin Wed, 08 May 2013 12:47:32 -0700

+1
--
,,,^..^,,,


On Wed, May 8, 2013 at 11:20 PM, Garren Smith <[email protected]> wrote:
> +1
>
> On 08 May 2013, at 7:04 PM, Joan Touzet <[email protected]> wrote:
>
>> +1
>>
>> On Tue, May 07, 2013 at 04:02:09PM -0500, Paul Davis wrote:
>>> +1
>>>
>>> On Tue, May 7, 2013 at 3:52 PM, Russell Branca <[email protected]> wrote:
>>>> +1
>>>>
>>>>
>>>> Very excited to see this! Great work!
>>>>
>>>>
>>>> -Russell
>>>>
>>>>
>>>> On Tue, May 7, 2013 at 1:44 PM, Robert Newson <[email protected]> wrote:
>>>>
>>>>> FYI: A zip of this work is available at
>>>>> http://people.apache.org/~rnewson/dist/nebraska-merge-candidate.zip
>>>>> made by 'git archive -o nebraska-merge-candidate.zip
>>>>> nebraska-merge-candidate'
>>>>>
>>>>> On 7 May 2013 21:34, Robert Newson <[email protected]> wrote:
>>>>>> Hi All,
>>>>>>
>>>>>> I propose to merge in the following work,
>>>>>> https://github.com/rnewson/couchdb/tree/nebraska-merge-candidate to
>>>>>> the official Apache CouchDB repository to a new branch (i.e., *not*
>>>>>> master). Once there, the full CouchDB developer community can begin
>>>>>> the work to incorporate the code here into an official release.
>>>>>>
>>>>>> You do not need to respond if you are in agreement. If there is no
>>>>>> response in 72 hours, I will assume lazy consensus. If we reach
>>>>>> consensus, I will start the IP clearance process and then the merge.
>>>>>>
>>>>>> As most of you know, Paul Davis and I recently sequestered ourselves
>>>>>> away from society (in a place called Nebraska) to make this merge
>>>>>> happen. I want to clarify that this work is not the BigCouch code you
>>>>>> can see on github.com/cloudant/bigcouch but the Cloudant platform from
>>>>>> which BigCouch was made. This means it is bang up to date with all the
>>>>>> bug fixes and feature enhancements we've made in the last eighteen
>>>>>> months or more. With that clarification made, here are our notes about
>>>>>> what we achieved, what it means to the project and what isn't yet
>>>>>> done;
>>>>>>
>>>>>> Nebraska Merge Roundup
>>>>>>
>>>>>>
>>>>>> Stats:
>>>>>>
>>>>>>
>>>>>> 1402 - total new commits
>>>>>>
>>>>>> 312 - commits written during the merge (will be reduced substantially
>>>>>> by squashing)
>>>>>>
>>>>>> 408 - number of files changed
>>>>>>
>>>>>> 21,897 - number of lines added
>>>>>>
>>>>>> 4,277 - number of lines removed
>>>>>>
>>>>>> A retrospective:
>>>>>>
>>>>>> Bob Newson and I have come to the end of our merge sprint on getting
>>>>>> BigCouch merged into Apache CouchDB. It’s been a productive ten days
>>>>>> here in the midwest. I managed to get Bob out to a bowling alley and
>>>>>> he managed to get me to a sushi restaurant. In between the cultural
>>>>>> exchanges we’ve also managed to get a significant amount of work done
>>>>>> on the merging as well.
>>>>>>
>>>>>>
>>>>>> The current status of the merge is that we’ve managed to resolve the
>>>>>> differences in the single node execution of CouchDB. Both the
>>>>>> JavaScript and Erlang test suites run with only one failure in the
>>>>>> Erlang test suite due to a (deliberately) missing constraint on the
>>>>>> number of operating system processes. This should be a relatively
>>>>>> straightforward fix but was not prioritized during our limited time to
>>>>>> work on the larger issues.
>>>>>>
>>>>>>
>>>>>> We merged a large number of performance and stability enhancements
>>>>>> back into single node CouchDB as well as a number of pure bug fixes.
>>>>>> The biggest highlight is a brand new compactor that is both faster and
>>>>>> creates smaller and better organized post-compaction databases.
>>>>>>
>>>>>>
>>>>>> The current status of the merge is that single node operations should
>>>>>> be completely unaffected as demonstrated by the test suite passing. On
>>>>>> the other hand we haven’t yet finished getting the clustered code
>>>>>> merged to use some of the new changes in single node CouchDB. The
>>>>>> single most significant portion of this work involves updates to the
>>>>>> internal cluster API for views to use the recently rewritten indexer
>>>>>> APIs. This should be a relatively straightforward bit of work that
>>>>>> we’ll be finishing over the next few weeks.
>>>>>>
>>>>>>
>>>>>> All in all the merge work done so far has been quite successful. We’ve
>>>>>> met our primary goal of getting the code merged in a fashion that does
>>>>>> not affect single node operation while providing a starting point for
>>>>>> the larger community to start reviewing the more significant changes
>>>>>> made. Given the size of the diff between the two code bases we never
>>>>>> expected to have a fully working clustered solution after ten days of
>>>>>> work but we have succeeded in providing a base of work that will allow
>>>>>> us and new contributors to get up to speed quickly.
>>>>>>
>>>>>>
>>>>>> This work, coupled with work by Dave Cottlehuber and Benoît Chesneau
>>>>>> on updating the build system and various other internal updates, will
>>>>>> provide a solid foundation for work going forward. It’s an exciting
>>>>>> time for CouchDB and anyone interested should keep an eye on the next
>>>>>> few releases as we ramp up work on various core aspects of the
>>>>>> database.
>>>>>>
>>>>>>
>>>>>> We’ve had an exciting few days working to prepare the road for an
>>>>>> exciting next twelve to eighteen months. We hope that everyone will
>>>>>> feel as excited as we do about the next twelve to eighteen months for
>>>>>> Apache CouchDB. It should be an exciting ride.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Things we got done
>>>>>>
>>>>>>
>>>>>> * Large update to the source tree layout for Erlang applications. Each
>>>>>> application now has a src/appname/(c_src|ebin|priv|src) structure. The
>>>>>> build system has been updated.
>>>>>>
>>>>>> * Renamed src/couchdb to src/couch to match the Erlang convention of
>>>>>> the top directory name matching the Erlang application name.
>>>>>>
>>>>>> * Imported Cloudant Erlang applications for clustered CouchDB. These
>>>>>> are imported with their history by using git subtree and merging the
>>>>>> top level commit. These are not external deps, development will happen
>>>>>> within the CouchDB tree. The imported apps are:
>>>>>>
>>>>>>
>>>>>>   * config - A couch_config replacement (Behavior is mostly identical
>>>>>> to couch_config except how we listen for configuration changes
>>>>>> internally to allow for smooth hot code upgrade).
>>>>>>
>>>>>>   * twig - An rsyslog source replacement for couch_log.
>>>>>>
>>>>>>   * rexi - An RPC library. Replaces Erlang’s built-in rex application
>>>>>> to avoid costly safety measures in the interest of performance and
>>>>>> throughput.
>>>>>>
>>>>>>   * mem3 - The “Dynamo” part of BigCouch responsible for managing
>>>>>> cluster state
>>>>>>
>>>>>>   * fabric - The internal cluster-aware CouchDB API
>>>>>>
>>>>>>   * ets_lru - A small library application that provides an LRU
>>>>>> implementation using a couple ets tables.
>>>>>>
>>>>>>   * ddoc_cache - Caches design documents on each node for use in
>>>>>> design handler functions. This uses an ets_lru cache with a very short
>>>>>> TTL.
>>>>>>
>>>>>>   * chttpd - The cluster aware HTTP layer
>>>>>>
>>>>>>
>>>>>> Each imported app also had its build system updated to use Autotools
>>>>>> along with the necessary updates noted above for the new application
>>>>>> layouts for existing CouchDB erlang apps.
>>>>>>
>>>>>>
>>>>>> * Merged a large number of updates and fixes to couch_replicator based
>>>>>> on work done internally at Cloudant. Unfortunately due to an error
>>>>>> when we created our internal clone we lost a bit of history in some of
>>>>>> the initial merge and have a big commit that affects
>>>>>> couch_replicator_manager mostly. There are a number of other commits
>>>>>> related to couch_replicator that resolve the single node vs. clustered
>>>>>> differences. Some noticeable couch_replicator features:
>>>>>>
>>>>>>
>>>>>>   * Optionally disable checkpoints so that replication can work when
>>>>>> a source is read only. This should only be used for smaller databases
>>>>>> as each replication call has to scan the entire source database on
>>>>>> each invocation.
>>>>>>
>>>>>>   * A new changes_pending field in the _active_tasks output
>>>>>>
>>>>>>   * A fix to the continuous replication to automatically reconnect to
>>>>>> a continuous changes feed when it sees a last_seq value. This allows
>>>>>> for the source to selectively recycle the HTTP connections used which
>>>>>> can be quite useful for “permanent” replications.
>>>>>>
>>>>>>   * A multitude of smaller bug fixes and stability enhancements.
>>>>>>
>>>>>>
>>>>>> Updates to single node couch:
>>>>>>
>>>>>>
>>>>>> * We changed the by_seq tree to store a copy of the #full_doc_info{}
>>>>>> record instead of the #doc_info{} record. This gives significant speed
>>>>>> improvements for compaction and replication and generally anything
>>>>>> that needs to walk the by_seq tree and access document bodies
>>>>>> internally.
>>>>>>
>>>>>> * We rewrote the compactor to be significantly faster and to
>>>>>> produce significantly better compacted databases. The two main halves
>>>>>> are to use a temp file and replace the use of btrees in the temp file.
>>>>>> The temp file only contains a temporary copy of the document ids. At
>>>>>> the end of a compaction run we then rebuild the by_id btree in the
>>>>>> compaction file from this temp file. The reason this helps so much is
>>>>>> that the compaction is based on the update_seq btree, which for most
>>>>>> cases means that the id tree is updated in roughly random order which
>>>>>> is very bad for our append only btrees. By using the tmp file we can
>>>>>> stream it in order back into the compacted db file at the end of
>>>>>> compacting, generating a minimum amount of garbage in the process. The
>>>>>> other upgrade was to implement an external merge sort module
>>>>>> (couch_emsort) that is used with this temporary file.
>>>>>>
>>>>>> * Reject updates to design docs that introduce updates that break
>>>>>> compilation for source code. Currently we only check map and reduce
>>>>>> calls as the other should provide user visible errors instead of
>>>>>> inexplicably empty views.
>>>>>>
>>>>>> because my OCD kicked in and I was unable to resist.
>>>>>>
>>>>>> * Reverted a change made a long time ago that uses two file
>>>>>> descriptors for each database. See the todo list.
>>>>>>
>>>>>> * The reason to remove the second fd is so that we can rewrite ref
>>>>>> counting. Better ref counting makes everyone happy, but the real
>>>>>> reason is for this next bullet point:
>>>>>>
>>>>>> * Optimize couch_server to not require a round trip message pass for
>>>>>> opening a database that’s in the LRU. This is a significant
>>>>>> performance boost for high concurrency access. We also optimized
>>>>>> couch_server internals to not blow up when it’s under load.
>>>>>>
>>>>>> * Introduce a #leaf{} record into the revision trees. This is never
>>>>>> written to disk but makes internal code a lot cleaner when dealing
>>>>>> with multiple versions of rev tree values.
>>>>>>
>>>>>> * Some changes to couch_changes to enable clustered access. Also some
>>>>>> general cleanup
>>>>>>
>>>>>> * Internal changes to how CouchDB is booted in Erlang land. Not very
>>>>>> sexy but this removes a lot of complicated un-Erlangy bits. We still
>>>>>> have a bit of work left here.
>>>>>>
>>>>>> * btree chunk sizes are now configurable which can allow people to
>>>>>> adjust the RAM/speed tradeoffs a bit more.
>>>>>>
>>>>>> * We now load update validation functions on the first write. This is
>>>>>> a cluster-motivated change because the clustered version of this call
>>>>>> is expensive and can lead to race conditions when opening a bunch of
>>>>>> db shards simultaneously. This should be invisible to external
>>>>>> clients.
>>>>>>
>>>>>> * Disabled conflict detection for local docs. They don’t replicate so
>>>>>> there’s no point. This just led to clusters getting stuck and confused
>>>>>> when there were lots of replications happening.
>>>>>>
>>>>>> * Changes to the multipart/mime parsing code. Necessary for clustered
>>>>>> attachment uploads to split the incoming data  stream into N copies.
>>>>>>
>>>>>> * Don’t use init:restart/0 when reloading the ICU driver. I think
>>>>>> this has a bug. But we should rewrite this driver to be a NIF anyway.
>>>>>>
>>>>>> * New couch OS process manager. Significantly faster access to OS
>>>>>> processes under heavy load. This replaces the hard limit with a soft
>>>>>> limit. Processes spawned over the soft limit will be used until they’ve
>>>>>> sat idle for a few minutes and then be closed. We have a todo item to
>>>>>> add the hard ceiling back in (while keeping the soft ceiling).
>>>>>>
>>>>>> * Automatically replace some easily identifiable JS reductions with
>>>>>> their builtin counterparts. Uses a regex to do the detection so it’s
>>>>>> not too smart.
>>>>>>
>>>>>> * Improved view updater write batch.
>>>>>>
>>>>>> * Updates to couchjs’ views.js to improve index update speeds
>>>>>>
>>>>>> * Updates to the _stats builtin reduce to allow reduces to work over
>>>>>> emitted stats objects. Sometimes clients have summary data in a doc,
>>>>>> and this allows them to combine stats if they follow the same pattern
>>>>>> as the builtin expects.
>>>>>>
>>>>>> * Added a config:reload() that is accessible by POST’ing to
>>>>>> _config/_reload. Used by the JS tests to reset the config to what's on
>>>>>> disk. This should prevent those test run failures where a test fails
>>>>>> leaving the config in a bad state causing all subsequent tests to
>>>>>> fail. I think. Maybe.
>>>>>>
>>>>>> * Databases are deleted synchronously in the test suite. We may need
>>>>>> to address this on Windows. But it does seem to reduce the number of
>>>>>> “{error, file_exists}” failures.
>>>>>>
>>>>>> * I reimplemented the JS restartServer() function. There’s a new
>>>>>> _restart/token URL that will give a unique value for each instance of
>>>>>> the Erlang VM. To run a restart we grab the current token value, hit
>>>>>> _restart, then wait till we get a successful response with a different
>>>>>> token. This appears to have made the restart strategy more robust.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Things that need doing
>>>>>>
>>>>>>
>>>>>> IP Clearance -
>>>>>>
>>>>>>
>>>>>> We’ll need to track down if we have the CCLA as well as look at each
>>>>>> source file added to make sure each one is strictly from Cloudant or
>>>>>> has an amenable license. I’m pretty sure that the only one of interest
>>>>>> is trunc_io.erl but we need to be thorough.
>>>>>>
>>>>>> documentation -
>>>>>>
>>>>>>
>>>>>> There shouldn’t be much here since the entire point of this merge was
>>>>>> to not change the visible behavior of single node couch. A few things
>>>>>> to add about the testing endpoints. Maybe an update to the compaction
>>>>>> section to mention the two new file names used.
>>>>>>
>>>>>>
>>>>>> Copyright notices -
>>>>>>
>>>>>>
>>>>>> We need to strip out copyright notices from individual files and make
>>>>>> sure all files have a standard Apache License v2 header.
>>>>>>
>>>>>>
>>>>>> clustered vhosts -
>>>>>>
>>>>>>
>>>>>> We’ve never implemented this at Cloudant. We either need to write a
>>>>>> cluster or go back and tell people to use HAProxy (or similar) for
>>>>>> such things.
>>>>>>
>>>>>>
>>>>>> twig -
>>>>>>
>>>>>>
>>>>>> We need to add another output type to twig that is configurable in
>>>>>> some manner. Right now we spit out entire rsyslog records which isn’t
>>>>>> useful for most people. We’ll need to implement the file writer from
>>>>>> couch_log as well as update the _log HTTP handler to know when it can
>>>>>> and can’t expect to find data on disk.
>>>>>>
>>>>>>
>>>>>> fabric -
>>>>>>
>>>>>>
>>>>>> This is going to need a lot of work. Specifically view access is going
>>>>>> to need to be updated to work with couch_mrview and friends.
>>>>>>
>>>>>>
>>>>>> Boot a dev cluster -
>>>>>>
>>>>>>
>>>>>> Once we fix up the clustering code we’ll need to write instructions
>>>>>> and scripts for pulling up a dev cluster.
>>>>>>
>>>>>>
>>>>>> OTP stuff -
>>>>>>
>>>>>>
>>>>>> We’ve updated each app but we still need to pull some parts out of
>>>>>> couchdb into their own application. Specifically the HTTP layer needs
>>>>>> its own app. We could probably pull out the os process/query_servers
>>>>>> as well as the os daemons and friends. Once done we need to update the
>>>>>> supervision trees so we don’t have things like couch starting and
>>>>>> managing the replication manager process.
>>>>>>
>>>>>>
>>>>>> ddoc_cache -
>>>>>>
>>>>>>
>>>>>> Wire this up in couch_httpd_db to actually be used. Right now it’s only
>>>>>> used in chttpd.
>>>>>>
>>>>>>
>>>>>> couch_file upgrade -
>>>>>>
>>>>>>
>>>>>> The revert to remove the second updater_fd from each #db{} record
>>>>>> means that we’re back in the original position of files appearing to
>>>>>> slow down significantly under load. Since the initial hammer approach
>>>>>> of just adding a second fd we’ve since discovered that the underlying
>>>>>> bug is due to the way that message passing works combined with
>>>>>> Erlang’s file io. Significant, though, is the fact that the fix is
>>>>>> rather simple to implement. A first draft of this work is on an old
>>>>>> branch of mine here:
>>>>>>
>>>>>>
>>>>>>   https://github.com/davisp/couchdb/commit/d856878
>>>>>>
>>>>>>
>>>>>> finish the size calculating changes -
>>>>>>
>>>>>>
>>>>>> The #leaf{} record change is to enable us to add more data size
>>>>>> calculations. CouchDB master calculates a data size that accounts for
>>>>>> all bytes that are active in a .couch file. Cloudant is interested in
>>>>>> the total size of uncompressed docs and attachments minus the internal
>>>>>> overhead of btrees. And there’s a fourth number to calculate based on
>>>>>> the compression level used. Having each of these numbers will be
>>>>>> useful as well as the calculations they’ll enable (i.e., dead bytes in
>>>>>> file, bytes used for overhead, compression ratio achieved, etc).
>>>>>>
>>>>>>
>>>>>> couch_proc_manager -
>>>>>>
>>>>>>
>>>>>> We need to implement the hard ceiling for capping the number of OS
>>>>>> processes. We’ve started seeing a need for this at Cloudant with some
>>>>>> work loads so motivation to fix this is high. The only failing etap is
>>>>>> the assertion of this ceiling.
>>>>>>
>>>>>>
>>>>>> Synchronous db delete on Windows -
>>>>>>
>>>>>>
>>>>>> I did this because running the test suite was driving me bonkers. I
>>>>>> need to ask Dave about how this behaves on Windows (my guess is not
>>>>>> well) but I think we can close things up so that it works better than
>>>>>> the status quo.
>>>>>
>>
>> --
>> Joan Touzet | [email protected] | wohali everywhere else
>
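
The compactor notes quoted above describe collecting document ids in a temp
file during compaction and then streaming them back in sorted order via an
external merge sort (couch_emsort), so the by_id btree is built in order
rather than updated randomly. The following is only a rough Python sketch of
that general technique, not the couch_emsort implementation: the function
names, the run size, and the one-id-per-line temp file format are all
assumptions made for illustration, and it assumes ids contain no newlines.

```python
import heapq
import os
import tempfile


def write_sorted_runs(doc_ids, run_size, tmp_dir):
    """Split ids arriving in arbitrary order into sorted runs on disk."""
    run_paths = []
    buffer = []

    def flush():
        if not buffer:
            return
        buffer.sort()
        fd, path = tempfile.mkstemp(dir=tmp_dir, suffix=".run")
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            for doc_id in buffer:
                f.write(doc_id + "\n")
        run_paths.append(path)
        buffer.clear()

    for doc_id in doc_ids:
        buffer.append(doc_id)
        if len(buffer) >= run_size:
            flush()
    flush()  # write whatever is left as a final, shorter run
    return run_paths


def merge_runs(run_paths):
    """Stream ids back in sorted order with a k-way merge over the runs."""
    files = [open(p, "r", encoding="utf-8") for p in run_paths]
    try:
        streams = [(line.rstrip("\n") for line in f) for f in files]
        yield from heapq.merge(*streams)
    finally:
        for f in files:
            f.close()


def sorted_doc_ids(doc_ids, run_size=1000):
    """Yield doc_ids in sorted order, holding at most run_size ids in memory
    while the runs are being written."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        runs = write_sorted_runs(doc_ids, run_size, tmp_dir)
        yield from merge_runs(runs)
```

Feeding the ids in update order (which is roughly random with respect to id)
and consuming sorted_doc_ids() gives a single in-order pass, which is the
property the notes say matters for an append-only btree.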

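The restartServer() procedure quoted above (read the current token, hit
_restart, wait for a successful response with a different token) can be
sketched roughly as below. This is a hypothetical Python illustration, not
the JS test suite code: the HTTP calls are left to caller-supplied functions
because the message does not give the request methods or response format,
and the function name, timeout, and polling interval are assumptions.

```python
import time


def restart_and_wait(get_token, trigger_restart, timeout=30.0, interval=0.5,
                     sleep=time.sleep, clock=time.monotonic):
    """Restart a server and block until it reports a new instance token.

    get_token:       callable returning the current token (for example by
                     fetching the _restart/token URL); may raise OSError
                     while the server is down.
    trigger_restart: callable that requests the restart (for example by
                     hitting _restart).
    """
    old_token = get_token()
    trigger_restart()
    deadline = clock() + timeout
    while clock() < deadline:
        try:
            new_token = get_token()
        except OSError:
            new_token = None  # server not reachable yet, keep polling
        if new_token is not None and new_token != old_token:
            return new_token
        sleep(interval)
    raise TimeoutError("server did not come back with a new token")
```

Comparing tokens rather than just waiting for any successful response avoids
mistaking a reply from the old VM instance, before it has gone down, for a
completed restart.
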