[jira] [Commented] (COUCHDB-1153) Database and view index compaction daemon
[ https://issues.apache.org/jira/browse/COUCHDB-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085599#comment-13085599 ] Robert Newson commented on COUCHDB-1153:

Could you hold off on this commit until after the srcmv? I'd really prefer to see it added as a separate, optional application, not core. Different environments will need quite different approaches to compaction scheduling. It seems this patch causes a periodic scan of _all_dbs? If so, I don't think that's going to fly in a hosted environment like Cloudant's (or, presumably, IrisCouch).

Database and view index compaction daemon
-----------------------------------------

                Key: COUCHDB-1153
                URL: https://issues.apache.org/jira/browse/COUCHDB-1153
            Project: CouchDB
         Issue Type: New Feature
        Environment: trunk
           Reporter: Filipe Manana
           Assignee: Filipe Manana
           Priority: Minor
             Labels: compaction

I've recently written an Erlang process to automatically compact databases and their views based on some configurable parameters. These parameters can be global or per database and are: minimum database fragmentation, minimum view fragmentation, allowed period, and strict_window (whether an ongoing compaction should be canceled if it doesn't finish within the allowed period). The fragmentation values are based on the recently added data_size parameter in the database and view group information URIs (COUCHDB-1132). I've documented the .ini configuration, as a comment in default.ini, which I paste here:

[compaction_daemon]
; The delay, in seconds, between each check for which databases and view
; indexes need to be compacted.
check_interval = 60
; If a database or view index file is smaller than this value (in bytes),
; compaction will not happen. Very small files always have a very high
; fragmentation, therefore it's not worth compacting them.
min_file_size = 131072

[compactions]
; List of compaction rules for the compaction daemon.
; The daemon compacts databases and their respective view groups when all the
; condition parameters are satisfied. Configuration can be per database or
; global, and it has the following format:
;
; database_name = parameter=value [, parameter=value]*
; _default = parameter=value [, parameter=value]*
;
; Possible parameters:
;
; * db_fragmentation - If the ratio (as an integer percentage) of the amount
;       of old data (and its supporting metadata) over the database file size
;       is equal to or greater than this value, this database compaction
;       condition is satisfied. The value is computed as:
;
;           (file_size - data_size) / file_size * 100
;
;       The data_size and file_size values can be obtained when querying a
;       database's information URI (GET /dbname/).
;
; * view_fragmentation - If the ratio (as an integer percentage) of the amount
;       of old data (and its supporting metadata) over the view index (view
;       group) file size is equal to or greater than this value, this view
;       index compaction condition is satisfied. The value is computed as:
;
;           (file_size - data_size) / file_size * 100
;
;       The data_size and file_size values can be obtained when querying a
;       view group's information URI (GET /dbname/_design/groupname/_info).
;
; * period - The period for which a database (and its view groups) compaction
;       is allowed. This value must obey the following format:
;
;           HH:MM - HH:MM  (HH in [0..23], MM in [0..59])
;
; * strict_window - If a compaction is still running after the end of the
;       allowed period, it will be canceled if this parameter is set to yes.
;       It defaults to no and it's meaningful only if the *period* parameter
;       is also specified.
;
; * parallel_view_compaction - If set to yes, the database and its views are
;       compacted in parallel. This is only useful on certain setups, for
;       example when the database and view index directories point to
;       different disks. It defaults to no.
;
; Before a compaction is triggered, an estimation of how much free disk space
; is needed is computed. This estimation corresponds to 2 times the data size
; of the database or view index. When there's not enough free disk space to
; compact a particular database or view index, a warning message is logged.
;
; Examples:
;
; 1) foo = db_fragmentation = 70%, view_fragmentation = 60%
;    The `foo` database is compacted if its fragmentation is 70% or more.
;    Any view index of this database is compacted only if its fragmentation
;    is 60% or more.
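For concreteness, the fragmentation condition described above boils down to a few lines of Erlang; a minimal sketch (the module and function names are mine, not from the patch):

    -module(frag).
    -export([fragmentation/2, should_compact/4]).

    %% Fragmentation as an integer percentage, per the formula above:
    %% (file_size - data_size) / file_size * 100
    fragmentation(FileSize, DataSize) when FileSize > 0 ->
        round((FileSize - DataSize) / FileSize * 100).

    %% A database or view index qualifies only when it is at least
    %% MinFileSize bytes and at least MinFrag percent fragmented.
    should_compact(FileSize, DataSize, MinFileSize, MinFrag) ->
        FileSize >= MinFileSize andalso
            fragmentation(FileSize, DataSize) >= MinFrag.

For example, a 1,000,000,000-byte database file holding 300,000,000 bytes of live data is 70% fragmented, so it would satisfy the `foo` rule in example 1.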
[jira] [Commented] (COUCHDB-1153) Database and view index compaction daemon
[ https://issues.apache.org/jira/browse/COUCHDB-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085602#comment-13085602 ] Benoit Chesneau commented on COUCHDB-1153:

I'm -1 on this patch. Passing db options in the ini file seems awkward, but I really like the idea of a daemon. We should instead have these options saved when creating a db, via query parameters or headers. It may be the perfect time to transform the _security object into a _meta object used to save such db settings. So we could do:

create a db:      PUT /db?db_fragmentation=
update a setting: PUT /db/_meta

Options could also be passed as a meta document when creating the db, rather than passing an empty body. Note the _meta object could later be used for other purposes by app developers to annotate a db, like some devs already do with the _security object.
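For concreteness, the exchange sketched above might look like this (hypothetical: none of these endpoints or body fields exist today, and the _meta shape is made up for illustration):

    PUT /db?db_fragmentation=70%
        create the db with a per-db compaction setting

    PUT /db/_meta
    {"compaction": {"db_fragmentation": "70%", "view_fragmentation": "60%"}}
        update the db's settings after creation

    GET /db/_meta
        read the meta object; apps could also annotate the db here, as some
        already do with the _security object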
[jira] [Commented] (COUCHDB-1153) Database and view index compaction daemon
[ https://issues.apache.org/jira/browse/COUCHDB-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085603#comment-13085603 ] Benoit Chesneau commented on COUCHDB-1153:

About the _all_dbs scanning: maybe we could have a database maintaining the created dbs, like Cloudant does (or Elasticsearch, for that purpose). Rather than scanning _all_dbs, it could react on _changes?
[jira] [Commented] (COUCHDB-1012) Utility to help plugin developers manage paths
[ https://issues.apache.org/jira/browse/COUCHDB-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085607#comment-13085607 ] Benoit Chesneau commented on COUCHDB-1012:

Most multi-platform programs provide both (pkg-config and their own config script), since pkg-config isn't installed by default on some platforms. Do you think it could be a problem to have both with couchdb too? Also, what is the status of this ticket? What actions should we take to close it in the near future?

Utility to help plugin developers manage paths
----------------------------------------------

                Key: COUCHDB-1012
                URL: https://issues.apache.org/jira/browse/COUCHDB-1012
            Project: CouchDB
         Issue Type: New Feature
         Components: Build System
           Reporter: Randall Leeds
           Assignee: Randall Leeds
            Fix For: 1.2
        Attachments: 0001-add-couch-config-file-used-to-ease-the-build-of-plug.patch, 0001-add-couch-config-file-used-to-ease-the-build-of-plug.patch, 0001-support-pkg-config-for-plugins-COUCHDB-1012.patch

Developers may want to write plugins (like GeoCouch) for CouchDB. Many hooks in the configuration system allow loading arbitrary Erlang modules to handle various internal tasks, but currently there is no straightforward and portable way for developers of these plugins to discover the location of the CouchDB library files. Two options that have been proposed are to use pkg-config, or to install a separate script that could be invoked (e.g. as couch-config --erl-libs) to discover important CouchDB installation paths. As far as I know, the loudest argument against pkg-config is its lack of support for Windows.

--
This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
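For concreteness, a plugin author's build might use the proposed script roughly like this (hypothetical: couch-config and its --erl-libs flag are only proposed in this ticket, and the command line is made up):

    # discover CouchDB's Erlang library path and compile a plugin against it
    ERL_LIBS=$(couch-config --erl-libs) erlc -o ebin src/*.erl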
Re: [jira] [Commented] (COUCHDB-1153) Database and view index compaction daemon
On Tue, Aug 16, 2011 at 11:30 AM, Filipe Manana (JIRA) j...@apache.org wrote:

[ https://issues.apache.org/jira/browse/COUCHDB-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085605#comment-13085605 ] Filipe Manana commented on COUCHDB-1153: I'm -1 on adding such a _meta thing.

why?

I don't understand either that idea of _changes nor how it can be applied.

Creating a db adds a db document to a dbs db; updating the db updates its db document.
[jira] [Commented] (COUCHDB-1153) Database and view index compaction daemon
[ https://issues.apache.org/jira/browse/COUCHDB-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085613#comment-13085613 ] Benoit Chesneau commented on COUCHDB-1153:

Why not? I'm -1 on a -1 without any arguments. And the _security object is already used for such purposes around. Annotating dbs is also something people want. Creating a db creates a db document; an update updates it. Simple enough. It can be used by people who want to have a db listener, for any purpose. (It also solves an old ticket.)
Re: [jira] [Commented] (COUCHDB-1153) Database and view index compaction daemon
On Tue, Aug 16, 2011 at 11:46 AM, Filipe David Manana fdman...@apache.org wrote:
On Tue, Aug 16, 2011 at 2:38 AM, Benoit Chesneau bchesn...@gmail.com wrote:
On Tue, Aug 16, 2011 at 11:30 AM, Filipe Manana (JIRA) j...@apache.org wrote:

[ https://issues.apache.org/jira/browse/COUCHDB-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085605#comment-13085605 ] Filipe Manana commented on COUCHDB-1153: I'm -1 on adding such a _meta thing.

why?

From your description, that _meta sounds like something that can be done with _local docs. But that is a whole separate discussion and topic, I think.

It could be done with _local docs, but why didn't we take this path for the _security object? Also, since this really is meta information, I have the feeling it should be solved as a special member in the db file, just like the _security object. Anyway, what I really dislike is saving per-db configuration in an ini file. Per-db configuration should be done on the db. What if you have more than 100 dbs? Having 100 lines to parse in an ini file is awkward. Meta information (like security, db params, ...) should be saved in the db file and be available at the same time. Since we already have this _security object that is available when you open a db, why not reuse it?

I don't understand either that idea of _changes nor how it can be applied. creating db, adding db document to dbs db., update - update db document.

You'll have to elaborate a lot more than that :) I'm not familiar with that bigcouch special db nor elasticsearch. Reacting to a changes feed of some database is not something easy (the _replicator db is such a case, and might have been the hardest thing I ever did for couch, really).

This is just as simple as that line: creating a db creates an entry in a db index (or db file) that you can use later.

I suspect what you mean is, rather than scanning periodically, to let the daemon be notified when a db (or view) can be compacted? At some point I considered reacting to db_updated events, but this was pretty much flooding the event handler (daemon). Was this your idea?

Using db events is my idea, yes. If it actually floods the db event handler (not sure why), then maybe we should fix that first?

- benoit
Remove 1.0.2 release from Apache Mirrors
Hi, in the spirit of keeping things clean I'd like to remove the 1.0.2 release from the mirrors and put it into the archive. If nobody objects, I'll do this tomorrow. Cheers Jan --
compaction plugin, auth handler, foo plugin & couchdb core [resent]
Hi devs,

Today I see a lot of interesting things coming in CouchDB, but also a lot of different interests and different usages. Sometimes you need to extend couch for your usage. But today, if you except the current work on the view engine by Paul, the couchdb code becomes more and more monolithic, or an aggregation of code adding some specific features/changes, while not envisioning what could be done by others. Also, the way you have to extend couchdb makes it difficult today to use/merge/... the different forks around, like the ones done by cloudant, couchbase, and even mine in refuge/upondata (probably some others too). Couch core should be lighter and more open (in its strict sense). For example, today the http layer(?), replicator(?), proxy, external daemons, couchapp engine, rewriter, vhosts, compaction daemon, and some auth handlers could be available as plugins. couch_config could be more generic and not rely on an ini file.

More specifically, we could have a couch core looking more like an mnesia alternative, and the couchdb application, which could be couch core + plugins, distributed as a standalone app (like couchdb is today). This would also maybe allow cloudant, couchbase and others to reuse the same core rather than forking it while adding their own plugins. Official plugins could also be maintained as standalone projects, maybe.

I wish we could concentrate on that topic for 2.0.x and make it a priority. That would imply defining what the couch core is, splitting the code [1], and defining what a plugin is [2]. Maybe the couchdb app can also be a full erlang release [3] built with autotools. I think that this pluggable structure should be done, for example, before adding any new daemon like the compaction daemon. Don't get me wrong, I really like the idea of having a default compaction daemon in the couchdb app, and this is just an example. But I also want the possibility to add my own, working differently (or not), and this should be doable with the default couchdb release; couch core imo should be more neutral.

Maybe we could start by opening tickets about the different tasks to track them? What is blocking the split currently, since 1.0.3 is out? Do we wait for the svn-to-git conversion?

- benoît

[1] https://github.com/davisp/couchdb-srcmv
[2] https://issues.apache.org/jira/browse/COUCHDB-1012
[3] http://www.erlang.org/doc/design_principles/release_structure.html
Re: compaction plugin, auth handler, foo plugin & couchdb core [resent]
+1 on splitting into more focused and OTP compliant applications. Separating core from httpd in particular.

Sent from my iPhone
Re: compaction plugin, auth handler, foo plugin & couchdb core [resent]
Hi Benoit,

thanks for raising this again. I think we have a good plan to get started, but it wouldn't hurt to get a little more organised. I think the plan is as follows:

1. Move to git; this makes all the subsequent steps easier.
2. srcmv, reorganising the source code so we are prepared to do all the things you mention and all the other things we talked about in the past :)
3. Profit.

As for my wish list, all of this post the git move: we could release 1.2 based off of current trunk plus a few of the more useful JIRA patches that we haven't committed yet. After 1.2.x is branched, srcmv trunk, start the internal refactoring and pluginnifying, and release 1.3 off that. At some point, merging patches between before and after srcmv is going to be a pain, so I'd like to keep that time as short as possible and thus keep the differences between 1.2 and 1.3 (given that these are the border cases) as small as possible.

Cheers
Jan
--
Re: Bringing automatic compaction into trunk
Good points Robert, I replied inline and then hijacked the thread for a more general discussion, sorry about that :)

On Aug 16, 2011, at 2:08 PM, Robert Dionne wrote:

Filipe, this is neat, I can definitely see the utility of the approach. I do share the concerns expressed in other comments with respect to the use of the config file for per-db compaction specs, and the use of a compact_loop that waits on config change messages when the ets table is empty. I don't think it fully takes into account the use case of large numbers of small dbs and/or some very large dbs interspersed with a lot of mid-size dbs.

As I said in the ticket, per-db config is desirable, but I think outside of the scope of the ticket.

Anyway, I like it a lot, though I've only read the code for half an hour or so. I also agree with others that the code base is reaching a point of being a bit crufty, and it might be a good time, with the git migration etc., to take a breath and commit to making some of these OTP compliant changes and design changes we've talked about. Just curious, would it make a big difference to commit the patch before srcmv and migrate it with the rest of the code base, rather than letting it rot in JIRA and leaving it all to Filipe to keep updated?

I also fear that a srcmv'd release is still out a bit, and I'd really like to see this one (and a few others) go into 1.2 (as per my previous mail to this list in another thread). While it isn't the absolute perfect solution in all cases, it is disabled by default, and manual compaction strategies work as they did before. In the meantime, we can refine the rest of the system to make it more fully fledged and maybe even enable it by default a few versions down, when we are all comfortable with it. I'm not very comfortable keeping good patches in JIRA and not trunk until they solve every little edge case. We haven't worked like this in the past and I don't think it is worth doing.

Cheers
Jan
--

Regards,
Bob

On Aug 15, 2011, at 9:29 PM, Filipe David Manana wrote:

Developers, users, it's been a while now since I opened a Jira ticket for it ( https://issues.apache.org/jira/browse/COUCHDB-1153 ). I won't describe it here in detail since that's already done in the Jira ticket. Unless there are objections, I would like to get this moving soon. Thanks

--
Filipe David Manana, fdman...@gmail.com, fdman...@apache.org

Reasonable men adapt themselves to the world. Unreasonable men adapt the world to themselves. That's why all progress depends on unreasonable men.
Re: Bringing automatic compaction into trunk
I'm -1 on the approach (as I understand it) taken by the scheduler, as it will be problematic in precisely the circumstance where you'd most want auto compaction (large numbers of databases and views).

To this point:

Just curious, would it make a big difference to commit the patch before srcmv and migrate it with the rest of the code base rather than letting it rot in JIRA and leave it all to Filipe to keep it updated.

-- I'm -∞ on any suggestion that code should be put in trunk to stop it from rotting. Code should land when it's ready. I hope we're all agreed on that and that this paragraph was redundant.

After srcmv, and then some work to OTP-ify each of the resultant subdirs, we should add this as a separate application. We might also mark it as beta in the first release to gather feedback from the community.

I'll be accused of 'stop energy' within nanoseconds of this post, so I should end by saying I'm +1 in principle on couchdb gaining the ability to automatically compact its databases and views.

B.
Re: Bringing automatic compaction into trunk
On Aug 16, 2011, at 2:59 PM, Robert Newson wrote:

I'm -1 on the approach (as I understand it) taken by the scheduler as it will be problematic in precisely the circumstance when you'd most want auto compaction (large numbers of databases and views).

As Filipe mentions in the ticket, this was tested with large numbers of databases. In addition, your "most want" assumption doesn't hold for the average user, I'd wager (no numbers, alas). I'd say it's a basic user-experience plus that software doesn't start wasting a system resource without cleaning up after itself. But this isn't even suggesting to enable this by default. We have plenty of other features that need proper documentation to be used correctly, and that we are improving over time to make them more obvious by removing common errors or odd behaviour.

To this point: "Just curious, would it make a big difference to commit the patch before srcmv and migrate it with the rest of the code base rather than letting it rot in JIRA and leave it all to Filipe to keep it updated." -- I'm -∞ on any suggestion that code should be put in trunk to stop it from rotting. Code should land when it's ready. I hope we're all agreed on that and that this paragraph was redundant.

I was suggesting that the patch is ready enough for trunk, and that the level of readiness should not be "solves all possible cases", especially for something that is disabled by default. If we take this to the extreme, we'd never add any new features. I'm not suggesting "it compiles for me, let's throw it into trunk".

After srcmv, and then some work to OTP-ify each of the resultant subdirs, we should add this as a separate application. We might also mark it as beta in the first release to gather feedback from the community.

I don't see how that is any different from adding it before srcmv and avoiding leaving the front-porting effort to a single person. Ideally we'd already have srcmv done, but we don't, and I don't want to hold off progress for an architecture change.

I'll be accused of 'stop energy' within nanoseconds of this post so I should end by saying I'm +1 on couchdb gaining the ability to automatically compact its databases and views in principle.

:)

Cheers
Jan
--
Re: Bringing automatic compaction into trunk
All good points Jan, thanks.

Having large numbers of databases is one thing, but I'm focused on the impact on ongoing operations with this running in the background. What does it do to the user's experience to have all dbs scanned periodically, etc.?

The reason I suggest doing it after the move, and in its own app, is to reduce the work needed to not use this code in some circumstances (Cloudant hosting, for example). Yes, it's a separate module and disabled by default, but putting it in its own application will make the separation much more explicit and preclude unintended entanglements with core over time.

B.
Re: Bringing automatic compaction into trunk
On Aug 16, 2011, at 3:44 PM, Robert Newson wrote:

The reason I suggest doing it after the move, and in its own app, is to reduce the work needed to not use this code in some circumstances (Cloudant hosting, for example). Yes, it's a separate module and disabled by default, but putting it in its own application will make the separation much more explicit and preclude unintended entanglements with core over time.

I think this is a valid concern, but I don't think it outweighs the disadvantage. I'm happy to spend time to make sure this is properly modular after srcmv.

Cheers
Jan
--
Re: Bringing automatic compaction into trunk
Ok, let's see Paul's code concerns addressed first; it needs that cleanup before it can hit trunk.

I'd still prefer to see an event-driven rather than polling approach, e.g., hook into update_notifier and build a queue of databases that are actively being written to (and therefore growing). A much lazier background thing could compact databases that are inactive.

B.
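For illustration, a minimal sketch of that event-driven shape, assuming the couch_db_update_notifier API on 1.x trunk (start_link/1 taking a fun that is called with {Event, DbName} tuples); the module name, table, and event handling are made up:

    -module(active_dbs).
    -export([start/0]).

    %% Record which databases are actively being written to, instead of
    %% polling _all_dbs; a lazy background pass could later pick compaction
    %% candidates out of this table and leave inactive databases alone.
    start() ->
        ets:new(active_dbs, [set, public, named_table]),
        {ok, _Pid} = couch_db_update_notifier:start_link(
            fun({updated, DbName}) -> ets:insert(active_dbs, {DbName});
               ({deleted, DbName}) -> ets:delete(active_dbs, DbName);
               (_Other) -> ok
            end),
        ok.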
Re: Bringing automatic compaction into trunk
On Aug 16, 2011, at 4:00 PM, Robert Newson wrote: Ok, let's see Pauls' code concerns addressed first, it needs that cleanup before it can hit trunk. I'd still prefer to see an event-driven rather than polling approach, e.g, hook into update_notifier and build a queue of databases that are actively being written to (and therefore growing). A much lazier background thing could compact databases that are inactive. Jup, my discussion was barring that all that is sorted out as an implementation detail. Back to JIRA. Cheers Jan -- B. On 16 August 2011 14:48, Jan Lehnardt j...@apache.org wrote: On Aug 16, 2011, at 3:44 PM, Robert Newson wrote: All good points Jan, thanks. Having large numbers of databases is one thing, but I'm focused on the impact on ongoing operations with this running in the background. What does it do to the users experience to have all dbs scanned periodically, etc? The reason I suggest doing it after the move, and in its own app, is to reduce the work needed to not use this code in some circumstances (Cloudant hosting, for example). Yes, it's a separate module and disabled by default, but putting it in its own application will make the separation much more explicit and preclude unintended entanglements with core over time. I think this is a valid concern, but I don't think it outweighs the disadvantage. I'm happy to spend time to make sure this is properly modular after srcmv. Cheers Jan -- B. On 16 August 2011 14:31, Jan Lehnardt j...@apache.org wrote: On Aug 16, 2011, at 2:59 PM, Robert Newson wrote: I'm -1 on the approach (as I understand it) taken by the scheduler as it will be problematic in precisely the circumstance when you'd most want auto compaction (large numbers of databases and views). As Filipe mentions in the ticket, this was tested with large numbers of databases. In addition, your most want assumption doesn't hold for the average user, I'd wager (no numbers, alas). I'd say it's a basic user-experience plus that a software doesn't start wasting a system resource without cleaning up after itself. But this isn't even suggesting to enable this by default. We have plenty of other features that need proper documentation to be used correctly and that we are improving over time to make them more obvious by removing common errors or odd behaviour. To this point Just curious, would it make a big difference to commit the patch before srcmv and migrate it with the rest of the code base rather than letting it rot in JIRA and leave it all to Filipe to keep it updated. -- I'm -∞ on any suggestion that code should be put in trunk to stop it from rotting. Code should land when it's ready. I hope we're all agreed on that and that this paragraph was redundant. I was suggesting that the the patch is ready enough for trunk and that the level of readiness should not be solves all possible cases. Especially for something that is disabled by default. If we take this to the extreme, we'd never add any new features. I'm not suggesting it compiles for me, lets throw it into trunk. After srcmv, and then some work to OTP-ify each of the resultant subdirs, we should add this as a separate application. We might also mark it as beta in the first release to gather feedback from the community. I don't see how that is any different from adding it before srcmv and avoiding leaving the front-porting effort to a single person. Ideally we'd already have srcmv done, but we don't and I don't want to hold off progress for an architecture change. 
I'll be accused of 'stop energy' within nanoseconds of this post so I should end by saying I'm +1 on couchdb gaining the ability to automatically compact its databases and views in principle. :) Cheers Jan --
The replicator needs a superuser mode
One of the principal uses of the replicator is to "make this database look like that one." We're unable to do that in the general case today because of the combination of validation functions and out-of-order document transfers. It's entirely possible for a document to be saved in the source DB prior to the installation of a ddoc containing a validation function that would have rejected the document, for the replicator to install the ddoc in the target DB before replicating the other document, and for the other document to then be rejected by the target DB. I propose we add a role which allows a user to bypass validation, or else extend that privilege to the _admin role. We should still validate updates by default and add a way (a new qs param, for instance) to indicate that validation should be skipped for a particular update. Thoughts? Adam
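For concreteness, a sketch of the kind of validation function that triggers the race Adam describes. The function and its created_at field are invented for illustration; in a design document this function is stored as the string value of validate_doc_update:

// Hypothetical validation function. Any document written to the source
// before this function was installed may lack created_at; if the
// replicator copies the ddoc to the target first, those older documents
// are then rejected on arrival and the target never converges.
function validate(newDoc, oldDoc, userCtx) {
  if (!newDoc._deleted && !newDoc.created_at) {
    throw({forbidden: 'created_at is required'});
  }
}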
Re: The replicator needs a superuser mode
+1 on the intention but we'll need to be careful. The use case is specifically to allow verbatim migration of databases between servers. A separate role makes sense as I'm not sure of the consequences of explicitly granting this ability to the existing _admin role. B.
Re: The replicator needs a superuser mode
This is only slightly related, but I'm dreaming of /db/_dump and /db/_restore endpoints (the names don't matter, could be one with GET / PUT) that just ship verbatim .couch files over HTTP. It would be for admins only, it would not be incremental (although we might be able to add that), and I haven't yet thought through all the concurrency and error case implications. The above solves more than the proposed problem, and in a very different way, but I thought I'd throw it in the mix. Cheers Jan --
Re: The replicator needs a superuser mode
+1 on /db/_dump and /db/_restore endpoints!! Very beneficial to us little people trying to make installers like couchapp-takeout, and they could even be used from futon to create a database from a remote db. I am anecdotally noticing that using replication to create a local database from a remote one with lots of attachments takes a long time, is prone to timeouts, and gets stuck (been working with jhs on this). Dump/restore will also be much faster, eliminating the small requests.
Re: compaction plugin, auth handler, foo plugin couchdb core [resent]
On Tue, Aug 16, 2011 at 6:45 AM, Jan Lehnardt j...@apache.org wrote: Hi Benoit, thanks for raising this again. I think we have a good plan to get started but it wouldn't hurt to get a little more organised. I think the plan is as follows: 1. Move to git, this makes all the subsequent steps easier. 2. srcmv, reorganising the source code so we are prepared to do all the things you mention and all the other things we talked about in the past :) 3. Profit. -- As for my wish list, all this post the git move: We could release 1.2 based off of current trunk + a few of the more useful JIRA patches that we haven't committed yet. After 1.2.x is branched, srcmv trunk, start the internal refactoring and pluginnifying, and release 1.3 off that. At some point, merging patches between the before- and after-srcmv trees is going to be a pain, so I'd like to keep that time as short as possible and thus keep the differences between 1.2 and 1.3 (given that these are the border cases) as small as possible. Cheers Jan -- Early morning pre-caffeine but this sounds like a pretty good idea to my addled brain. On Aug 16, 2011, at 1:20 PM, Benoit Chesneau wrote: Hi devs, Today I see a lot of interesting things coming in CouchDB, but also a lot of different interests and different usages. Sometimes you need to extend couch for your usage. But today, if you except the current work on the view engine by Paul, the couchdb code becomes more and more monolithic, or an aggregation of code adding some specific features/changes while not envisioning what could be done by others. Also, the way you have to extend couchdb makes it difficult today to use/merge/... the different forks around, like the ones done by cloudant, couchbase and even mine in refuge/upondata (probably some others too). Couch core should be lighter and more open (in its strict sense). For example today, the http layer(?), replicator(?), proxy, external daemons, couchapp engine, rewriter, vhosts, compaction daemon, and some auth handlers could be available as plugins. couch_config could be more generic and not rely on an ini file. More specifically, we could have a couch core looking more like a mnesia alternative, and the couchdb application, which could be couch core + plugins, distributed as a standalone app (like couchdb is actually). This would also maybe allow cloudant, couchbase and others to reuse the same core rather than forking it while adding their own plugins. Official plugins could also be maintained as standalone projects maybe. I wish we could concentrate on that topic for 2.0.x and make it a priority. That would imply defining what the couch core is, splitting the code [1], and deciding what a plugin is [2]. Maybe the couchdb app can also be a full erlang release [3] built with autotools. I think that this pluggable structure should be in place before adding any new daemon like the compaction daemon. Don't get me wrong, I really like the idea of having a default compaction daemon in the couchdb app, and this is just an example. But I also want the possibility to add mine working differently (or not), and this should be possible with the default couchdb release; couch core imo should be more neutral. Maybe we could start by opening tickets about the different tasks to track them? What is blocking the split currently, since 1.0.3 is out? Do we wait for the svn-to-git conversion? - benoît [1] https://github.com/davisp/couchdb-srcmv [2] https://issues.apache.org/jira/browse/COUCHDB-1012 [3] http://www.erlang.org/doc/design_principles/release_structure.html
Re: The replicator needs a superuser mode
Me and Adam were just mulling over a similar endpoint the other night that could be used to generate plain-text backups similar to what couchdb-dump and couchdb-load were doing. With the idea that there would be some special sauce to pipe from one _dump endpoint directly into a different _load handler. The obvious downfall was the incremental-ness of this. Seems like it'd be doable, but I'm not entirely certain of the best method. I was also considering this as our fool-proof, 100% reliable method for migrating data between different CouchDB versions, which we seem to screw up fairly regularly. +1 on the idea. Not sure about raw couch files as it limits the wider usefulness (and we already have scp).
Re: Remove 1.0.2 release from Apache Mirrors
Not only should no one object, but infrastructure would object to people objecting. :D Thanks for helping with and cleaning up this release. I'll try and not be moving across country when I do the next one. On Tue, Aug 16, 2011 at 5:35 AM, Jan Lehnardt j...@apache.org wrote: Hi, in the spirit of keeping things clean I'd like to remove the 1.0.2 release from the mirrors and put it into the archive. If nobody objects, I'll do this tomorrow. Cheers Jan --
Re: The replicator needs a superuser mode
We've already got replication, _all_docs and some really robust on-disk consistency properties. For shuttling raw database files between servers, wouldn't rsync be more efficient (and fit better within existing sysadmin security/deployment structures)? -nvw
Re: The replicator needs a superuser mode
Both rsync and scp won't allow me to do curl http://couch/db/_dump | curl http://couch/db/_restore. I acknowledge that similar solutions exist, but using the http transport allows for more fun things down the road. See what we are doing with _changes today, where DbUpdateNotifications nearly do the same thing. Cheers Jan --
Re: The replicator needs a superuser mode
Wow, this thread got hijacked a bit :) Anyone object to the special role that has the skip validation superpower? Adam
Re: The replicator needs a superuser mode
No objection, just the question of why the need for a new role, why not use admin?
Re: The replicator needs a superuser mode
no objection to special role. As in my opening statement, would be concerned about adding it to _admin without devoting more thought to possible unintended consequences. b.
Re: The replicator needs a superuser mode
Hmm, if we used a separate role we'd need a multi-step process to trigger the replication: 1) create the database, 2) have an admin grant the _skip_validation role on that DB to the replicator's user_ctx, 3) trigger the replication. Kind of annoying. Certainly would be simpler to allow _admins to do this just by adding a skip_validation=true parameter to write requests. Adam
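A minimal sketch of what that opt-in might look like on the wire, assuming Node.js on the client side; skip_validation=true is the parameter proposed in this thread, not a shipped CouchDB feature:

// Sketch only: skip_validation is the *proposed* query parameter.
// The write would still require _admin credentials.
var http = require('http');

var doc = JSON.stringify({migrated: true});
var req = http.request({
  host: 'localhost', port: 5984, method: 'PUT',
  path: '/target/migrated-doc?skip_validation=true',
  headers: {'Content-Type': 'application/json'}
}, function (res) {
  console.log('status', res.statusCode);
});
req.end(doc);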
Re: The replicator needs a superuser mode
I understand the issue brought up by Adam, since in our CouchDb application there is a need to have a replicator role, and the validation functions skip most of the tests if the role is set for the current user. On the other hand, at the current time, I am not in favour of making super users for the sake of replication. Although it might solve the particular problem stated, it removes the ability of a design document to enforce some invariant properties of a database. Since there is already a way to allow a replicator to perform any changes (role + proper validation function), I do not see the need for this change. Since the super-replicator user removes the ability of a database to protect the consistency of its data, and there does not seem to be a work-around, I would rather not see this change pushed to CouchDb. JP
Re: Configuration Load Order
On 16 Aug 2011, at 02:20, Jason Smith wrote: Is it possible to deprecate the .ini files as a configuration tool? In other words, tell the world: Configure CouchDB over HTTP via the /_config URLs, probably via Futon. I think this proposal reaches too far. Having the configuration in the ini files is good for a number of reasons. It allows you to configure CouchDB without actually running CouchDB. Following on from that, it allows you to rescue a CouchDB instance that is misbehaving, even if you are unable to access CouchDB. It lets sysadmins version the files and perform audits, as well as allowing them to easily integrate CouchDB within automatic deployment and configuration systems. If there are certain types of things that people regularly want to do via CouchDB itself, such as URL handlers or users, then I see no reason why this stuff shouldn't be moved to CouchDB itself. But this type of thing should probably be handled on a case-by-case basis. Anything which relates to the CouchDB server in a more general sense should stay in the system configuration files.
Re: The replicator needs a superuser mode
On Tue, Aug 16, 2011 at 1:10 PM, Adam Kocoloski kocol...@apache.org wrote: Wow, this thread got hijacked a bit :) You must be new here.
Re: Configuration Load Order
On 16 Aug 2011, at 10:33, Benoit Chesneau wrote: Imo we shouldn't provide plaintext passwords at all. Maybe a safer option would be to let the admin create the first one via http, or put the hash in a password.ini file manually. If we are kind enough we could also provide a couchctl script allowing user management, config changes ...? This sounds like a decent proposal. Much like you have to use htpasswd to generate passwords for Apache httpd, we could bundle a script that lets you generate passwords for the CouchDB ini files, and then forbid the use of plaintext. This solves both the technical problem (I think?) and helps us reinforce better security practices across the board.
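A sketch of what such a bundled script could do, assuming Node.js and the CouchDB 1.x admin scheme of "-hashed-" + sha1(password + salt) + "," + salt; verify the exact format against your CouchDB version before relying on it:

// Hypothetical htpasswd-style helper for CouchDB ini files.
var crypto = require('crypto');

function hashAdminPassword(password) {
  var salt = crypto.randomBytes(16).toString('hex');
  var hash = crypto.createHash('sha1')
                   .update(password + salt)
                   .digest('hex');
  return '-hashed-' + hash + ',' + salt;
}

// Paste the output into the [admins] section of local.ini, e.g.:
// admin = -hashed-<sha1>,<salt>
console.log(hashAdminPassword(process.argv[2]));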
Re: Configuration Load Order
Agreed. Cheers Jan --
Re: The replicator needs a superuser mode
Hi Jean-Pierre, I'm not quite sure I follow that line of reasoning. A user with _admin privileges on the database can easily remove any validation functions prior to writing today. In my proposal skipping validation would require _admin rights and an explicit opt-in on a per-request basis. What are you trying to guard against with those validation functions? Best, Adam
Re: Bringing automatic compaction into trunk
Filipe is addressing Paul's concerns. As far as scanning vs. an evented architecture, I'd prefer to see Filipe's working code in place, and later replaced with a better alternative. We need to push the project forward; we value useful correct code first. It's easier to improve on it once it's in place. Also, I have no objections to a more modular architecture, I very much welcome it. But that work can happen concurrently with pushing forward the code and adding features the user community cares about. -Damien
Re: [jira] [Commented] (COUCHDB-1153) Database and view index compaction daemon
On Tue, Aug 16, 2011 at 2:58 AM, Benoit Chesneau bchesn...@gmail.com wrote: Could be a local doc. But why didn't we take this path for the _security object? Also, since this is really meta information, I have the feeling it should be solved as a special member in the db file, just like the _security object. I don't know why _security is like it is now, that predates me, and it's another topic :) Anyway, what I really dislike is saving per-db configuration in an ini file. Per-db configuration should be done on the db. What if you have more than 100 dbs? Having 100 lines in an ini file to parse is awkward. I don't think the common case is to have a separate compaction config for every single database. The fragmentation parameter, which is likely the most useful, is not one you'd set to a different value for 100 databases (nor the period, for example). For other things like the oauth tokens/secrets, the .ini system doesn't scale. But that's again another topic. This is just as simple as this: creating a db creates an entry in a db index (or db file) that you can use later. I suspect what you mean is, rather than scanning periodically, to let the daemon be notified when a db (or view) can be compacted? At some point I considered reacting to db_updated events, but this was pretty much flooding the event handler (daemon). Was this your idea? Using db events is my idea, yes. If it actually floods the db event handler (not sure why), then maybe we should fix that first? The problem is when you have many dbs in the system under a reasonable write load: the daemon (which is the receiver of db_updated events) receives too many messages. To know if you need to compact the db after such a message, you need to open it, and opening it on every message is a big burden as well. I tried this on a system with 1024 databases being updated constantly. It also doesn't deal with the startup case: if a db with high fragmentation is not updated for a long period, its compaction is never started. If someone can measure the current solution's impact and present another working alternative with a lower impact (and practical tests, not just theory), I would be the first one wanting to make the change asap. - benoit -- Filipe David Manana, fdman...@gmail.com, fdman...@apache.org Reasonable men adapt themselves to the world. Unreasonable men adapt the world to themselves. That's why all progress depends on unreasonable men.
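For illustration only (in JavaScript; the daemon itself is Erlang), a sketch of the middle ground being debated: coalesce db_updated events into a set of dirty databases and test fragmentation on a timer, so a busy db costs one open per sweep instead of one per event. It still does not cover Filipe's startup case of a fragmented db that receives no updates; maybeCompact and the 60-second sweep are assumptions:

// Coalescing sketch: event handling is O(1); db opens happen once per sweep.
var dirty = {};

function onDbUpdated(dbName) {
  dirty[dbName] = true; // no db open here
}

function maybeCompact(dbName) {
  // open the db once, compare data_size to file_size, and trigger
  // compaction if the fragmentation threshold is exceeded (elided)
}

setInterval(function () {
  Object.keys(dirty).forEach(function (dbName) {
    delete dirty[dbName];
    maybeCompact(dbName);
  });
}, 60 * 1000);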
Re: The replicator needs a superuser mode
-1 on _skip_validation and new role. One can always write a validation document that considers the role, no? Why can't users who need this functionality craft a validation function for this purpose? This sounds like a blog post and not a database feature. +0 on _dump/_load. If it ships raw .couch files I'm totally against it, because I think the HTTP API should remain as independent of implementation details as possible. If it is non-incremental I don't see significant benefit, unless it's just to traverse the document index and ignore the sequence index as a way to skip reads, but this seems like a weak argument. If it's incremental, well, then, that's replication, and we already have that. -Randall
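A sketch of the validation-function approach Randall describes, with a hypothetical "replicator" role that would be granted through the target database's _security object; the role name and the created_at rule are illustrative:

// The function itself exempts the trusted role; everything else is
// validated as usual.
function validate(newDoc, oldDoc, userCtx) {
  if (userCtx.roles.indexOf('replicator') !== -1) {
    return; // trusted writer: accept the update verbatim
  }
  if (!newDoc._deleted && !newDoc.created_at) {
    throw({forbidden: 'created_at is required'});
  }
}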
[jira] [Created] (COUCHDB-1251) Factor out couch core and hook other components through a module system
Factor out couch core and hook other components through a module system Key: COUCHDB-1251 URL: https://issues.apache.org/jira/browse/COUCHDB-1251 Project: CouchDB Issue Type: Umbrella Components: Build System, Database Core Reporter: Randall Leeds Fix For: 2.0 https://mail-archives.apache.org/mod_mbox/couchdb-dev/201108.mbox/browser
Re: compaction plugin, auth handler, foo plugin couchdb core [resent]
As an experiment in JIRA usage I created an umbrella task for this. Please place tickets under this umbrella and we can start to break down the sub-tasks we need to actually get this work done. https://issues.apache.org/jira/browse/COUCHDB-1251 I set the due date as the 21st of December. Holiday season. This should give us enough time to get 1.2 out the door and make some real progress on these goals. Again, this is an experiment. Sorry for those of you who hate process, but I thought maybe injecting a bit of it here would stop the flow of e-mails and focus us all collectively. -Randall
Re: Configuration Load Order
Agreed also. We still have a question about load and save order. One idea would be to track the .ini file from whence an option came. If an option comes from a local.ini or local.d/ file it could be updated in place. If it comes from a default.ini or default.d/ file, updates should be placed in local.ini. This would make the most sense to me. I would also be in favor of enforcing a load order that supports a directory structure like local.d/010-stuff.ini, local.d/020-others.ini. We don't need to ship anything like that by default. I think right now we take the load directories on the command line, no? It'd be nice if the order of resolution within those directories was well specified. -Randall
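A sketch of the save-order rule under Randall's proposal, assuming each option remembers which file it came from (names are illustrative):

// Writes go back to a local file in place; options that came from
// default.ini or default.d/ are instead overridden in local.ini.
function saveTarget(option) {
  return /(^|\/)local(\.ini$|\.d\/)/.test(option.sourceFile)
    ? option.sourceFile
    : 'local.ini';
}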
Re: Configuration Load Order
Nice idea to have a separate htpasswd(-like) file. Passwords are special, let's treat them accordingly. B.
Re: compaction plugin, auth handler, foo plugin couchdb core [resent]
Sounds like a good idea Randall. Thanks for it. -- Filipe David Manana, fdman...@gmail.com, fdman...@apache.org
Re: The replicator needs a superuser mode
On Tue, Aug 16, 2011 at 4:46 PM, Randall Leeds randall.le...@gmail.com wrote: -1 on _skip_validation and new role One can always write a validation document that considers the role, no? Why can't users who need this functionality craft a validation function for this purpose? This sounds like a blog post and not a database feature. +0 on _dump/_load If it ships raw .couch files I'm totally against it because I think the HTTP API should remain as independent of implementation details as possible. If it is non-incremental I don't see significant benefit, unless it's just to traverse the document index and ignore the sequence index as a way to skip reads, but this seems like a weak argument. If it's incremental, well, then, that's replication, and we already have that. Think of plain text backups and last resort upgrade paths. Also, it wouldn't have validation docs run on it or anything of that nature. I'm thinking basically of having a multipart/mime stream representation of the database that follows the update sequence. And the _dump would allow for a ?since= parameter that would make it incremental. This would even give people the ability to do daily logs and so on. -Randall On Tue, Aug 16, 2011 at 11:40, Adam Kocoloski kocol...@apache.org wrote: Hi Jean-Pierre, I'm not quite sure I follow that line of reasoning. A user with _admin privileges on the database can easily remove any validation functions prior to writing today. In my proposal skipping validation would require _admin rights and an explicit opt-in on a per-request basis. What are you trying to guard against with those validation functions? Best, Adam On Aug 16, 2011, at 2:29 PM, Jean-Pierre Fiset wrote: I understand the issue brought by Adam since in our CouchDb application, there is a need to have a replicator role and the validation functions skip most of the tests if the role is set for the current user. On the other hand, at the current time, I am not in favour of making super users for the sake of replication. Although it might solve the particular problem stated, it removes the ability for a design document to enforce some invariant properties of a database. Since there is already a way to allow a replicator to perform any changes (role + proper validation function), I do not see the need for this change. Since the super replicator user removes the ability that a database has to protect the consistency of its data, and that there does not seem to be a work-around, I would rather not see this change pushed to CouchDb. JP On 11-08-16 10:26 AM, Adam Kocoloski wrote: One of the principal uses of the replicator is to make this database look like that one. We're unable to do that in the general case today because of the combination of validation functions and out-of-order document transfers. It's entirely possible for a document to be saved in the source DB prior to the installation of a ddoc containing a validation function that would have rejected the document, for the replicator to install the ddoc in the target DB before replicating the other document, and for the other document to then be rejected by the target DB. I propose we add a role which allows a user to bypass validation, or else extend that privilege to the _admin role. We should still validate updates by default and add a way (a new qs param, for instance) to indicate that validation should be skipped for a particular update. Thoughts? Adam
Re: The replicator needs a superuser mode
On Tue, Aug 16, 2011 at 16:23, Paul Davis paul.joseph.da...@gmail.com wrote: Think of plain text backups and last resort upgrade paths. Also, it wouldn't have validation docs run on it or anything of that nature. I'm thinking basically of having a multipart/mime stream representation of the database that follows the update sequence. And the _dump would allow for a ?since= parameter that would make it incremental. This would even give people the ability to do daily logs and so on. Right-o. I don't feel strongly about it, like I said, and think it could be easily crafted as a plugin if we get *that* situation sorted out. How's my assessment of the need for a special role or validation skipping, though? Am I right that one could just create a smart validation function? -Randall
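To make the "smart validation function" idea concrete, here is a sketch of a validation function that considers the role (the 'replicator' role name is an example, not a built-in):

function(newDoc, oldDoc, userCtx, secObj) {
  // Writers holding the replication role bypass all further checks, so a
  // replication performed under that role can always complete.
  if (userCtx.roles.indexOf('replicator') !== -1) {
    return;
  }
  // Everyone else is held to the usual application rules.
  if (!newDoc.type) {
    throw({forbidden: 'Documents must carry a type field'});
  }
}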
Bug or my lack of understanding? Reduce output must shrink more rapidly
Hello, I have been able to reduce a complex case where a certain-sized document within our application causes "Reduce output must shrink more rapidly" errors and I am not sure I understand why. I spent a great deal of time making sure I have stripped the database, the documents and the views to the bare minimum to make it easy to reproduce. I would really appreciate it if anyone could give me some insight into what is causing this and whether a fix exists, be it ini settings or something else. I apologize in advance if this is my lack of understanding of views or how they work, and for this email being a bit long, but I think it is required to express the issue in case it is indeed a bug. Kind Regards, -Chris

--Reproduce steps--

1) CouchDB production release 1.1.0

2) Create a fresh database

3) Create the following design document:

{
  "_id": "_design/test",
  "_rev": "1-19eb11313c2602a00f0105f78202d1f3",
  "language": "javascript",
  "views": {
    "Grid": {
      "map": "function(doc) { emit(\"result\", doc.data); }",
      "reduce": "function(keys, values, rereduce) { var container = {}; if (!rereduce) { for (var value in values) { for (var col in values[value]) { if (values[value]) { if (!container[col]) { container[col] = {total: 0}; } container[col].total++; } } } } else { for (var reduced in values) { for (var col in values[reduced]) { if (!container[col]) { container[col] = {total: 0}; } container[col].total += values[reduced][col].total; } } } return container; }"
    }
  }
}

4) Create the following regular document (any id is okay):

{
  "_id": "4334dff68f2283e6e8739eabb40a4e7a",
  "_rev": "24-524e9c9ebeaf88962f41e3a940788610",
  "data": {
    "C003089": "c1", "C006990": "c2", "C009996": "c3", "C012132": "c4",
    "C015574": "c5", "C018908": "c6", "C021545": "c7", "C024392": "c8",
    "C027281": "c9", "C030392": "c10", "C033457": null, "C036671": null,
    "C039663": null, "C042967": null, "C045398": null, "C048160": null,
    "C051924": null, "C054920": null, "C057239": null, "C060993": null,
    "C063309": null, "C066352": null, "C069003": null, "C072467": null,
    "C075210": null
  }
}

5) Call the view, just a typical call, no arguments:

http://SERVER:5984/db_24/_design/test/_view/Grid

6) Verify the response is CORRECT:

{"rows":[{"key":null,"value":{"C003089":{"total":1},"C006990":{"total":1},"C009996":{"total":1},"C012132":{"total":1},"C015574":{"total":1},"C018908":{"total":1},"C021545":{"total":1},"C024392":{"total":1},"C027281":{"total":1},"C030392":{"total":1},"C033457":{"total":1},"C036671":{"total":1},"C039663":{"total":1},"C042967":{"total":1},"C045398":{"total":1},"C048160":{"total":1},"C051924":{"total":1},"C054920":{"total":1},"C057239":{"total":1},"C060993":{"total":1},"C063309":{"total":1},"C066352":{"total":1},"C069003":{"total":1},"C072467":{"total":1},"C075210":{"total":1}}}]}

7) Now, delete the previous document and add the following (identical except for one extra property, C078387):

{
  "_id": "4334dff68f2283e6e8739eabb40a4e7a",
  "_rev": "24-524e9c9ebeaf88962f41e3a940788610",
  "data": {
    "C003089": "c1", "C006990": "c2", "C009996": "c3", "C012132": "c4",
    "C015574": "c5", "C018908": "c6", "C021545": "c7", "C024392": "c8",
    "C027281": "c9", "C030392": "c10", "C033457": null, "C036671": null,
    "C039663": null, "C042967": null, "C045398": null, "C048160": null,
    "C051924": null, "C054920": null, "C057239": null, "C060993": null,
    "C063309": null, "C066352": null, "C069003": null, "C072467": null,
    "C075210": null, "C078387": null
  }
}

8) Note that all we did was add a single property to the end of data; now run the same view again.

9) Notice the error:

{"error":"reduce_overflow_error","reason":"Reduce output must shrink more rapidly: Current output: '[{\"C003089\":{\"total\":1},\"C006990\":{\"total\":1},\"C009996\":{\"total\":1},\"C012132\":{\"total\":1},\"C015574\":'... (first 100 of 575 bytes)"}

10) I am confused because all I did was add a single property; I am not sure how this affects the reduce function.
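For context on why step 9 trips at exactly this point: the query server applies a heuristic, not a fixed output cap. As best I can recall from the 1.1-era share/server/views.js (worth verifying against your copy), a reduction is rejected only when its serialized output is both over 200 bytes and more than half the length of the input line it was computed from, roughly:

// Paraphrase of the overflow heuristic, not the actual CouchDB source.
function reduceOverflows(inputLength, outputLength) {
  return outputLength > 200 && outputLength * 2 > inputLength;
}

Since this reduce output grows with the number of distinct column names, adding a 26th property plausibly pushed the 575-byte output just past that ratio, which would explain why one extra field flips the error on.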
Re: The replicator needs a superuser mode
On Aug 16, 2011, at 5:46 PM, Randall Leeds wrote: -1 on _skip_validation and new role One can always write a validation document that considers the role, no? Why can't users who need this functionality craft a validation function for this purpose? This sounds like a blog post and not a database feature. Blech, really? Q: What request do I issue to guarantee all my documents are stored in this other database? A: Unpossible. Practically speaking we need it at Cloudant because we use replication to move users' databases between clusters. If it's not seen as generally useful that's ok, just surprising. Best, Adam
Re: The replicator needs a superuser mode
On Tue, Aug 16, 2011 at 17:03, Adam Kocoloski kocol...@apache.org wrote: Practically speaking we need it at Cloudant because we use replication to move users' databases between clusters. If it's not seen as generally useful that's ok, just surprising. I understand the motivation a little better now. I'm not sure it's generally useful. I think _dump/_load might be, but I'd rather see users craft around validation as part of their replication strategy rather than increase the query option population. I'm not sure I'm against admin user context bypassing validation docs, though.
Re: Bug or my lack of understanding? Reduce output must shrink more rapidly
On Tue, Aug 16, 2011 at 17:03, Chris Stockton chrisstockto...@gmail.com wrote: I have been able to reduce a complex case where a certain-sized document within our application causes "Reduce output must shrink more rapidly" errors and I am not sure I understand why. Since you are collecting and creating keys in the output object, adding this single property made the output of the reduce larger. CouchDB tries to detect reduce functions that don't actually reduce the data. If you know for sure that you are working with a bounded set of properties whose occurrences you would like to sum, you may set reduce_limit = false in your configuration. The default is true so that users don't shoot themselves in the foot (especially because you cannot cancel a runaway reduce if you don't have access to the machine to kill the process).
Re: Bug or my lack of understanding? Reduce output must shrink more rapidly
Hello, On Tue, Aug 16, 2011 at 5:37 PM, Randall Leeds randall.le...@gmail.com wrote: If you know for sure that you are working with a bounded set of properties whose occurrences you would like to sum, you may set reduce_limit = false in your configuration. Thanks Randall for your reply. I changed my view call to [1] and oddly it still gives the same error; maybe I am doing something wrong? I didn't see anything on the couchdb wiki about reduce_limit. Although I think long term that kind of scares me a little bit: if for some reason we ran across some new data that caused an infinite reduce due to a bug, our couchdbs would all get crippled. Do I have any other options here? It would be great if I could impose a size limit for reduce, or even a minimum size limit, as it is odd to trigger a reduce error on the first record; making it have to run at least 100 times should be a good test to see if the data is shrinking or at least remaining constant. Not sure what to suggest here beyond that, I just think it doesn't feel quite right; maybe someone has a better suggestion. [1] http://server:59841/db_24/_design/test/_view/Grid?reduce_limit=false
Re: Configuration Load Order
On Wed, Aug 17, 2011 at 5:03 AM, Randall Leeds randall.le...@gmail.com wrote: I would also be in favor of enforcing a load order that supports a directory structure like:

local.d/
  010-stuff.ini
  020-others.ini

IMHO, this is madness. The American quip goes: the professor who never even ran for dog catcher presumes to tell the president how to do his job. Developers who spend all day in ./utils/run pontificate about good daemon behavior in an OS or distribution. (I don't *really* believe this. I know several of you are responsible for production couches, but that is the flash-bulb image in my mind.) I don't feel strongly on the matter, just want to share a sysadmin's perspective. Any of the proposals would be an improvement, so I'm net-happy. Some final apologist thoughts: My proposal is already implemented. Now I say promote HTTP config (Futon) over .ini files when possible. Integrators, packagers, and advanced sysadmins can attack the .ini files just as before. CouchDB stores versioned data, with a powerful validation and audit tool (potentially; I'm thinking about validate_doc_update and log()). Now we are invoking use cases of versioning the config and auditing it. Wow! My point is not that the config (or some of it) should be in a database, but that the config should (1) *lose* complexity over time, not gain it; and (2) be deprecated as an implementation detail, or reserved for advanced users. Config files that change themselves are bizarre and scary. If that's what we've got, fine, but make it as simple as possible. Admins, passwords, and non-bootstrappy configuration over HTTP seem more Couch-like, more of the web, and more relaxed. Take a MySQL admin, or an admin of Drupal, Wordpress, Moodle, Joomla, or pretty much any big PHP application. Tell them this: You have to get CouchDB up in the first place. So you edit some config files. Once it's up, you connect with your client/browser. It assumes you are an admin, and you complete installation over that interface. They would respond: Yeah, sure. I do not buy the misbehaving Couch scenario. Firstly, how common is that? After installation and confirmation, daemons get pretty stable. If a misconfiguration totally destroys the couch, well, the files are still plain text. As before, load emacs and go for it! Finally, I am basically happy with the Couch config. It's quirky but not too bad.
I only hope to share a fresh perspective: the viewpoint of people for whom couch is just another daemon, like MySQL or httpd or cron. -- Iris Couch
Re: The replicator needs a superuser mode
On Tue, Aug 16, 2011 at 10:24 PM, Jan Lehnardt j...@apache.org wrote: This is only slightly related, but I'm dreaming of /db/_dump and /db/_restore endpoints Jan, I also had that dream at CouchOne, but now I think it is a very bad idea. A database is a URL. Every URL is different. Cloning URL_A to URL_B is tempting, but fundamentally anti-CouchDB. There is a reason the security object does not replicate. Every URL (or origin) is a different security environment, and it is meaningless or wrong to apply A's security object to B's database. Validation functions decide what to allow based on userCtx and secObj. Both of those change (generally) with the URL. Cloning one database to another IMO spits in the face of the architecture and philosophy of replication. IMHO, cloning a *database* is not desirable. Long-term, you really want to replicate a database. Cloning a *couch* (GET /_dump, PUT /_restore) would be awesome. That is the right abstraction level. Among other reasons, it can include the config. Maybe that is mission creep. -- Iris Couch
Re: Bug or my lack of understanding? Reduce output must shrink more rapidly
On Tue, Aug 16, 2011 at 17:53, Chris Stockton chrisstockto...@gmail.com wrote: I changed my view call to [1] and oddly it still gives the same error, maybe I am doing something wrong? I didn't see anywhere on couchdb wiki anything for reduce_limit. After this I'll tell you how to change that setting, but first you should consider restructuring your map/reduce. For example, instead of building an object with these counts in memory and trying to reduce it over reduce/rereduce, just emit multiple rows:

map:

function(doc) {
  for (var col in doc.data) {
    emit(col, 1);
  }
}

reduce:

_sum

This way you can use the built-in reduction by specifying just the string _sum as your reduce, which is much more efficient than doing it yourself. Also, you don't hit the reduce limit. Anyway, in case you *do* work with your own installation and want to lift the reduce limit sometime, here's how. If you look in default.ini you will see the section [query_server_config] with reduce_limit = true. You could put something like this in your local.ini:

[query_server_config]
reduce_limit = false

If you don't have access to the box you should be able to issue:

PUT http://server/_config/query_server_config/reduce_limit

The body of the request should be the quoted JSON string "false". For example, with cURL, you might do:

curl -X PUT -H 'Content-Type: application/json' -d '"false"' http://server/_config/query_server_config/reduce_limit

(Note that the data here is single- and double-quoted to ensure the double quotes are passed as part of the body and not removed by the shell.) If you get an error, e.g. because you're using IrisCouch or some other service which locks down the installation a bit, you'll have to contact their support.
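Spelled out against Chris's repro documents (iterating doc.data and the ?group=true query are additions here, not something from Randall's message), the reworked design document might look like:

{
  "_id": "_design/test",
  "language": "javascript",
  "views": {
    "Grid": {
      "map": "function(doc) { for (var col in doc.data) { emit(col, 1); } }",
      "reduce": "_sum"
    }
  }
}

Queried with ?group=true, this returns one row per column name, e.g. {"key": "C003089", "value": 1}, which carries the same information as the old container object while the reduce value itself stays a single number.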
Re: The replicator needs a superuser mode
On Aug 16, 2011, at 8:20 PM, Randall Leeds wrote: I understand the motivation a little better now. I'm not sure it's generally useful. I think _dump/_load might be, but I'd rather see users craft around validation as part of their replication strategy rather than increase the query option population. I'm not sure I'm against admin user context bypassing validation docs, though. That's interesting. It sounds like you're motivated to minimize the surface area of the API. I can respect that. I'm not sure I like _admins automatically bypassing validation, though, because we already require _admin to update _design docs, so it's not as if we can make the use of _admin particularly rare. Will think on it. Best, Adam
[jira] [Updated] (COUCHDB-1246) CouchJS process spawned and not killed on each Reduce Overflow Error
[ https://issues.apache.org/jira/browse/COUCHDB-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Filipe Manana updated COUCHDB-1246: --- Attachment: os_pool_trunk.patch Paul, as we were discussing on IRC, I've a reproducible case where querying a view just hangs and we get os pool full, blocking subsequent requests. I attach here a wip patch, which adds an etap test (your patch is making this test fail). CouchJS process spawned and not killed on each Reduce Overflow Error Key: COUCHDB-1246 URL: https://issues.apache.org/jira/browse/COUCHDB-1246 Project: CouchDB Issue Type: Bug Components: JavaScript View Server Affects Versions: 1.1 Environment: Linux Debian Squeeze [query_server_config] reduce_limit = true os_process_limit = 25 Reporter: Michael Newman Attachments: COUCHDB-1246.patch, categories, os_pool_trunk.patch Running the view attached results in a reduce_overflow_error. For each reduce_overflow_error a process of /usr/lib/couchdb/bin/couchjs /usr/share/couchdb/server/main.js starts running. Once this gets to 25, which is the os_process_limit by default, all views result in a server error: timeout {gen_server,call,[couch_query_servers,{get_proc,javascript}]} As far as I can tell, these processes and the non-response from the views will continue until couch is restarted. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (COUCHDB-1246) CouchJS process spawned and not killed on each Reduce Overflow Error
[ https://issues.apache.org/jira/browse/COUCHDB-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13086083#comment-13086083 ] Paul Joseph Davis commented on COUCHDB-1246: Filipe, Awesome work on the test. When I stared at it last I was contemplating writing a huge monolithic thing to set up a view and do all that crazy stuff. Knowing we can work down at the os process level should allow us to get a better handle on this. I'll try looking at this closer tonight or tomorrow.
Re: The replicator needs a superuser mode
On 17 August 2011 02:47, Adam Kocoloski kocol...@apache.org wrote: I'm not sure I like _admins automatically bypassing validation, though, because we already require _admin to update _design docs, so it's not as if we can make the use of _admin particularly rare. Will think on it. Just to point out a very useful use case for /_dump and /_load endpoints: on mobile we need to ship preloaded data / applications. I originally curl'd design docs and PUT them on startup, but the resulting files are large and startup time is slow, and replicating isn't an option. Now we use .couch files to preload data; however, all my stuff is on a hosted server where I don't have access to scp (I can just copy them down to servers where I can access .couch files, but I'm speaking on behalf of new users / making things as easy as possible).
Re: The replicator needs a superuser mode
On Wed, Aug 17, 2011 at 7:03 AM, Adam Kocoloski kocol...@apache.org wrote: Q: What request do I issue to guarantee all my documents are stored in this other database? A: Unpossible. Practically speaking we need it at Cloudant because we use replication to move users' databases between clusters. If it's not seen as generally useful that's ok, just surprising. Adam, I'm conflicted. It feels presumptuous to disagree with you and the developers, which I've done a lot recently. Also, I too struggle with migrating data, verbatim, between servers (between couches, and also between Linux boxes). But to guarantee all my documents are stored in this other database is actually incoherent. It is IMHO anti-CouchDB. Validation functions, user accounts (which change from couch to couch), and security objects (which also change from db to db, and couch to couch) all come together to decide whether a change is approved (valid). That is very powerful, and very fundamental. Providing this guarantee betrays the promise that Couch makes to developers. People are using validation functions for government compliance, to meet regulatory requirements (SOX, HIPAA). IIRC, you are proposing a query parameter for Couch to disregard those instructions. Validation functions confirm not only authorization, but also well-formedness of the documents. So, again, in the real world, where many people use _admin accounts, adding a ?force=true parameter sounds dangerous. Do you worry whether, in the wild, people will use it more and more, like logging in to your workstation as root/Administrator? It eliminates daily annoyances but it is actually very risky behavior. Finally, yes, an admin can ultimately circumvent validation functions. But to me, that is the checks and balances of real life. If you forget your BIOS password, you can physically open the box and move a jumper. I do agree about the need to move opaque data around. I disagree that a query parameter should allow it. I feel the hosting provider pain. The customer creates _design/angry with:

validate_doc_update: function(newDoc, oldDoc, userCtx, secObj) {
  throw({forbidden: "I am _design/angry and I hate all documents!"});
}

And now I am responsible for replicating their data, unmolested, all over the place. -- Iris Couch
Re: The replicator needs a superuser mode
On Tue, Aug 16, 2011 at 9:26 PM, Adam Kocoloski kocol...@apache.org wrote: One of the principal uses of the replicator is to make this database look like that one. We're unable to do that in the general case today because of the combination of validation functions and out-of-order document transfers. Somebody asked about this on Stack Overflow. It was a very simple but challenging question, but now I can't find it. Basically, he made your point above. Aren't you identifying two problems, though? 1. Sometimes you need to ignore validation to just make a nice, clean copy. 2. Replication batches (an optimization) are disobeying the change sequence, which can screw up the replica. I responded to #1 already. But my feeling about #2 is that the optimization goes too far. Replication batches should always have boundaries immediately before and after design documents. In other words, batch all you want, but design documents [1] must always be in a batch of size 1. That will retain the semantics. [1] Actually, the only ddocs needing their own private batches are those with a validate_doc_update field. -- Iris Couch
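A sketch of the batching rule Jason proposes, in JavaScript for illustration (pseudocode-level, not the replicator's actual Erlang internals):

// Split a stream of changed docs into batches such that any design doc
// carrying a validate_doc_update travels alone, in sequence order, so
// its validation rules take effect before any later document is written.
function splitBatches(changedDocs) {
  var batches = [];
  var current = [];
  changedDocs.forEach(function(doc) {
    if (doc._id.indexOf('_design/') === 0 && doc.validate_doc_update) {
      if (current.length) batches.push(current);
      batches.push([doc]); // the ddoc gets a private batch of one
      current = [];
    } else {
      current.push(doc);
    }
  });
  if (current.length) batches.push(current);
  return batches;
}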
Re: The replicator needs a superuser mode
On Aug 16, 2011, at 10:23 PM, Jason Smith wrote: Adam, I'm conflicted. It feels presumptuous to disagree with you and the developers, which I've done a lot recently. Also, I too struggle with migrating data, verbatim, between servers (between couches, and also between Linux boxes). But to guarantee all my documents are stored in this other database is actually incoherent. It is IMHO anti-CouchDB. Hi Jason, we're going to have to disagree on this one. Replication is really flexible and can do lots of things that database replication has not historically been able to do, but I think it's a sad state of affairs that it's not possible to use replication to create a replica of an arbitrary database. Validation functions, user accounts (which change from couch to couch), and security objects (which also change from db to db, and couch to couch) all come together to decide whether a change is approved (valid). That is very powerful, and very fundamental. Providing this guarantee betrays the promise that Couch makes to developers. No, it doesn't. The guarantee presumes you have _admin access to the target database. Developers shouldn't give that out, just like they shouldn't give out root access to the server itself. People are using validation functions for government compliance, to meet regulatory requirements (SOX, HIPAA). IIRC, you are proposing a query parameter for Couch to disregard those instructions. Only if you have _admin access to the database, in which case you can already bypass validation or do whatever else you want to the data in that database if you're so inclined. Validation functions confirm not only authorization, but also well-formedness of the documents. So, again, in the real world, where many people use _admin accounts, adding a ?force=true parameter sounds dangerous. Well, yes, it would be dangerous to use on every request. Do you worry whether, in the wild, people will use it more and more, like logging in to your workstation as root/Administrator? It eliminates daily annoyances but it is actually very risky behavior. Meh. If they choose to bypass their own validation functions that's their concern. I don't lose sleep over it. Finally, yes, an admin can ultimately circumvent validation functions. But to me, that is the checks and balances of real life. If you forget your BIOS password, you can physically open the box and move a jumper. I do agree about the need to move opaque data around. I disagree that a query parameter should allow it. I feel the hosting provider pain. The customer creates _design/angry with:

validate_doc_update: function(newDoc, oldDoc, userCtx, secObj) {
  throw({forbidden: "I am _design/angry and I hate all documents!"});
}

And now I am responsible for replicating their data, unmolested, all over the place. -- Iris Couch
[jira] [Updated] (COUCHDB-1246) CouchJS process spawned and not killed on each Reduce Overflow Error
[ https://issues.apache.org/jira/browse/COUCHDB-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Filipe Manana updated COUCHDB-1246: --- Attachment: os_pool_trunk.patch Thanks Paul. I'll probably be updating the patch one or two times until then.
Re: The replicator needs a superuser mode
On Aug 16, 2011, at 10:31 PM, Jason Smith wrote: On Tue, Aug 16, 2011 at 9:26 PM, Adam Kocoloski kocol...@apache.org wrote: One of the principal uses of the replicator is to make this database look like that one. We're unable to do that in the general case today because of the combination of validation functions and out-of-order document transfers. It's entirely possible for a document to be saved in the source DB prior to the installation of a ddoc containing a validation function that would have rejected the document, for the replicator to install the ddoc in the target DB before replicating the other document, and for the other document to then be rejected by the target DB. Somebody asked about this on Stack Overflow. It was a very simple but challenging question, but now I can't find it. Basically, he made your point above. Aren't you identifying two problems, though? 1. Sometimes you need to ignore validation to just make a nice, clean copy. 2. Replication batches (an optimization) are disobeying the change sequence, which can screw up the replica. As far as I know the only reason one needs to ignore validation to make a nice clean copy is because the replicator does not guarantee the updates are applied on the target in the order they were received on the source. It's all one issue to me. I responded to #1 already. But my feeling about #2 is that the optimization goes too far. replication batches should always have boundaries immediately before and after design documents. In other words, batch all you want, but design documents [1] must always be in a batch size of 1. That will retain the semantics. [1] Actually, the only ddocs needing their own private batches are those with a validate_doc_update field. My standard retort to transaction boundaries is that there is no global ordering of events in a distributed system. A clustered CouchDB can try to build a vector clock out of the change sequences of the individual servers and stick to that merged sequence during replication, but even then the ddoc entry in the feed could be concurrent with several other updates. I rather like that the replicator aggressively mixes up the ordering of updates because it prevents us from making choices in the single-server case that aren't sensible in a cluster. By the way, I don't consider this line of discussion presumptuous in the least. Cheers, Adam
[jira] [Commented] (COUCHDB-1153) Database and view index compaction daemon
[ https://issues.apache.org/jira/browse/COUCHDB-1153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13086096#comment-13086096 ] Filipe Manana commented on COUCHDB-1153: Paul, I'm addressing all the concerns pointed out before. Some of them are already done and can be tracked in individual commits: https://github.com/fdmanana/couchdb/commits/compaction_daemon The thing I'm not 100% sure about is how to handle the config and the load/start of os_mon. I'm not that familiar with all those OTP structuring details. I've come up with this so far: http://friendpaste.com/R43WflJ8r75MupvXuS98v Using the args_file you mentioned makes sense, but we already have some stuff that could be moved into that new file, so I think it should go into a separate change. I'll help/do it, just need to figure out exactly how to do it and integrate it into the build system / startup scripts.

Database and view index compaction daemon - Key: COUCHDB-1153 URL: https://issues.apache.org/jira/browse/COUCHDB-1153 Project: CouchDB Issue Type: New Feature Environment: trunk Reporter: Filipe Manana Assignee: Filipe Manana Priority: Minor Labels: compaction

I've recently written an Erlang process to automatically compact databases and their views based on some configurable parameters. These parameters can be global or per database and are: minimum database fragmentation, minimum view fragmentation, allowed period and strict_window (whether an ongoing compaction should be canceled if it doesn't finish within the allowed period). These fragmentation values are based on the recently added data_size parameter of the database and view group information URIs (COUCHDB-1132). I've documented the .ini configuration, as a comment in default.ini, which I paste here:

[compaction_daemon]
; The delay, in seconds, between each check for which database and view indexes
; need to be compacted.
check_interval = 60
; If a database or view index file is smaller than this value (in bytes),
; compaction will not happen. Very small files always have a very high
; fragmentation, therefore it's not worth compacting them.
min_file_size = 131072

[compactions]
; List of compaction rules for the compaction daemon.
; The daemon compacts databases and their respective view groups when all the
; condition parameters are satisfied. Configuration can be per database or
; global, and it has the following format:
;
; database_name = parameter=value [, parameter=value]*
; _default = parameter=value [, parameter=value]*
;
; Possible parameters:
;
; * db_fragmentation - If the ratio (as an integer percentage) of the amount
;   of old data (and its supporting metadata) over the database file size is
;   equal to or greater than this value, this database compaction condition
;   is satisfied. This value is computed as:
;
;       (file_size - data_size) / file_size * 100
;
;   The data_size and file_size values can be obtained when querying a
;   database's information URI (GET /dbname/).
;
; * view_fragmentation - If the ratio (as an integer percentage) of the amount
;   of old data (and its supporting metadata) over the view index (view group)
;   file size is equal to or greater than this value, then this view index
;   compaction condition is satisfied. This value is computed as:
;
;       (file_size - data_size) / file_size * 100
;
;   The data_size and file_size values can be obtained when querying a view
;   group's information URI (GET /dbname/_design/groupname/_info).
;
; * period - The period for which a database (and its view groups) compaction
;   is allowed. This value must obey the following format:
;
;       HH:MM - HH:MM  (HH in [0..23], MM in [0..59])
;
; * strict_window - If a compaction is still running after the end of the
;   allowed period, it will be canceled if this parameter is set to yes.
;   It defaults to no and it's meaningful only if the *period* parameter is
;   also specified.
;
; * parallel_view_compaction - If set to yes, the database and its views are
;   compacted in parallel. This is only useful on certain setups, for example
;   when the database and view index directories point to different disks.
;   It defaults to no.
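As a worked example of that grammar (the database name and thresholds below are invented for illustration, not shipped defaults):

[compactions]
; Compact any database (and its views) at 70% / 60% fragmentation.
_default = db_fragmentation=70, view_fragmentation=60
; Be stricter with mydb: compact only between 01:00 and 05:00, cancel
; anything still running at 05:00, and compact views in parallel.
mydb = db_fragmentation=60, view_fragmentation=50, period=01:00 - 05:00, strict_window=yes, parallel_view_compaction=yes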
Re: The replicator needs a superuser mode
On Wed, Aug 17, 2011 at 9:49 AM, Adam Kocoloski kocol...@apache.org wrote: My standard retort to transaction boundaries is that there is no global ordering of events in a distributed system. A clustered CouchDB can try to build a vector clock out of the change sequences of the individual servers and stick to that merged sequence during replication, but even then the ddoc entry in the feed could be concurrent with several other updates. I rather like that the replicator aggressively mixes up the ordering of updates because it prevents us from making choices in the single-server case that aren't sensible in a cluster. That is interesting. So if it is crucial that an application enforce transaction semantics, then that application can go ahead and understand the distribution architecture, and it can confirm that a ddoc is committed and distributed among all nodes, and then it can make subsequent changes or replications. Or, written as a dialogue: Developer: My application knows or cares that Couch is distributed. Developer: My application depends on a validation function applying universally. Developer: But my application won't bother to confirm that it's been fully pushed before I make changes or replications. Adam: WTF? Snark aside, it's an excellent point. Thanks. -- Iris Couch
Re: The replicator needs a superuser mode
tl;dr response here, philosophical musings below. 1. The requirements are real, it's reasonable to want to copy from A to B 2. Replication is a whole worldview, adding ?force=true breaks that worldview 3. Dump and restore sounds more appropriate On Wed, Aug 17, 2011 at 9:34 AM, Adam Kocoloski kocol...@apache.org wrote: But to guarantee all my documents are stored in this other database is actually incoherent. It is IMHO anti-CouchDB. Hi Jason, we're going to have to disagree on this one. Replication is really flexible and can do lots of things that database replication has not historically been able to do, but I think it's a sad state of affairs that it's not possible to use replication to create a replica of an arbitrary database. True. I agree with the requirements, but the solution raises a red flag. My understanding of couch: There is no such thing as a database (or data set) clone. There is no such thing as a database copy. There is no such thing as two databases with the same document. It's like Pauli's exclusion principle. Sure, maybe the doc and rev history are the same, but the _security object, the authentication environment, and the URI are different. That (generally) affects how applications and validation works. Put another way, this idea is a leaky abstraction. I much prefer Jan's _dump and _restore idea. It has some difficulties, but it is *not* replication. It's something totally different. In the universe of a database, replication always follows the rules. In the universe of a Couch, sure, sometimes you need to clone data around. There's an appropriate action for each abstraction layer. The nice thing about _dump and _restore, and also rsync, is that you make full, opaque clones (not replicas!). You can't merge or splice data sets. Once you are talking about merging data, or pulling out a subset, now you are in database land, not couch land, and you have to follow the rules of replication. -- Iris Couch
Re: The replicator needs a superuser mode
On Tue, Aug 16, 2011 at 20:37, Jason Smith j...@iriscouch.com wrote: The nice thing about _dump and _restore, and also rsync, is that you make full, opaque clones (not replicas!). You can't merge or splice data sets. Once you are talking about merging data, or pulling out a subset, now you are in database land, not couch land, and you have to follow the rules of replication. Yeah, this is what I'm thinking, too. Except I'd reverse couch and database :)
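None of these endpoints exist yet, so purely to pin down the shape under discussion (the endpoint names and the ?since= parameter come from the proposals above; everything else is invented):

# Couch-level clone, Jason's framing: dump the whole server, restore elsewhere.
curl http://old-server:5984/_dump > couch.dump
curl -X POST http://new-server:5984/_restore --data-binary @couch.dump

# Database-level dump, Randall's framing: a multipart stream following the
# update sequence, resumable from a known sequence for incremental backups.
curl 'http://server:5984/dbname/_dump?since=1042' > incremental.dump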
[jira] [Updated] (COUCHDB-1246) CouchJS process spawned and not killed on each Reduce Overflow Error
[ https://issues.apache.org/jira/browse/COUCHDB-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Filipe Manana updated COUCHDB-1246: --- Attachment: os_pool_trunk.patch