Re: Couch clustering/partitioning Re: CouchSpray - Thoughts?
On 20 Feb 2009, at 02:34, Shaun Lindsay wrote: Hi all, So, a couple months ago we implemented almost exactly the couch clustering/partitioning solution described below.

Shaun, this sounds fantastic! :) I hope you can release the code for this. Cheers Jan --

The couch cluster (which we called 'The Lounge') sits behind nginx running a custom module that farms out the GETs and PUTs to the appropriate node/shard, and the views to a Python proxy daemon which handles reducing the view results from the individual shards and returning the full view. We have replication working between the cluster nodes so the shards exist in multiple places and, in the case of one of the nodes going down, the various proxies fail over to the backup shards. This clustering setup has been running in full production for several months now with minimal problems. We're looking to release all the code back to the community, but we need to clear it with our legal team first to make sure we're not compromising any of our more business-specific, proprietary code. In total, we have: an nginx module specifically set up for sharding databases; a 'smartproxy', written in Python/Twisted, for sharding views; and a few other ancillary pieces (replication notification, view updating, etc.). Mainly, I just wanted to keep people from duplicating the work we've done -- hopefully we can release something back to the community in the next several weeks. We're having a meeting tomorrow morning to figure out what we can release right now (probably the nginx module, at the least). I'll let everyone know what our timeline looks like. --Shaun Lindsay Meebo.com

On Thu, Feb 19, 2009 at 4:48 PM, Chris Anderson jch...@apache.org wrote: On Thu, Feb 19, 2009 at 4:35 PM, Ben Browning ben...@gmail.com wrote: So, I started thinking about partitioning with CouchDB and realized that since views are just map/reduce, we can do some magic that's harder, if not impossible, with other database systems. 
The idea in a nutshell is to create a proxy that sits in front of multiple servers and sprays the view queries to all servers, merging the results - hence CouchSpray. This would give us storage and processing scalability and could, with some extra logic, provide data redundancy and failover.

There are plans in CouchDB's future to take care of data partitioning, as well as querying views from a cluster. Theoretically, it should be pretty simple. There are a few small projects that have started down the road of writing code in this area. https://code.launchpad.net/~dreid/sectional/trunk Sectional is an Erlang HTTP proxy that implements consistent hashing for docs. I'm not sure how it handles view queries. There's also a project to provide partitioning around the basic key/value PUT and GET store using nginx: http://github.com/dysinger/nginx/tree/nginx_upstream_hash

If you're interested in digging into this stuff, please join d...@. We plan to include clustering in CouchDB, so if you're interested in implementing it, we could use your help. Chris -- Chris Anderson http://jchris.mfdz.com
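Consistent hashing, as Sectional uses for docs, maps each docid onto a ring of nodes so that adding or removing a node only remaps a small slice of keys. The sketch below illustrates the general technique only; the class, virtual-node count, and node names are invented for this example and are not taken from Sectional or the nginx module.

```python
import hashlib
from bisect import bisect

class ConsistentHashRing:
    """Illustrative consistent-hash ring for routing docids to couch nodes."""

    def __init__(self, nodes, vnodes=64):
        # Place several virtual points per node on the ring so keys
        # spread evenly and node removal remaps only that node's slice.
        self._ring = []
        for node in nodes:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{node}:{i}"), node))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, docid):
        # First ring point clockwise from the docid's hash, wrapping around.
        idx = bisect(self._keys, self._hash(docid)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["couch1:5984", "couch2:5984", "couch3:5984"])
print(ring.node_for("some-doc-id"))
```

A proxy would then forward the GET or PUT for that docid to the returned node, which is the routing step both projects above perform before any view handling.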
Re: Partitioned Clusters
Any thoughts as to how (or even if) this tree-wise result aggregation would work for externals? I'm thinking specifically about couchdb-lucene, where multi-node results aggregation is possible, given a framework like you propose here. The results that couchdb-lucene produces can already be aggregated, assuming there's a hook for the merge function (actually, perhaps it's exactly reduce-shaped)... B.

On Fri, Feb 20, 2009 at 3:12 AM, Chris Anderson jch...@apache.org wrote: On Thu, Feb 19, 2009 at 6:39 PM, Ben Browning ben...@gmail.com wrote: Overall the model sounds very similar to what I was thinking. I just have a few comments.

In this model documents are saved to a leaf node depending on a hash of the docid. This means that lookups are easy, and need only touch the leaf node which holds the doc. Redundancy can be provided by maintaining R replicas of every leaf node.

There are several use-cases where a true hash of the docid won't be the optimal partitioning key. The simple case is where you want to partition your data by user, and in most non-trivial cases you won't be storing all of a user's data under one document with the user's id as the docid. A fairly simple solution would be allowing the developer to specify a javascript function somewhere (not sure where this should live...) that takes a docid and spits out a partition key. Then I could just prefix all my doc ids for a specific user with that user's id and write the appropriate partition function.

View queries, on the other hand, must be handled by every node. The requests are proxied down the tree to leaf nodes, which respond normally. Each proxy node then runs a merge sort algorithm (which can sort in constant space proportional to the # of input streams) on the view results. This can happen recursively if the tree is deep. If the developer has control over partition keys as suggested above, it's entirely possible to have applications where view queries only need data from one partition. 
It would be great if we could do something smart here or have a way for the developer to indicate to Couch that all the data should be on only one partition. These are just nice-to-have features, and the described cluster setup could still be extremely useful without them.

I think they are both sensible optimizations. Damien's described the JS partition function before on IRC, so I think it fits into the model. As far as restricting view queries to just those docs within a particular id range, it might make sense to partition by giving each user their own database, rather than logic on the docid. In the case where you need data in a single db, but still have some queries that can be partitioned, it's still a good optimization. Luckily, even in the unoptimized case, if a node has no rows to contribute to the final view result then it should have a low impact on total resources needed to generate the result. Chris -- Chris Anderson http://jchris.mfdz.com
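The merge sort "in constant space proportional to # of input streams" described above is a k-way merge: each shard returns its view rows already sorted by key, and the proxy holds only one pending row per stream. As a sketch (Python's `heapq.merge` does exactly this lazily; the row shape here is illustrative, not CouchDB's exact wire format):

```python
import heapq

def merge_view_rows(shard_streams):
    # heapq.merge consumes the input iterators lazily, keeping one row
    # per stream in its heap: O(k) space for k shards, regardless of
    # the total number of rows.
    return heapq.merge(*shard_streams, key=lambda row: row["key"])

# Two shards, each with its rows pre-sorted by view key.
shard_a = iter([{"key": 1, "id": "a1"}, {"key": 4, "id": "a2"}])
shard_b = iter([{"key": 2, "id": "b1"}, {"key": 3, "id": "b2"}])

merged = [row["key"] for row in merge_view_rows([shard_a, shard_b])]
print(merged)  # [1, 2, 3, 4]
```

Because the output is itself a sorted stream, an inner proxy node can feed its merged result straight into its parent's merge, which is what makes the recursive tree arrangement work.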
Re: Stats Patch API Discussion
On 20 Feb 2009, at 03:31, Paul Davis wrote: I managed to clean up the errors by removing the two calls to couch_stats_[collector|aggregator]:stop() in couch_server_sup.erl at about line 35. Not at all certain if that's the appropriate method, but as near as I can tell it mimics the other secondary services.

Thanks, you've seen from the comment that this was a rather hard way to do this :) We needed, for testing, a way to reset statistics and figured, if we call reloadServer() in the test suite, we should nuke all stats. Since the tests will be redone anyway, this is no longer needed; I've updated the branch accordingly. Cheers Jan --

HTH, Paul Davis

On Thu, Feb 19, 2009 at 8:51 PM, Robert Dionne dio...@dionne-associates.com wrote: I'm seeing the same kinds of errors. The suite ran once from Futon but tends to fail on delayed_commits. It fails using the runner also. But it is working. I'm adding some functions to erl-couch to call stats from erlang clients. I've tried _stats and _stats/couchdb/request_time. I'm getting numbers back. It's very clean and the Eunit stuff is clearly a big win.

On Feb 19, 2009, at 8:16 PM, Chris Anderson wrote: Having trouble with the build. Or, rather, it builds just fine, but even the basics test seems to be failing, with lots of stuff like this in the logfile:

[error] [<0.43.0>] {error_report,<0.21.0>,
  {<0.43.0>,supervisor_report,
    [{supervisor,{local,couch_secondary_services}},
     {errorContext,child_terminated},
     {reason,normal},
     {offender,[{pid,<0.110.0>},
                {name,stats_collector},
                {mfa,{couch_stats_collector,start,[]}},
                {restart_type,permanent},
                {shutdown,brutal_kill},
                {child_type,worker}]}]}}

[error] [<0.43.0>] {error_report,<0.21.0>,
  {<0.43.0>,supervisor_report,
    [{supervisor,{local,couch_secondary_services}},
     {errorContext,child_terminated},
     {reason,normal},
     {offender,[{pid,<0.111.0>},
                {name,stats_aggregator},
                {mfa,{couch_stats_aggregator,start,[]}},
                {restart_type,permanent},
                {shutdown,brutal_kill},
                {child_type,worker}]}]}}

It really could be just me. Anyone else tried to run the tests on the current version of the branch? http://github.com/janl/couchdb/tree/old-stats-new I know I ran `make clean` and rebootstrapped. I'm launching CouchDB with `make dev utils/run`. It seems like Couch is running just fine, but eventually the failure of couch_stats_* to boot properly is causing it to drop http requests. The code looks clean and well documented, so this should be easy to fix, or maybe it's something on my end. It'd be helpful to hear if it works for others. Chris -- Chris Anderson http://jchris.mfdz.com
Re: Stats Patch API Discussion
On 20 Feb 2009, at 02:51, Robert Dionne wrote: I'm seeing the same kinds of errors. The suite ran once from Futon but tends to fail on delayed_commits. It fails using the runner also. But it is working. I'm adding some functions to erl-couch to call stats from erlang clients. I've tried _stats and _stats/couchdb/request_time. I'm getting numbers back.

Hi Robert, try running the delayed commits test a number of times. I found it to be an unstable test that sometimes fails, but mostly succeeds, even on trunk.

It's very clean and the Eunit stuff is clearly a big win.

Thanks! :) Cheers Jan --

On Feb 19, 2009, at 8:16 PM, Chris Anderson wrote: Having trouble with the build. Or, rather, it builds just fine, but even the basics test seems to be failing, with lots of stuff like this in the logfile:

[error] [<0.43.0>] {error_report,<0.21.0>,
  {<0.43.0>,supervisor_report,
    [{supervisor,{local,couch_secondary_services}},
     {errorContext,child_terminated},
     {reason,normal},
     {offender,[{pid,<0.110.0>},
                {name,stats_collector},
                {mfa,{couch_stats_collector,start,[]}},
                {restart_type,permanent},
                {shutdown,brutal_kill},
                {child_type,worker}]}]}}

[error] [<0.43.0>] {error_report,<0.21.0>,
  {<0.43.0>,supervisor_report,
    [{supervisor,{local,couch_secondary_services}},
     {errorContext,child_terminated},
     {reason,normal},
     {offender,[{pid,<0.111.0>},
                {name,stats_aggregator},
                {mfa,{couch_stats_aggregator,start,[]}},
                {restart_type,permanent},
                {shutdown,brutal_kill},
                {child_type,worker}]}]}}

It really could be just me. Anyone else tried to run the tests on the current version of the branch? http://github.com/janl/couchdb/tree/old-stats-new I know I ran `make clean` and rebootstrapped. I'm launching CouchDB with `make dev utils/run`. It seems like Couch is running just fine, but eventually the failure of couch_stats_* to boot properly is causing it to drop http requests. The code looks clean and well documented, so this should be easy to fix, or maybe it's something on my end. It'd be helpful to hear if it works for others. Chris -- Chris Anderson http://jchris.mfdz.com
Re: Partitioned Clusters
Hi, I thought I'd introduce myself since I'm new here on the couchdb list. I'm Stefan Karpinski. I've worked in the Monitoring Group at Akamai, Operations R&D at Citrix Online, and I'm nearly done with a PhD in computer networking at the moment. So I guess I've thought about this kind of stuff a bit ;-)

I'm curious what the motivation behind a tree topology is. Not that it's not a viable approach, just why that and not a load-balancer in front of a bunch of leaves with lateral propagation between the leaves? Why should the load-balancing/proxying/caching node even be running couchdb? One reason I can see for a tree topology would be the hierarchical cache effect. But that would likely only make sense in certain circumstances. Being able to configure the topology to meet various needs, rather than enforcing one particular topology, makes more sense to me overall.

On 2/20/09, Robert Newson robert.new...@gmail.com wrote: Any thoughts as to how (or even if) this tree-wise result aggregation would work for externals? I'm thinking specifically about couchdb-lucene, where multi-node results aggregation is possible, given a framework like you propose here. The results that couchdb-lucene produces can already be aggregated, assuming there's a hook for the merge function (actually, perhaps it's exactly reduce-shaped)... B.

On Fri, Feb 20, 2009 at 3:12 AM, Chris Anderson jch...@apache.org wrote: On Thu, Feb 19, 2009 at 6:39 PM, Ben Browning ben...@gmail.com wrote: Overall the model sounds very similar to what I was thinking. I just have a few comments. In this model documents are saved to a leaf node depending on a hash of the docid. This means that lookups are easy, and need only touch the leaf node which holds the doc. Redundancy can be provided by maintaining R replicas of every leaf node. There are several use-cases where a true hash of the docid won't be the optimal partitioning key. 
The simple case is where you want to partition your data by user, and in most non-trivial cases you won't be storing all of a user's data under one document with the user's id as the docid. A fairly simple solution would be allowing the developer to specify a javascript function somewhere (not sure where this should live...) that takes a docid and spits out a partition key. Then I could just prefix all my doc ids for a specific user with that user's id and write the appropriate partition function.

View queries, on the other hand, must be handled by every node. The requests are proxied down the tree to leaf nodes, which respond normally. Each proxy node then runs a merge sort algorithm (which can sort in constant space proportional to the # of input streams) on the view results. This can happen recursively if the tree is deep. If the developer has control over partition keys as suggested above, it's entirely possible to have applications where view queries only need data from one partition. It would be great if we could do something smart here or have a way for the developer to indicate to Couch that all the data should be on only one partition. These are just nice-to-have features, and the described cluster setup could still be extremely useful without them.

I think they are both sensible optimizations. Damien's described the JS partition function before on IRC, so I think it fits into the model. As far as restricting view queries to just those docs within a particular id range, it might make sense to partition by giving each user their own database, rather than logic on the docid. In the case where you need data in a single db, but still have some queries that can be partitioned, it's still a good optimization. Luckily, even in the unoptimized case, if a node has no rows to contribute to the final view result then it should have a low impact on total resources needed to generate the result. Chris -- Chris Anderson http://jchris.mfdz.com
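The docid-prefix idea above (all of a user's docids start with the user's id, and the developer-supplied function maps a docid to a partition key) can be sketched as follows. The thread proposes a JavaScript function; this Python version is only an illustration of the same logic, and the `/` separator and shard count are invented for the example.

```python
import hashlib

def partition_for(docid, num_shards):
    # Developer-chosen rule: the partition key is the docid's user-id
    # prefix, so all of one user's docs hash to the same shard.
    user_key = docid.split("/", 1)[0]   # "alice/settings" -> "alice"
    h = int(hashlib.md5(user_key.encode()).hexdigest(), 16)
    return h % num_shards

# Both docs belong to "alice", so they land on the same partition.
print(partition_for("alice/settings", 8) == partition_for("alice/inbox", 8))  # True
```

With that guarantee in place, a view query scoped to one user could in principle be sent to a single shard instead of all of them, which is the optimization discussed above.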
Re: Using HTTP headers
You can do it via XMLHttpRequest. Not sure if all JS libs support it but YUI does. No support via Forms though. dave On Thu, Feb 19, 2009 at 3:38 PM, Stefan Karpinski stefan.karpin...@gmail.com wrote: Do browsers allow setting of custom headers? I'm fairly certain they don't, meaning that any control of CouchDB accomplished that way would be unavailable to pure CouchApps that run in a browser. That seems like a major design limitation unless I'm missing something. On Thu, Feb 19, 2009 at 3:19 PM, Chris Anderson jch...@apache.org wrote: On Thu, Feb 19, 2009 at 2:49 PM, Noah Slater nsla...@apache.org wrote: On Thu, Feb 19, 2009 at 05:46:20PM -0500, Paul Davis wrote: My only real point is that the whole issue is rather gray and we should look around to see if maybe there's already a proposed header of similar intent. The cache control was just me trying to make the point that this is mostly just the product of slight differences in interpretation. As clearly demonstrated by the length and content of the thread. :D I poked around the WebDAV RFC but found nothing of note. Full-Commit clearly doesn't belong in the request body (that's where the doc goes) and it doesn't quite fit in the resource identifier either. There's not much left but the headers... http://tools.ietf.org/html/rfc3864 doesn't look like too much trouble, and then we'd be playing by the rules. -- Chris Anderson http://jchris.mfdz.com
Re: Partitioned Clusters
On Fri, Feb 20, 2009 at 10:55 AM, Stefan Karpinski stefan.karpin...@gmail.com wrote: Hi, I thought I'd introduce myself since I'm new here on the couchdb list. I'm Stefan Karpinski. I've worked in the Monitoring Group at Akamai, Operations R&D at Citrix Online, and I'm nearly done with a PhD in computer networking at the moment. So I guess I've thought about this kind of stuff a bit ;-)

Glad to have you with us. :)

I'm curious what the motivation behind a tree topology is. Not that it's not a viable approach, just why that and not a load-balancer in front of a bunch of leaves with lateral propagation between the leaves? Why should the load-balancing/proxying/caching node even be running couchdb?

The reason to write the proxies in Erlang is that they can avoid the JSON and HTTP overhead until the final stage, as well as use Erlang's inter-node communication and process management mojo. The tree structure also provides a nice mapping onto the existing reduce implementation. Inner nodes can store the reduction values for their leaf nodes and run the reduce function to come up with total values.

One reason I can see for a tree topology would be the hierarchical cache effect. But that would likely only make sense in certain circumstances. Being able to configure the topology to meet various needs, rather than enforcing one particular topology, makes more sense to me overall.

I agree - as Ben points out, the flat topology is just a special case of the tree (and would probably be ideal for anything less than hundreds of nodes). I'm not an expert on cluster layout, but the tree structure appeals to me mostly because changes to subtrees don't need to be propagated to the cluster root. That said, there's *plenty* that can be done with HTTP proxies (and probably implemented more quickly), so it's probably the best way to prototype any of these implementations. Chris -- Chris Anderson http://jchris.mfdz.com
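The inner-node reduction described above relies on CouchDB reduce functions accepting a "rereduce" mode: an inner node can combine the partial reductions sent up by its children without ever seeing the raw rows. A minimal sketch of that combining step, using a `_sum`-style reduce (the function names are illustrative, not CouchDB's actual internals):

```python
def reduce_sum(keys, values, rereduce):
    # A CouchDB-style reduce function: in rereduce mode, `values` holds
    # previously reduced results rather than raw mapped values. For a
    # plain sum both modes are the same computation.
    return sum(values)

def combine_at_inner_node(child_reductions):
    # Each child (leaf shard or subtree) sends one partial reduction;
    # the inner node rereduces them into a single value for its parent.
    return reduce_sum(None, child_reductions, rereduce=True)

leaf_results = [10, 32]                       # partial sums from two leaf shards
print(combine_at_inner_node(leaf_results))    # 42
```

Because the combining step is associative, the same function can run at every level of the tree, which is the "nice mapping onto the existing reduce implementation" mentioned above.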
Re: Using HTTP headers
Ok, that seems reasonable then. I just checked and Flash also supports it, so that covers the two major categories of RIA applications. Carry on... On Fri, Feb 20, 2009 at 11:13 AM, Dave Bordoley bordo...@gmail.com wrote: You can do it via XMLHttpRequest. Not sure if all JS libs support it but YUI does. No support via Forms though. dave On Thu, Feb 19, 2009 at 3:38 PM, Stefan Karpinski stefan.karpin...@gmail.com wrote: Do browsers allow setting of custom headers? I'm fairly certain they don't, meaning that any control of CouchDB accomplished that way would be unavailable to pure CouchApps that run in a browser. That seems like a major design limitation unless I'm missing something. On Thu, Feb 19, 2009 at 3:19 PM, Chris Anderson jch...@apache.org wrote: On Thu, Feb 19, 2009 at 2:49 PM, Noah Slater nsla...@apache.org wrote: On Thu, Feb 19, 2009 at 05:46:20PM -0500, Paul Davis wrote: My only real point is that the whole issue is rather gray and we should look around to see if maybe there's already a proposed header of similar intent. The cache control was just me trying to make the point that this is mostly just the product of slight differences in interpretation. As clearly demonstrated by the length and content of the thread. :D I poked around the WebDAV RFC but found nothing of note. Full-Commit clearly doesn't belong in the request body (that's where the doc goes) and it doesn't quite fit in the resource identifier either. There's not much left but the headers... http://tools.ietf.org/html/rfc3864 doesn't look like too much trouble, and then we'd be playing by the rules. -- Chris Anderson http://jchris.mfdz.com
[jira] Commented: (COUCHDB-41) Differentiate between not existent database and not existant document in response on 404
[ https://issues.apache.org/jira/browse/COUCHDB-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12675450#action_12675450 ]

Till Klampaeckel commented on COUCHDB-41:
-

I'd like more obvious error messages: "Document not found." / "Database does not exist."

Differentiate between not existent database and not existant document in response on 404

Key: COUCHDB-41
URL: https://issues.apache.org/jira/browse/COUCHDB-41
Project: CouchDB
Issue Type: Improvement
Components: HTTP Interface
Environment: CouchDB 0.7.2
Reporter: Kore Nordmann
Assignee: Jan Lehnardt
Priority: Minor

Currently it is not possible to know, from a request on a non-existent document in the database, whether the database does not exist or whether it is just the document which is missing. It would be nice to have this information in the response JSON structure, so that the application using CouchDB could handle such errors more gracefully. An extract from a CouchDB interaction showing the problem:

== Ensure database is absent

DELETE /test HTTP/1.0
Host: localhost

HTTP/1.0 404 Object Not Found
Server: inets/develop
Date: Tue, 15 Apr 2008 20:54:20 GMT
Cache-Control: no-cache
Pragma: no-cache
Expires: Tue, 15 Apr 2008 20:54:20 GMT
Connection: close
Content-Type: text/plain;charset=utf-8

{"error":"not_found","reason":"missing"}

== Try GET on absent database

GET /test/not_existant HTTP/1.0
Host: localhost

HTTP/1.0 404 Object Not Found
Server: inets/develop
Date: Tue, 15 Apr 2008 20:54:20 GMT
Cache-Control: no-cache
Pragma: no-cache
Expires: Tue, 15 Apr 2008 20:54:20 GMT
Connection: close
Content-Type: text/plain;charset=utf-8

{"error":"not_found","reason":"missing"}

== Create database, but not the document

PUT /test HTTP/1.0
Host: localhost

HTTP/1.0 201 Created
Server: inets/develop
Date: Tue, 15 Apr 2008 20:54:20 GMT
Cache-Control: no-cache
Pragma: no-cache
Expires: Tue, 15 Apr 2008 20:54:20 GMT
Connection: close
Content-Type: text/plain;charset=utf-8

{"ok":true}

== Try to fetch document again

GET /test/not_existant HTTP/1.0
Host: localhost

HTTP/1.0 404 Object Not Found
Server: inets/develop
Date: Tue, 15 Apr 2008 20:54:20 GMT
Cache-Control: no-cache
Pragma: no-cache
Expires: Tue, 15 Apr 2008 20:54:20 GMT
Connection: close
Content-Type: text/plain;charset=utf-8

{"error":"not_found","reason":"missing"}

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
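Until the server distinguishes the two cases, a client can disambiguate a 404 itself with a second probe: if GET /db/docid returns 404, check whether GET /db succeeds. A minimal sketch of that decision logic, taking the two status codes as inputs (the function name and return labels are invented for the example):

```python
def classify_404(doc_status, db_status):
    # doc_status: HTTP status of GET /db/docid
    # db_status:  HTTP status of a follow-up GET /db
    if doc_status != 404:
        return "found"          # the document request did not 404 at all
    if db_status == 200:
        return "missing_doc"    # database exists, only the doc is absent
    return "missing_db"         # the database itself is absent

# In the transcript above, both /test/not_existant and /test 404 at first:
print(classify_404(404, 404))  # missing_db
```

This costs an extra round trip per failed lookup, which is exactly why having the distinction in the 404 response body itself would be nicer.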
Re: Using HTTP headers
On Thu, Feb 19, 2009 at 3:42 PM, Noah Slater nsla...@apache.org wrote: On Thu, Feb 19, 2009 at 03:27:26PM -0800, Chris Anderson wrote: On Thu, Feb 19, 2009 at 3:19 PM, Chris Anderson jch...@apache.org wrote: ... it doesn't quite fit in the resource identifier either.

When you: PUT /db/docid?rev=R vs PUT /db/docid?rev=R&full_commit=true They are the same resource (so these URLs are misleading). But I kinda think full-commit should go in the URI anyway, just so that weak clients can use it.

But are they the same resource? Is it not up to the server to decide what resource to create, and how to create it, depending on the URI the client uses to PUT the representation? I suspect that using query strings with PUT methods like this is RESTful.

I'm basically ready to agree with you. -- Chris Anderson http://jchris.mfdz.com
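The two styles under discussion, full-commit as a query parameter versus as a request header, can be sketched by building both request forms. The parameter name follows the thread's "full_commit" wording, and the `X-Couch-Full-Commit` header name is an assumption for illustration, not a settled API:

```python
from urllib.parse import urlencode

def put_request(db, docid, rev, full_commit_in_query):
    # Build the path and headers for a document PUT, placing the
    # full-commit flag either in the query string (usable by weak
    # clients) or in a header (keeps the resource URI clean).
    path = f"/{db}/{docid}?" + urlencode({"rev": rev})
    headers = {"Content-Type": "application/json"}
    if full_commit_in_query:
        path += "&" + urlencode({"full_commit": "true"})
    else:
        headers["X-Couch-Full-Commit"] = "true"   # assumed header name
    return path, headers

path, headers = put_request("db", "docid", "1-abc", full_commit_in_query=False)
print(path)  # /db/docid?rev=1-abc
```

Either way the document body is unchanged, which is the crux of the thread: the flag modifies how the server commits, not what resource is created.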
Re: Partitioned Clusters
On Feb 20, 2009, at 1:55 PM, Stefan Karpinski wrote: Hi, I thought I'd introduce myself since I'm new here on the couchdb list. I'm Stefan Karpinski. I've worked in the Monitoring Group at Akamai, Operations R&D at Citrix Online, and I'm nearly done with a PhD in computer networking at the moment. So I guess I've thought about this kind of stuff a bit ;-)

I'm curious what the motivation behind a tree topology is. Not that it's not a viable approach, just why that and not a load-balancer in front of a bunch of leaves with lateral propagation between the leaves? Why should the load-balancing/proxying/caching node even be running couchdb? One reason I can see for a tree topology would be the hierarchical cache effect. But that would likely only make sense in certain circumstances. Being able to configure the topology to meet various needs, rather than enforcing one particular topology, makes more sense to me overall.

Trees would be overkill except for very large clusters. With CouchDB map views, you need to combine results from every node in a big merge sort. If you combine all results at a single node, that single node's ability to simultaneously pull data and sort data from all other nodes may become the bottleneck. So to parallelize, you have multiple nodes doing a merge sort of sub-nodes, then sending those results to another node to be combined further, etc. The same goes for the reduce views, but instead of a merge sort it's just rereducing results. The natural shape of that computation is a tree, with only the final root node at the top being the bottleneck, but now it has to maintain connections and merge the sort values from far fewer nodes. -Damien
User Auth
d...@couchdb, There's been much talk about how to store user accounts in CouchDB databases. Here are a few questions about the model, and a description of how the default handler works.

Currently Admin accounts are specified under config on a per-node basis. See couch_httpd:default_authentication_handler/1 for the implementation. By making your own alternate auth handlers, you can set up user_ctx for validation functions. To override couch_httpd:default_authentication_handler in your local install, edit your local.ini and add a line like:

authentication_handler = {couch_httpd, my_authentication_handler}

to the [httpd] section. Now you're ready to start hacking! I'd encourage you to reuse basic_username_pw(Req) and couch_server:is_admin(User, Pass) in your implementations.

Once you have database lookup for usernames happening, you'll have to make a decision about how to handle the case where the same username and password combo works for both user and admin accounts. Ideally you'd want to prevent collisions between admin and user accounts. But you can't very well prevent writes to the local.ini file, so... maybe they need different login endpoints. I'm not sure what to do here; this is a sticky bit.

I'm interested in loading user creds from an accounts database (specified on a per-node basis, for simplicity's sake). To handle encryption, we'll have to have a node-secret, which can probably be uniquely generated by Couch on boot if it doesn't exist (of course, overridable in local.ini).

Jason Davies has been working on a /_login screen and an untamperable cookie store, so that creds don't have to be loaded on each request. Client-state ftw! Untamperable cookies can be subject to session-stealing attacks, so they'll need to be run under SSL if security is important. There's still the database-admin role to account for, so let's not forget there are details unaccounted for here. Hoping to kick off the conversation. Chris -- Chris Anderson http://jchris.mfdz.com
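The lookup order an alternate handler might use, per-node admin config first, then an accounts database, can be sketched like this. All names, the plaintext password comparison, and the role fields are invented for the illustration (the real handler would be Erlang, and the thread notes passwords should be protected with a node-secret rather than compared in the clear):

```python
def authenticate(username, password, admins, accounts_db):
    # 1. Per-node admins from local.ini win first.
    if admins.get(username) == password:
        return {"name": username, "roles": ["_admin"]}
    # 2. Fall back to the accounts database. NOTE: plaintext comparison
    # is for illustration only; real creds would be hashed/encrypted.
    user = accounts_db.get(username)
    if user and user["password"] == password:
        return {"name": username, "roles": user.get("roles", [])}
    return None  # caller responds 401 Unauthorized

admins = {"root": "relax"}
accounts = {"alice": {"password": "s3cret", "roles": ["reader"]}}
print(authenticate("alice", "s3cret", admins, accounts))
```

Note the collision problem from the thread shows up directly in this ordering: if "alice" exists in both stores with the same password, the admin entry silently shadows the account record.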
Re: Partitioned Clusters
Trees would be overkill except for very large clusters. With CouchDB map views, you need to combine results from every node in a big merge sort. If you combine all results at a single node, that single node's ability to simultaneously pull data and sort data from all other nodes may become the bottleneck. So to parallelize, you have multiple nodes doing a merge sort of sub-nodes, then sending those results to another node to be combined further, etc. The same goes for the reduce views, but instead of a merge sort it's just rereducing results. The natural shape of that computation is a tree, with only the final root node at the top being the bottleneck, but now it has to maintain connections and merge the sort values from far fewer nodes. -Damien

That makes sense and it clarifies one of my questions about this topic. Is the goal of partitioned clustering to increase performance for very large data sets, or to increase reliability? It would seem from this answer that the goal is to increase query performance by distributing the query processing, and not to increase reliability.
Re: User Auth
I'm not entirely clear what level of user auth is being addressed here. On the one hand, there's the system-level sense of a user that traditional databases have: i.e. something equivalent to a UNIX user account, but in the database, which has to be created by an admin and can then be granted table-level access and various administrative rights (create user, create database, create table). On the other hand, there's the application-level sense of user: i.e. a record in a users table, which is given access or not given access to database records via the web application stack at a higher level, which sits between the database and the client's web browser (or whatever).

The current CouchDB notion of admin user seems to fall into the former category, while what most applications need falls into the latter category. One irritation of all application-level authentication schemes I've ever encountered is that the database does not give you any support for application-level user auth. If CouchApps are really going to be feasible, CouchDB (clearly) needs to solve the application-level user authentication problem. My sense is that the goal is to somehow merge the two senses of database user, and thereby cleave the Gordian knot in two. Is that sense correct?

On Fri, Feb 20, 2009 at 1:24 PM, Chris Anderson jch...@apache.org wrote: d...@couchdb, There's been much talk about how to store user accounts in CouchDB databases. There are a few questions about the model, and a description of how the default handler works. Currently Admin accounts are specified under config on a per-node basis. See couch_httpd:default_authentication_handler/1 for the implementation. By making your own alternate auth handlers, you can set up user_ctx for validation functions. To override couch_httpd:default_authentication_handler in your local install, edit your local.ini and add a line like: authentication_handler = {couch_httpd, my_authentication_handler} to the [httpd] section. 
Now you're ready to start hacking! I'd encourage you to reuse basic_username_pw(Req) and couch_server:is_admin(User, Pass) in your implementations. Once you have database lookup for usernames happening, you'll have to make a decision about how to handle the case where the same username and password combo works for both user and admin accounts. Ideally you'd want to prevent collisions between admin and user accounts. But you can't very well prevent writes to the local.ini file, so... maybe they need different login endpoints. I'm not sure what to do here; this is a sticky bit.

I'm interested in loading user creds from an accounts database (specified on a per-node basis, for simplicity's sake). To handle encryption, we'll have to have a node-secret, which can probably be uniquely generated by Couch on boot if it doesn't exist (of course, overridable in local.ini). Jason Davies has been working on a /_login screen and an untamperable cookie store, so that creds don't have to be loaded on each request. Client-state ftw! Untamperable cookies can be subject to session-stealing attacks, so they'll need to be run under SSL if security is important. There's still the database-admin role to account for, so let's not forget there are details unaccounted for here. Hoping to kick off the conversation. Chris -- Chris Anderson http://jchris.mfdz.com
Re: Partitioned Clusters
On Feb 20, 2009, at 4:37 PM, Stefan Karpinski wrote: Trees would be overkill except for very large clusters. With CouchDB map views, you need to combine results from every node in a big merge sort. If you combine all results at a single node, that single node's ability to simultaneously pull data and sort data from all other nodes may become the bottleneck. So to parallelize, you have multiple nodes doing a merge sort of sub-nodes, then sending those results to another node to be combined further, etc. The same goes for the reduce views, but instead of a merge sort it's just rereducing results. The natural shape of that computation is a tree, with only the final root node at the top being the bottleneck, but now it has to maintain connections and merge the sort values from far fewer nodes. -Damien

That makes sense and it clarifies one of my questions about this topic. Is the goal of partitioned clustering to increase performance for very large data sets, or to increase reliability? It would seem from this answer that the goal is to increase query performance by distributing the query processing, and not to increase reliability.

I see partitioning and clustering as 2 different things. Partitioning is data partitioning, spreading the data out across nodes, no node having the complete database. Clustering is nodes having the same, or nearly the same, data (they might be behind on replicating changes, but otherwise they have the same data). Partitioning would primarily increase write performance (updates happening concurrently on many nodes) and the size of the data set. Partitioning helps with client read scalability, but only for document reads, not view queries. Partitioning alone could reduce reliability, depending on how tolerant you are to missing portions of the database. Clustering would primarily address database reliability (failover), and address client read scalability for docs and views. 
Clustering doesn't help much with write performance because even if you spread out the update load, the replication as the cluster syncs up means every node gets the update anyway. It might be useful for dealing with update spikes, where you get a bunch of updates at once and can wait for the replication delay to get everyone synced back up. For a really big, really reliable database, I'd have clusters of partitions, where the database is partitioned N ways, with each partition having at least M identical cluster members. Increase N for larger databases and update load, M for higher availability and read load. -Damien
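The merge-sort/rereduce step Damien describes can be sketched in a few lines of Python. This is an illustrative model of what a proxy node in the tree would do, not code from any of the projects discussed; shard rows are assumed to arrive already sorted by key:

```python
import heapq

def merge_shard_rows(*shard_rows):
    """Merge already key-sorted map-view rows from several shards into
    one globally sorted result, as a proxy node in the tree would do.
    Each shard contributes an iterable of (key, value) pairs."""
    return list(heapq.merge(*shard_rows, key=lambda row: row[0]))

def rereduce(partials):
    """Combine partial reduce values coming up from sub-nodes. For a
    sum-style reduce, the rereduce step is just another sum."""
    return sum(partials)
```

Each interior node of the tree runs the same two functions over its children's outputs, so only the root ever touches the fully merged stream, and it merges far fewer inputs than a flat fan-in would require.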
Re: Partitioned Clusters
On Fri, Feb 20, 2009 at 1:37 PM, Stefan Karpinski stefan.karpin...@gmail.com wrote: That makes sense and it clarifies one of my questions about this topic. Is the goal of partitioned clustering to increase performance for very large data sets, or to increase reliability? It would seem from this answer that the goal is to increase query performance by distributing the query processing, and not to increase reliability. Data redundancy is taken care of orthogonally to partitioning. Each node will be able to handle maintaining N hot-failover backups. Whether the database is hosted on a single large node or partitioned among many small ones, the redundancy story is the same. Partitioning becomes useful when either the total update rate is greater than the hard-disk throughput of a single node, or the stored capacity is better managed by multiple disks. By spreading write load across nodes you can achieve greater throughput. View queries must be sent to every node, so having docs partitioned also allows views to be calculated in parallel. It will be interesting to see if it makes sense to partition small databases across hundreds of nodes in the interest of performance. Chris -- Chris Anderson http://jchris.mfdz.com
[jira] Created: (COUCHDB-263) require valid user for all database operations
require valid user for all database operations -- Key: COUCHDB-263 URL: https://issues.apache.org/jira/browse/COUCHDB-263 Project: CouchDB Issue Type: Improvement Components: HTTP Interface Affects Versions: 0.9 Environment: All platforms. Reporter: Jack Moffitt Priority: Minor Attachments: couchauth.diff Admin accounts currently restrict a few operations, but leave all other operations completely open. Many use cases will require all operations to be authenticated. This can certainly be done by overriding the default_authentication_handler, but I think this very common use case can be handled in default_authentication_handler without increasing the complexity much. Attached is a patch which adds a new config option, require_valid_user, which restricts all operations to authenticated users only. Since CouchDB currently only has admins, this means that all operations are restricted to admins. In a future CouchDB where there are also normal users, the intention is that this would let them pass through as well. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
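For reference, the option this patch adds would be switched on from the ini file. A sketch, assuming the section name from the follow-up note that the patch targets the httpd config section:

```ini
[httpd]
; reject unauthenticated requests for every operation
; (with only admin accounts existing, this restricts everything to admins)
require_valid_user = true
```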
[jira] Updated: (COUCHDB-263) require valid user for all database operations
[ https://issues.apache.org/jira/browse/COUCHDB-263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jack Moffitt updated COUCHDB-263: - Attachment: couchauth.diff Patch to add require_valid_user to httpd config section. require valid user for all database operations -- Key: COUCHDB-263 URL: https://issues.apache.org/jira/browse/COUCHDB-263 Project: CouchDB Issue Type: Improvement Components: HTTP Interface Affects Versions: 0.9 Environment: All platforms. Reporter: Jack Moffitt Priority: Minor Attachments: couchauth.diff Admin accounts currently restrict a few operations, but leave all other operations completely open. Many use cases will require all operations to be authenticated. This can certainly be done by overriding the default_authentication_handler, but I think this very common use case can be handled in default_authentication_handler without increasing the complexity much. Attached is a patch which adds a new config option, require_valid_user, which restricts all operations to authenticated users only. Since CouchDB currently only has admins, this means that all operations are restricted to admins. In a future CouchDB where there are also normal users, the intention is that this would let them pass through as well. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: Partitioned Clusters
On Fri, Feb 20, 2009 at 2:45 PM, Damien Katz dam...@apache.org wrote: On Feb 20, 2009, at 4:37 PM, Stefan Karpinski wrote: Trees would be overkill except for very large clusters. With CouchDB map views, you need to combine results from every node in a big merge sort. If you combine all results at a single node, that single node's ability to simultaneously pull and sort data from all other nodes may become the bottleneck. So to parallelize, you have multiple nodes doing a merge sort of sub-nodes, then sending those results to another node to be combined further, etc. The same goes for the reduce views, but instead of a merge sort it's just rereducing results. The natural shape of that computation is a tree, with only the final root node at the top being the bottleneck, but now it has to maintain connections and merge the sort values from far fewer nodes. -Damien That makes sense and it clarifies one of my questions about this topic. Is the goal of partitioned clustering to increase performance for very large data sets, or to increase reliability? It would seem from this answer that the goal is to increase query performance by distributing the query processing, and not to increase reliability. I see partitioning and clustering as two different things. Partitioning is data partitioning, spreading the data out across nodes, with no node having the complete database. Clustering is nodes having the same, or nearly the same, data (they might be behind on replicating changes, but otherwise they have the same data). Partitioning would primarily increase write performance (updates happening concurrently on many nodes) and the size of the data set. Partitioning helps with client read scalability, but only for document reads, not view queries. Partitioning alone could reduce reliability, depending on how tolerant you are to missing portions of the database. 
Clustering would primarily address database reliability (failover) and client read scalability for docs and views. Clustering doesn't help much with write performance because even if you spread out the update load, the replication as the cluster syncs up means every node gets the update anyway. It might be useful for dealing with update spikes, where you get a bunch of updates at once and can wait for the replication delay to get everyone synced back up. For a really big, really reliable database, I'd have clusters of partitions, where the database is partitioned N ways, with each partition having at least M identical cluster members. Increase N for larger databases and update load, M for higher availability and read load. Thanks for the clarification. Can you say anything about how you see rebalancing working? -- Chris Anderson http://jchris.mfdz.com
Re: User Auth
On Fri, Feb 20, 2009 at 1:51 PM, Stefan Karpinski stefan.karpin...@gmail.com wrote: I'm not entirely clear what level of user auth is being addressed here. On the one hand, there's the system-level sense of a user that traditional databases have: i.e. something equivalent to a UNIX user account, but in the database, which has to be created by an admin and can then be granted table-level access and various administrative rights (create user, create database, create table). On the other hand, there's the application-level sense of user: i.e. a record in a users table, which is given access or not given access to database records via the web application stack at a higher level, which sits between the database and the client's web browser (or whatever). The current CouchDB notion of admin user seems to fall into the former category, while what most applications need falls into the latter category. One irritation of all application-level authentication schemes I've ever encountered is that the database does not give you any support for application-level user auth. If CouchApps are really going to be feasible, CouchDB (clearly) needs to solve the application-level user authentication problem. My sense is that the goal is to somehow merge the two senses of database user, and thereby cleave the Gordian knot in two. Is that sense correct? I wish I could say we've got such a clear picture of it. The easiest way to cleave the knot is probably to rely on 3rd party auth like OpenID or OAuth (I don't quite know which parts of which we're interested in). Identifying users as URLs would make things easier on application devs, I think. If every app will need to implement something like this, it makes sense to me to have the CouchDB host manage the session, even if apps can keep their own user-preferences docs if they wish. Being logged into all the apps on a node seems a lot more intuitive than having to create accounts for each one. 
If the user is identified with a URL, then preferences etc can be replicated to other hosts while everything just works. Thanks for the feedback! -- Chris Anderson http://jchris.mfdz.com
Re: User Auth
I wish I could say we've got such a clear picture of it. Good to get in on the planning stages! The easiest way to cleave the knot is probably to rely on 3rd party auth like OpenID or OAuth (I don't quite know which parts of which we're interested in). OpenID is great, but I don't think it's viable to force people to use it. Identifying users as URLs would make things easier on application devs, I think. If every app will need to implement something like this, it makes sense to me to have the CouchDB host manage the session, even if apps can keep their own user-preferences docs if they wish. Being logged into all the apps on a node seems a lot more intuitive than having to create accounts for each one. If the user is identified with a URL, then preferences etc can be replicated to other hosts while everything just works. I think that nailing this problem would go a *long* way towards making CouchDB popular not only for its nice distributed properties and such, but also because it would make writing modern web apps drastically easier. Because literally *every* non-trivial web application needs to do user authentication. Having it _just work_ without having to worry about it is a massive win. Moreover, if the database was actually aware of application-level authentication and could enforce it, then it would increase the security of CouchDB-based web apps. Errors in business logic would be much less likely to accidentally expose data. How easy is it to forget in Rails that you need to filter the objects in some table by the user_id field? On Fri, Feb 20, 2009 at 3:01 PM, Chris Anderson jch...@apache.org wrote: On Fri, Feb 20, 2009 at 1:51 PM, Stefan Karpinski stefan.karpin...@gmail.com wrote: I'm not entirely clear what level of user auth is being addressed here. On the one hand, there's the system-level sense of a user that traditional databases have: i.e. 
something equivalent to a UNIX user account, but in the database, which has to be created by an admin and can then be granted table-level access and various administrative rights (create user, create database, create table). On the other hand, there's the application-level sense of user: i.e. a record in a users table, which is given access or not given access to database records via the web application stack at a higher level, which sits between the database and the client's web browser (or whatever). The current CouchDB notion of admin user seems to fall into the former category, while what most applications need falls into the latter category. One irritation of all application-level authentication schemes I've ever encountered is that the database does not give you any support for application-level user auth. If CouchApps are really going to be feasible, CouchDB (clearly) needs to solve the application-level user authentication problem. My sense is that the goal is to somehow merge the two senses of database user, and thereby cleave the Gordian knot in two. Is that sense correct? I wish I could say we've got such a clear picture of it. The easiest way to cleave the knot is probably to rely on 3rd party auth like OpenID or OAuth (I don't quite know which parts of which we're interested in). Identifying users as URLs would make things easier on application devs, I think. If every app will need to implement something like this, it makes sense to me to have the CouchDB host manage the session, even if apps can keep their own user-preferences docs if they wish. Being logged into all the apps on a node seems a lot more intuitive than having to create accounts for each one. If the user is identified with a URL, then preferences etc can be replicated to other hosts while everything just works. Thanks for the feedback! -- Chris Anderson http://jchris.mfdz.com
[jira] Updated: (COUCHDB-260) Support for reduce views in _list
[ https://issues.apache.org/jira/browse/COUCHDB-260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Davies updated COUCHDB-260: - Attachment: list_reduce_views.2.diff Updated patch with fix for handling {stop: true} (thanks, Paul Davis) and more tests. Support for reduce views in _list - Key: COUCHDB-260 URL: https://issues.apache.org/jira/browse/COUCHDB-260 Project: CouchDB Issue Type: Bug Components: HTTP Interface Reporter: Jason Davies Priority: Blocker Fix For: 0.9 Attachments: list_reduce_views.2.diff, list_reduce_views.diff The awesomeness of _list needs the awesomeness of reduce views. Patch to follow. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
Re: User Auth
Hi Stefan, good to have you on board, On 21 Feb 2009, at 00:16, Stefan Karpinski wrote: I think that nailing this problem would go a *long* way towards making CouchDB popular not only for its nice distributed properties and such, but also because it would make writing modern web apps drastically easier. Because literally *every* non-trivial web application needs to do user authentication. Having it _just work_ without having to worry about it is a massive win. Moreover, if the database was actually aware of application-level authentication and could enforce it, then it would increase the security of CouchDB-based web apps. Errors in business logic would be much less likely to accidentally expose data. How easy is it to forget in Rails that you need to filter the objects in some table by the user_id field? My thoughts* exactly! * http://markmail.org/message/thqtiuz3a5hr2ngd Cheers Jan -- On Fri, Feb 20, 2009 at 3:01 PM, Chris Anderson jch...@apache.org wrote: On Fri, Feb 20, 2009 at 1:51 PM, Stefan Karpinski stefan.karpin...@gmail.com wrote: I'm not entirely clear what level of user auth is being addressed here. On the one hand, there's the system-level sense of a user that traditional databases have: i.e. something equivalent to a UNIX user account, but in the database, which has to be created by an admin and can then be granted table-level access and various administrative rights (create user, create database, create table). On the other hand, there's the application-level sense of user: i.e. a record in a users table, which is given access or not given access to database records via the web application stack at a higher level, which sits between the database and the client's web browser (or whatever). The current CouchDB notion of admin user seems to fall into the former category, while what most applications need falls into the latter category. 
One irritation of all application-level authentication schemes I've ever encountered is that the database does not give you any support for application-level user auth. If CouchApps are really going to be feasible, CouchDB (clearly) needs to solve the application-level user authentication problem. My sense is that the goal is to somehow merge the two senses of database user, and thereby cleave the Gordian knot in two. Is that sense correct? I wish I could say we've got such a clear picture of it. The easiest way to cleave the knot is probably to rely on 3rd party auth like OpenID or OAuth (I don't quite know which parts of which we're interested in). Identifying users as URLs would make things easier on application devs, I think. If every app will need to implement something like this, it makes sense to me to have the CouchDB host manage the session, even if apps can keep their own user-preferences docs if they wish. Being logged into all the apps on a node seems a lot more intuitive than having to create accounts for each one. If the user is identified with a URL, then preferences etc can be replicated to other hosts while everything just works. Thanks for the feedback! -- Chris Anderson http://jchris.mfdz.com
Re: User Auth
Thoughts (just brainstorming here): I think it makes sense to separate authentication and permissions. Pure authentication is just about verifying that the user is who they claim to be. Permissions are about deciding which users are allowed to see or do what. Cleanly separating is good: ideally you should be able to completely swap out your authentication mechanism, switching from, say, basic auth to SSL + digest auth, and keep the application logic about who gets access to what completely unchanged. For example, Apache accomplishes this by doing whatever authentication it's doing and then passing the REMOTE_USER environment variable (http://httpd.apache.org/docs/1.3/misc/FAQ-F.html#remote-user-var) with the authenticated user name. Whatever CGI or variant thereof (FCGI, etc.) then just does whatever it sees fit to do with that user name. Also, authentication is typically slow: even hashing a user/pass combo takes some CPU; this is not something that you want to have done on every request. That's why the notion of user sessions exists (at least from the security perspective; there are other notions of session). That argues for having the CouchDB server process manage authentication and letting the application developer define custom functions for deciding whether (user, resource) pairs are acceptable or not. I.e. the CouchDB process somehow validates that the request is coming from someone who has provided adequate proof that they are who they claim to be, via HTTP basic/digest auth or whatever. Then the application can just decide whether the pre-authenticated user is allowed to access a particular resource. More thoughts coming... On Fri, Feb 20, 2009 at 3:16 PM, Stefan Karpinski stefan.karpin...@gmail.com wrote: I wish I could say we've got such a clear picture of it. Good to get in on the planning stages! The easiest way to cleave the knot is probably to rely on 3rd party auth like OpenID or OAuth (I don't quite know which parts of which we're interested in). 
OpenID is great, but I don't think it's viable to force people to use it. Identifying users as URLs would make things easier on application devs, I think. If every app will need to implement something like this, it makes sense to me to have the CouchDB host manage the session, even if apps can keep their own user-preferences docs if they wish. Being logged into all the apps on a node seems a lot more intuitive than having to create accounts for each one. If the user is identified with a URL, then preferences etc can be replicated to other hosts while everything just works. I think that nailing this problem would go a *long* way towards making CouchDB popular not only for its nice distributed properties and such, but also because it would make writing modern web apps drastically easier. Because literally *every* non-trivial web application needs to do user authentication. Having it _just work_ without having to worry about it is a massive win. Moreover, if the database was actually aware of application-level authentication and could enforce it, then it would increase the security of CouchDB-based web apps. Errors in business logic would be much less likely to accidentally expose data. How easy is it to forget in Rails that you need to filter the objects in some table by the user_id field? On Fri, Feb 20, 2009 at 3:01 PM, Chris Anderson jch...@apache.org wrote: On Fri, Feb 20, 2009 at 1:51 PM, Stefan Karpinski stefan.karpin...@gmail.com wrote: I'm not entirely clear what level of user auth is being addressed here. On the one hand, there's the system-level sense of a user that traditional databases have: i.e. something equivalent to a UNIX user account, but in the database, which has to be created by an admin and can then be granted table-level access and various administrative rights (create user, create database, create table). On the other hand, there's the application-level sense of user: i.e. 
a record in a users table, which is given access or not given access to database records via the web application stack at a higher level, which sits between the database and the client's web browser (or whatever). The current CouchDB notion of admin user seems to fall into the former category, while what most applications need falls into the latter category. One irritation of all application-level authentication schemes I've ever encountered is that the database does not give you any support for application-level user auth. If CouchApps are really going to be feasible, CouchDB (clearly) needs to solve the application-level user authentication problem. My sense is that the goal is to somehow merge the two senses of database user, and thereby cleave the Gordian knot in two. Is that sense correct? I wish I could say we've got such a clear picture of it. The easiest way to cleave the knot is probably to rely on 3rd party auth like OpenID or OAuth (I don't quite know
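The separation Stefan argues for (authentication first, permissions as a pure function of who-plus-what) can be reduced to a check that never touches credentials: authentication happens earlier and hands over a verified name, in the spirit of REMOTE_USER. A hypothetical Python sketch; the function name and ACL shape are invented for illustration:

```python
def is_allowed(user, resource, acl):
    """Pure permission check, deliberately separate from authentication.

    'user' is whoever the authentication layer already verified (think
    REMOTE_USER); 'acl' maps a resource name to the set of users allowed
    to touch it. Swapping basic auth for SSL + digest auth upstream
    leaves this function completely unchanged.
    """
    return user in acl.get(resource, set())
```

Because the check takes only (user, resource) pairs, the slow part (hashing a user/pass combo) can be done once per session while this cheap function runs on every request.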
Re: New CouchDB Committers
On 18 Feb 2009, at 20:18, Damien Katz wrote: For their ongoing contributions to Apache CouchDB and its community. I am pleased to announce two new committers, Paul Davis and Adam Kocoloski. Thank you both for your excellent work, we all love what you've been doing. Now we want you to do it even more ;) Welcome Paul and Adam! Cheers Jan --
Re: User Auth
On Fri, Feb 20, 2009 at 3:48 PM, Stefan Karpinski stefan.karpin...@gmail.com wrote: Thoughts (just brainstorming here): I think it makes sense to separate authentication and permissions. Pure authentication is just about verifying that the user is who they claim to be. Permissions are about deciding which users are allowed to see or do what. Cleanly separating is good: ideally you should be able to completely swap out your authentication mechanism, switching from, say basic auth to SSL + digest auth, and keep the application logic about who gets access to what completely unchanged. For example, Apache accomplishes this by doing whatever authentication it's doing and then passing the REMOTE_USER environment variable (http://httpd.apache.org/docs/1.3/misc/FAQ-F.html#remote-user-var) with the authenticated user name. Whatever CGI or variant thereof (FCGI, etc.) then just does whatever it sees fit to do with that user name. Also, authentication is typically slow: even hashing a user/pass combo takes some CPU — this is not something that you want to have done on every request. That's why the notion of user sessions exists (at least from the security perspective; there are other notions of session). That argues for having the CouchDB server process manage authentication and letting the application developer define custom functions for deciding whether (user,resource) pairs are acceptable or not. I.e. the CouchDB process somehow validates that the request is coming from someone who has provided adequate proof that they are who they claim to be, via HTTP basic/digest auth or whatever. Then the application can just decide whether the pre-authenticated user is allowed to access a particular resource. This is pretty much how it works now. CouchDB manages sending the user_ctx object into the validation function. The user_ctx object then lets the function know if the user is an admin or in any other groups. 
Then the validation function may accept or reject the update accordingly. There is a related issue about how to let the client application know which user they are validated as, so that they can correctly fill out author fields etc. More thoughts coming... On Fri, Feb 20, 2009 at 3:16 PM, Stefan Karpinski stefan.karpin...@gmail.com wrote: I wish I could say we've got such a clear picture of it. Good to get in on the planning stages! The easiest way to cleave the knot is probably to rely on 3rd party auth like OpenID or OAuth (I don't quite know which parts of which we're interested in). OpenID is great, but I don't think it's viable to force people to use it. Identifying users as URLs would make things easier on application devs, I think. If every app will need to implement something like this, it makes sense to me to have the CouchDB host manage the session, even if apps can keep their own user-preferences docs if they wish. Being logged into all the apps on a node seems a lot more intuitive than having to create accounts for each one. If the user is identified with a URL, then preferences etc can be replicated to other hosts while everything just works. I think that nailing this problem would go a *long* way towards making CouchDB popular not only for it's nice distributed properties and such, but also because would make writing modern web apps drastically easier. Because literally *every* non-trivial web application needs to do user authentication. Having it _just work_ without having to worry about it is a massive win. Moreover, if the database was actually aware of application-level authentication and could enforce it, then it would increase the security of CouchDB-based web apps. Errors in business logic would be much less likely to accidentally expose data. How easy is it to forget in Rails that you need to filter the objects in some table by the user_id field? 
On Fri, Feb 20, 2009 at 3:01 PM, Chris Anderson jch...@apache.org wrote: On Fri, Feb 20, 2009 at 1:51 PM, Stefan Karpinski stefan.karpin...@gmail.com wrote: I'm not entirely clear what level of user auth is being addressed here. On the one hand, there's the system-level sense of a user that traditional databases have: i.e. something equivalent to a UNIX user account, but in the database, which has to be created by an admin and can then be granted table-level access and various administrative rights (create user, create database, create table). On the other hand, there's the application-level sense of user: i.e. a record in a users table, which is given access or not given access to database records via the web application stack at a higher level, which sits between the database and the client's web browser (or whatever). The current CouchDB notion of admin user seems to fall into the former category, while what most applications need falls into the latter category. One irritation of all application-level authentication schemes I've ever
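The user_ctx flow Chris describes above can be modeled in a few lines. Real CouchDB validation functions are JavaScript stored in a design document; this Python sketch only mirrors the semantics (raise to reject the update, return normally to accept it), and the author-must-match rule is an invented example policy:

```python
def validate_doc_update(new_doc, old_doc, user_ctx):
    """Python model of a CouchDB validation function. CouchDB passes
    user_ctx, carrying the authenticated user's name and roles; the
    function raises to reject a write and returns to accept it."""
    if "_admin" in user_ctx.get("roles", []):
        return  # server admins may write anything
    if new_doc.get("author") != user_ctx.get("name"):
        raise PermissionError("author field must match the logged-in user")
```

The open question from the thread (letting the client know which user it validated as, so it can fill in author fields correctly) sits outside this hook; the hook can only reject documents after the fact.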
Re: New CouchDB Committers
Congratulations, Paul and Adam! geir On Feb 18, 2009, at 2:18 PM, Damien Katz wrote: For their ongoing contributions to Apache CouchDB and its community. I am pleased to announce two new committers, Paul Davis and Adam Kocoloski. Thank you both for your excellent work, we all love what you've been doing. Now we want you to do it even more ;) -Damien
Re: Partitioned Clusters
Hi, I don't think I've commented on this list before so let me briefly introduce myself. I'm Mike Malone. I live in San Francisco. I'm a developer (primarily web dev) and have some experience working with large clustered databases. I worked for Pownce.com, but moved to Six Apart when they acquired Pownce in November 2008. I like the idea of a tree structure since it's simple to understand and implement, but I think there may be cases where having multiple top-level proxies makes sense. As Damien pointed out, the top-level proxies will need to re-reduce / merge the documents from each partition, which may become a bottleneck. Damien pointed out how a tree structure would help to mitigate this problem by moving some of the work to sub-nodes. But couldn't you also add additional top-level proxies (with clients randomly choosing one to communicate with) to increase capacity without requiring a tree structure? This would also remove the top-level proxy as a single point of failure for the system. Mike On Fri, Feb 20, 2009 at 2:55 PM, Chris Anderson jch...@apache.org wrote: On Fri, Feb 20, 2009 at 2:45 PM, Damien Katz dam...@apache.org wrote: On Feb 20, 2009, at 4:37 PM, Stefan Karpinski wrote: Trees would be overkill except for very large clusters. With CouchDB map views, you need to combine results from every node in a big merge sort. If you combine all results at a single node, that single node's ability to simultaneously pull and sort data from all other nodes may become the bottleneck. So to parallelize, you have multiple nodes doing a merge sort of sub-nodes, then sending those results to another node to be combined further, etc. The same goes for the reduce views, but instead of a merge sort it's just rereducing results. The natural shape of that computation is a tree, with only the final root node at the top being the bottleneck, but now it has to maintain connections and merge the sort values from far fewer nodes. 
-Damien That makes sense and it clarifies one of my questions about this topic. Is the goal of partitioned clustering to increase performance for very large data sets, or to increase reliability? It would seem from this answer that the goal is to increase query performance by distributing the query processing, and not to increase reliability. I see partitioning and clustering as two different things. Partitioning is data partitioning, spreading the data out across nodes, with no node having the complete database. Clustering is nodes having the same, or nearly the same, data (they might be behind on replicating changes, but otherwise they have the same data). Partitioning would primarily increase write performance (updates happening concurrently on many nodes) and the size of the data set. Partitioning helps with client read scalability, but only for document reads, not view queries. Partitioning alone could reduce reliability, depending on how tolerant you are to missing portions of the database. Clustering would primarily address database reliability (failover) and client read scalability for docs and views. Clustering doesn't help much with write performance because even if you spread out the update load, the replication as the cluster syncs up means every node gets the update anyway. It might be useful for dealing with update spikes, where you get a bunch of updates at once and can wait for the replication delay to get everyone synced back up. For a really big, really reliable database, I'd have clusters of partitions, where the database is partitioned N ways, with each partition having at least M identical cluster members. Increase N for larger databases and update load, M for higher availability and read load. Thanks for the clarification. Can you say anything about how you see rebalancing working? -- Chris Anderson http://jchris.mfdz.com
Re: Partitioned Clusters
On Fri, Feb 20, 2009 at 4:15 PM, Mike Malone mjmal...@gmail.com wrote: Hi, I don't think I've commented on this list before so let me briefly introduce myself. I'm Mike Malone. I live in San Francisco. I'm a developer (primarily web dev) and have some experience working with large clustered databases. I worked for Pownce.com, but moved to Six Apart when they acquired Pownce in November 2008. I like the idea of a tree-structure since it's simple to understand and implement, but I think there may be cases where having multiple top-level proxies may make sense. I think so. I think that there could be proxy overlap / redundancy across all levels of the tree, and also in the case of a flat tree. As long as the proxies agree on how to hash from URLs to nodes it should just work. -- Chris Anderson http://jchris.mfdz.com
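The "proxies agree on how to hash from URLs to nodes" requirement is exactly what consistent hashing provides: any proxy that builds the ring from the same node list maps a given doc URL to the same shard, with no coordination between proxies. A minimal sketch (not code from Sectional or The Lounge; class and node names are illustrative):

```python
import hashlib
from bisect import bisect

class ConsistentHashRing:
    """Minimal consistent-hash ring. Two proxies constructed from the
    same node list produce identical rings, so redundant top-level
    proxies agree on URL-to-node placement without talking to each
    other. 'replicas' virtual points per node smooth the distribution."""

    def __init__(self, nodes, replicas=64):
        self.ring = sorted(
            (self._hash(f"{node}:{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value):
        # md5 used only as a stable, well-spread hash, not for security
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, url):
        # first ring point clockwise from the URL's hash owns the doc
        idx = bisect(self.keys, self._hash(url)) % len(self.ring)
        return self.ring[idx][1]
```

A nice side effect is that adding or removing one node only remaps the keys adjacent to its ring points, rather than reshuffling the whole keyspace the way modulo hashing would.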
Re: Couch clustering/partitioning Re: CouchSpray - Thoughts?
Due to one of the key people being sick, we pushed our meeting to discuss releasing the code to Monday. I'll send out an update then.

On Fri, Feb 20, 2009 at 2:17 AM, Jan Lehnardt j...@apache.org wrote: On 20 Feb 2009, at 02:34, Shaun Lindsay wrote: Hi all, So, a couple months ago we implemented almost exactly the couch clustering/partitioning solution described below. Shaun, this sounds fantastic! :) I hope you can release the code for this. Cheers Jan --

The couch cluster (which we called 'The Lounge') sits behind nginx running a custom module that farms out the GETs and PUTs to the appropriate node/shard, and the views to a Python proxy daemon which handles reducing the view results from the individual shards and returning the full view. We have replication working between the cluster nodes, so the shards exist in multiple places and, in the case of one of the nodes going down, the various proxies fail over to the backup shards.

This clustering setup has been running in full production for several months now with minimal problems. We're looking to release all the code back to the community, but we need to clear it with our legal team first to make sure we're not compromising any of our more business-specific, proprietary code. In total, we have: an nginx module specifically set up for sharding databases; a 'smartproxy', written in Python/Twisted, for sharding views; and a few other ancillary pieces (replication notification, view updating, etc.).

Mainly, I just wanted to keep people from duplicating the work we've done -- hopefully we can release something back to the community in the next several weeks. We're having a meeting tomorrow morning to figure out what we can release right now (probably the nginx module, at the least). I'll let everyone know what our timeline looks like.
--Shaun Lindsay Meebo.com

On Thu, Feb 19, 2009 at 4:48 PM, Chris Anderson jch...@apache.org wrote: On Thu, Feb 19, 2009 at 4:35 PM, Ben Browning ben...@gmail.com wrote:

So, I started thinking about partitioning with CouchDB and realized that since views are just map/reduce, we can do some magic that's harder if not impossible with other database systems. The idea in a nutshell is to create a proxy that sits in front of multiple servers and sprays the view queries to all servers, merging the results -- hence CouchSpray. This would give us storage and processing scalability and could, with some extra logic, provide data redundancy and failover.

There are plans in CouchDB's future to take care of data partitioning, as well as querying views from a cluster. Theoretically, it should be pretty simple. There are a few small projects that have started down the road of writing code in this area.

https://code.launchpad.net/~dreid/sectional/trunk

Sectional is an Erlang HTTP proxy that implements consistent hashing for docs. I'm not sure how it handles view queries. There's also a project to provide partitioning around the basic key/value PUT and GET store using nginx: http://github.com/dysinger/nginx/tree/nginx_upstream_hash

If you're interested in digging into this stuff, please join d...@. We plan to include clustering in CouchDB, so if you're interested in implementing it, we could use your help.

Chris -- Chris Anderson http://jchris.mfdz.com
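The scatter/gather idea Ben describes — spray the view query to every shard, then merge the sorted results and re-reduce — can be sketched roughly like this. The row shape and reduce function are simplified assumptions for illustration, not the CouchSpray or Lounge smartproxy implementation:

```python
import heapq

def merge_view_rows(shard_results):
    """Merge the already-sorted rows returned by each shard into one
    globally sorted row list, as a scatter/gather proxy would."""
    return list(heapq.merge(*shard_results, key=lambda row: row["key"]))

def rereduce(merged_rows, reduce_fn):
    """Combine per-shard reduce values that share a key. Assumes the
    reduce function (like CouchDB's _sum) is associative, so reducing
    partial results gives the same answer as reducing everything."""
    out = {}
    for row in merged_rows:
        k = row["key"]
        out[k] = reduce_fn(out[k], row["value"]) if k in out else row["value"]
    return out

# Hypothetical usage: two shards each answered the same view query.
shard_a = [{"key": "a", "value": 1}, {"key": "b", "value": 2}]
shard_b = [{"key": "a", "value": 3}]
totals = rereduce(merge_view_rows([shard_a, shard_b]), lambda x, y: x + y)
```

The need for an associative (re-reducible) reduce function is the one real constraint this design places on view authors.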
Re: Couch clustering/partitioning Re: CouchSpray - Thoughts?
On Fri, Feb 20, 2009 at 4:53 PM, Shaun Lindsay sh...@meebo.com wrote: Due to one of the key people being sick, we pushed our meeting to discuss releasing the code to Monday. I'll send out an update then. Thanks for keeping us in the loop! Hope you all are feeling better. -- Chris Anderson http://jchris.mfdz.com
Re: User Auth
I apologize in advance that I'm going to digress a bit here on the shittiness of standard authentication mechanisms...

[ begin rant ]

Ultimately, support for better authentication than HTTP basic/digest auth would be nice. Basic auth is terribly insecure. Like ridiculously so. Digest auth is better, but still pretty darned insecure. SSL + digest auth goes a long way towards fixing the situation, but SSL is a deployment nightmare, costs money, hogs resources, and makes any sort of transparent proxying impossible. There are a billion cookie-based solutions that are almost all nearly as insecure as digest auth or worse.

In most respects, the ideal authentication scheme for almost all web applications would be the Secure Remote Password protocol (SRP), developed at Stanford (http://srp.stanford.edu/whatisit.html, http://en.wikipedia.org/wiki/Secure_remote_password_protocol). I was going to explain, but their website says it much better: [SRP] solves the problem of authenticating clients to servers securely, in cases where the user of the client software must memorize a small secret (like a password) and carries no other secret information, and where the server carries a verifier for each user, which allows it to authenticate the client but which, if compromised, would not allow the attacker to impersonate the client. In addition, SRP exchanges a cryptographically-strong secret as a byproduct of successful authentication, which enables the two parties to communicate securely.

[ end rant ]

So, yeah. Allowing pluggable authentication mechanisms would be sweet. Since the primary target is RIAs, where the security protocols can be implemented in JavaScript or ActionScript, it should be entirely possible to allow for alternate authentication without browsers having to change.

On Fri, Feb 20, 2009 at 3:48 PM, Stefan Karpinski stefan.karpin...@gmail.com wrote: Thoughts (just brainstorming here): I think it makes sense to separate authentication and permissions.
Pure authentication is just about verifying that the user is who they claim to be. Permissions are about deciding which users are allowed to see or do what. Cleanly separating is good: ideally you should be able to completely swap out your authentication mechanism, switching from, say, basic auth to SSL + digest auth, and keep the application logic about who gets access to what completely unchanged. For example, Apache accomplishes this by doing whatever authentication it's doing and then passing the REMOTE_USER environment variable (http://httpd.apache.org/docs/1.3/misc/FAQ-F.html#remote-user-var) with the authenticated user name. The CGI script (or variant thereof: FCGI, etc.) then does whatever it sees fit with that user name.

Also, authentication is typically slow: even hashing a user/pass combo takes some CPU — this is not something that you want to have done on every request. That's why the notion of user sessions exists (at least from the security perspective; there are other notions of session).

That argues for having the CouchDB server process manage authentication and letting the application developer define custom functions for deciding whether (user, resource) pairs are acceptable or not. I.e. the CouchDB process somehow validates that the request is coming from someone who has provided adequate proof that they are who they claim to be, via HTTP basic/digest auth or whatever. Then the application can just decide whether the pre-authenticated user is allowed to access a particular resource.

More thoughts coming...
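For reference, the server-side setup that gives SRP the property Stefan quotes — a stolen verifier does not let the attacker impersonate the client — looks roughly like this. The group parameters below are toy values for illustration only; a real deployment would use the large safe-prime groups from RFC 5054 through a vetted library, never hand-rolled crypto:

```python
import hashlib
import os

# Toy group parameters, for illustration ONLY (not a safe SRP prime).
N = 0xE95E4A5F737059DC60DFC7AD95B3D8139515620F
g = 2

def H(*parts: bytes) -> int:
    """Hash byte strings to an integer (SRP uses a hash like this throughout)."""
    return int(hashlib.sha1(b"".join(parts)).hexdigest(), 16)

def make_verifier(username: str, password: str):
    """Enrollment: the server stores (salt, verifier), never the password.
    The verifier v = g^x mod N is a one-way function of the password, so a
    database compromise reveals nothing directly usable for impersonation."""
    salt = os.urandom(16)
    x = H(salt, hashlib.sha1(f"{username}:{password}".encode()).digest())
    return salt, pow(g, x, N)
```

The subsequent login exchange (omitted here) proves knowledge of the password against this verifier and derives a shared session key as a byproduct, which is the "cryptographically-strong secret" the SRP site mentions.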
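Stefan's separation of authentication from permissions can be sketched as two independent layers, in the spirit of Apache's REMOTE_USER handoff. The header name and ACL structure here are hypothetical illustrations, not CouchDB's API:

```python
def authenticate(request_headers):
    """Authentication layer: whatever mechanism ran upstream (basic auth,
    SSL client certs, ...) reduces to just an authenticated user name --
    the moral equivalent of Apache's REMOTE_USER. Returns None if the
    request carried no proof of identity."""
    return request_headers.get("X-Authenticated-User")

# Authorization layer: an app-defined rule over (user, resource) pairs,
# completely independent of how the user was authenticated.
ACL = {
    "alice": {"db/invoices", "db/reports"},
    "bob": {"db/reports"},
}

def is_allowed(user, resource):
    return resource in ACL.get(user, set())
```

Because `is_allowed` only ever sees a user name, swapping the authentication mechanism leaves the permission logic untouched, which is exactly the decoupling being argued for.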
Re: User Auth
Ok, good to know. I could go dig around the development code, but it might be much more expedient just to ask. Is there a special users database? What about sessions? It would be cool if those were just databases with special metadata (only settable by admin users, of course). What's in a user_ctx object at the moment? Does it correspond to an actual CouchDB record?

On Fri, Feb 20, 2009 at 3:56 PM, Chris Anderson jch...@apache.org wrote: On Fri, Feb 20, 2009 at 3:48 PM, Stefan Karpinski stefan.karpin...@gmail.com wrote: Thoughts (just brainstorming here): I think it makes sense to separate authentication and permissions. Pure authentication is just about verifying that the user is who they claim to be. Permissions are about deciding which users are allowed to see or do what. Cleanly separating is good: ideally you should be able to completely swap out your authentication mechanism, switching from, say, basic auth to SSL + digest auth, and keep the application logic about who gets access to what completely unchanged. For example, Apache accomplishes this by doing whatever authentication it's doing and then passing the REMOTE_USER environment variable (http://httpd.apache.org/docs/1.3/misc/FAQ-F.html#remote-user-var) with the authenticated user name. Whatever CGI or variant thereof (FCGI, etc.) then just does whatever it sees fit to do with that user name. Also, authentication is typically slow: even hashing a user/pass combo takes some CPU — this is not something that you want to have done on every request. That's why the notion of user sessions exists (at least from the security perspective; there are other notions of session). That argues for having the CouchDB server process manage authentication and letting the application developer define custom functions for deciding whether (user,resource) pairs are acceptable or not. I.e.
the CouchDB process somehow validates that the request is coming from someone who has provided adequate proof that they are who they claim to be, via HTTP basic/digest auth or whatever. Then the application can just decide whether the pre-authenticated user is allowed to access a particular resource.

This is pretty much how it works now. CouchDB manages sending the user_ctx object into the validation function. The user_ctx object then lets the function know if the user is an admin or in any other groups. Then the validation function may accept or reject the update accordingly. There is a related issue about how to let the client application know which user they are validated as, so that they can correctly fill out author fields etc.

More thoughts coming...

On Fri, Feb 20, 2009 at 3:16 PM, Stefan Karpinski stefan.karpin...@gmail.com wrote: I wish I could say we've got such a clear picture of it. Good to get in on the planning stages!

The easiest way to cleave the knot is probably to rely on 3rd party auth like OpenID or OAuth (I don't quite know which parts of which we're interested in). OpenID is great, but I don't think it's viable to force people to use it. Identifying users as URLs would make things easier on application devs, I think. If every app will need to implement something like this, it makes sense to me to have the CouchDB host manage the session, even if apps can keep their own user-preferences docs if they wish. Being logged into all the apps on a node seems a lot more intuitive than having to create accounts for each one. If the user is identified with a URL, then preferences etc can be replicated to other hosts while everything just works.

I think that nailing this problem would go a *long* way towards making CouchDB popular, not only for its nice distributed properties and such, but also because it would make writing modern web apps drastically easier. Because literally *every* non-trivial web application needs to do user authentication.
Having it _just work_ without having to worry about it is a massive win. Moreover, if the database was actually aware of application-level authentication and could enforce it, then it would increase the security of CouchDB-based web apps. Errors in business logic would be much less likely to accidentally expose data. How easy is it to forget in Rails that you need to filter the objects in some table by the user_id field?

On Fri, Feb 20, 2009 at 3:01 PM, Chris Anderson jch...@apache.org wrote: On Fri, Feb 20, 2009 at 1:51 PM, Stefan Karpinski stefan.karpin...@gmail.com wrote: I'm not entirely clear what level of user auth is being addressed here. On the one hand, there's the system-level sense of a user that traditional databases have: i.e. something equivalent to a UNIX user account, but in the database, which has to be created by an admin and can then be granted table-level access and various administrative rights (create user, create database, create
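The user_ctx flow Chris describes — CouchDB hands a user_ctx into the validation function, which then accepts or rejects the update — is sketched below. CouchDB's actual validation functions are written in JavaScript; this is the same logic in Python for illustration, and the exact user_ctx shape (`name`, `roles`) is an assumption:

```python
class Forbidden(Exception):
    """Raised to reject an update, analogous to throwing a 'forbidden'
    error from a CouchDB validation function."""

def validate_doc_update(new_doc, old_doc, user_ctx):
    """Accept (return) or reject (raise) an update based on who the
    pre-authenticated user is and what groups they belong to."""
    if "_admin" in user_ctx.get("roles", []):
        return  # admins may do anything
    if old_doc and old_doc.get("author") != user_ctx.get("name"):
        raise Forbidden("only the author may modify this document")
    if new_doc.get("author") != user_ctx.get("name"):
        raise Forbidden("author field must match the validated user")
```

Note how this also illustrates the open issue mentioned above: the function can *check* the author field against the validated user, but the client still needs some way to learn who it is validated as in order to fill that field in correctly.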
Re: User Auth
It just occurred to me how sweet it would be to be able to query a hypothetical sessions database and write views for it just like any other data in the database. Imagine having an admin CouchDB app (like Futon, or maybe part of Futon) that can dynamically show you numbers of active user sessions over time, histograms of session durations, etc.

On Fri, Feb 20, 2009 at 10:41 PM, Stefan Karpinski stefan.karpin...@gmail.com wrote: Ok, good to know. I could go dig around the development code, but it might be much more expedient just to ask. Is there a special users database? What about sessions? It would be cool if those were just databases with special metadata (only settable by admin users, of course). What's in a user_ctx object at the moment? Does it correspond to an actual CouchDB record?

On Fri, Feb 20, 2009 at 3:56 PM, Chris Anderson jch...@apache.org wrote: On Fri, Feb 20, 2009 at 3:48 PM, Stefan Karpinski stefan.karpin...@gmail.com wrote: Thoughts (just brainstorming here): I think it makes sense to separate authentication and permissions. Pure authentication is just about verifying that the user is who they claim to be. Permissions are about deciding which users are allowed to see or do what. Cleanly separating is good: ideally you should be able to completely swap out your authentication mechanism, switching from, say, basic auth to SSL + digest auth, and keep the application logic about who gets access to what completely unchanged. For example, Apache accomplishes this by doing whatever authentication it's doing and then passing the REMOTE_USER environment variable (http://httpd.apache.org/docs/1.3/misc/FAQ-F.html#remote-user-var) with the authenticated user name. Whatever CGI or variant thereof (FCGI, etc.) then just does whatever it sees fit to do with that user name. Also, authentication is typically slow: even hashing a user/pass combo takes some CPU — this is not something that you want to have done on every request.
That's why the notion of user sessions exists (at least from the security perspective; there are other notions of session). That argues for having the CouchDB server process manage authentication and letting the application developer define custom functions for deciding whether (user,resource) pairs are acceptable or not. I.e. the CouchDB process somehow validates that the request is coming from someone who has provided adequate proof that they are who they claim to be, via HTTP basic/digest auth or whatever. Then the application can just decide whether the pre-authenticated user is allowed to access a particular resource. This is pretty much how it works now. CouchDB manages sending the user_ctx object into the validation function. The user_ctx object then lets the function know if the user is an admin or in any other groups. Then the validation function may accept or reject the update accordingly. There is a related issue about how to let the client application know which user they are validated as, so that they can correctly fill out author fields etc. More thoughts coming... On Fri, Feb 20, 2009 at 3:16 PM, Stefan Karpinski stefan.karpin...@gmail.com wrote: I wish I could say we've got such a clear picture of it. Good to get in on the planning stages! The easiest way to cleave the knot is probably to rely on 3rd party auth like OpenID or OAuth (I don't quite know which parts of which we're interested in). OpenID is great, but I don't think it's viable to force people to use it. Identifying users as URLs would make things easier on application devs, I think. If every app will need to implement something like this, it makes sense to me to have the CouchDB host manage the session, even if apps can keep their own user-preferences docs if they wish. Being logged into all the apps on a node seems a lot more intuitive than having to create accounts for each one. 
If the user is identified with a URL, then preferences etc can be replicated to other hosts while everything just works. I think that nailing this problem would go a *long* way towards making CouchDB popular, not only for its nice distributed properties and such, but also because it would make writing modern web apps drastically easier. Because literally *every* non-trivial web application needs to do user authentication. Having it _just work_ without having to worry about it is a massive win. Moreover, if the database was actually aware of application-level authentication and could enforce it, then it would increase the security of CouchDB-based web apps. Errors in business logic would be much less likely to accidentally expose data. How easy is it to forget in Rails that you need to filter the objects in some table by the user_id field?

On Fri, Feb 20, 2009 at 3:01 PM, Chris Anderson jch...@apache.org wrote: On Fri, Feb 20, 2009 at 1:51 PM, Stefan Karpinski
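As a toy illustration of the sessions-as-a-database idea, here is the kind of aggregation (a histogram of session durations) that a view over session documents could drive in an admin app like Futon; the document shape and timestamps are invented for the example:

```python
from collections import Counter

# Hypothetical session docs, as they might appear in a sessions database
# (timestamps in seconds for simplicity).
sessions = [
    {"user": "alice", "started": 0, "ended": 95},
    {"user": "bob", "started": 10, "ended": 370},
    {"user": "alice", "started": 400, "ended": 460},
]

def duration_histogram(docs, bucket_seconds=60):
    """Bucket session durations, much as a map/reduce view would:
    map each doc to its duration bucket, then count per bucket."""
    counts = Counter()
    for doc in docs:
        bucket = (doc["ended"] - doc["started"]) // bucket_seconds
        counts[bucket * bucket_seconds] += 1
    return dict(counts)
```

If sessions really were ordinary documents, this kind of analysis would come for free from the existing view machinery, which is the appeal of the idea.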