Scaling db-per-user

Geoffrey Cox Thu, 16 Nov 2017 07:34:30 -0800

Hi Adam/Jan,

Thanks for the votes on Spiegel! I've split off this topic with a new email
subject as I don't want to hijack Jan's "_access per doc" thread.


Adam, below you say this in regards to db-per-user, "It works OK up to
several thousand users, but eventually you start running into a lot of
operational headaches. Running a million databases in a cluster is possible
but painful." Can you please elaborate on this? Is the pain in syncing all
these DBs? If so, Spiegel <https://github.com/redgeoff/spiegel> should
solve that problem, right?

I'm particularly interested in what happens when the number of DBs in a
cluster becomes very large. With Quizster <https://quizster.co>, we have
over 10,000 databases already and the number is growing quickly. I'm hoping
we'll eventually have millions of databases and it would be nice to know if
this design is going to break at some point. From what I understand, having
millions or even billions of databases shouldn't be an issue for CouchDB if
you horizontally scale and just keep adding nodes. Am I misunderstanding
this?

Thanks!

Geoff

On Thu, Nov 16, 2017 at 1:01 AM Jan Lehnardt <[email protected]> wrote:

>
> > On 16. Nov 2017, at 03:09, Adam Kocoloski <[email protected]> wrote:
> >
> > Oh also, meant to say - nice work on Spiegel :)
>
> +1, nice work Geoff,
>
> Adam summed it all up nicely.
>
> Best
> Jan
> --
>
> >
> >> On Nov 15, 2017, at 9:00 PM, Adam Kocoloski <[email protected]>
> wrote:
> >>
> >> Hiya Geoff,
> >>
> >> You’re right, there is a non-trivial overhead in calculating view
> responses that need to pull from every shard. On the other hand,
> maintaining a unique database file for every user is quite problematic at
> scale. It works OK up to several thousand users, but eventually you start
> running into a lot of operational headaches. Running a million databases in
> a cluster is possible but painful.
> >>
> >> I have some detailed thoughts about how we can improve the efficiency
> of queries scoped to a single user in a large sharded database, but that’s
> a topic for another thread :)
> >>
> >> Jan - wow, look at that! I’ll take a close look over the next couple of
> hours but a quick scan is encouraging.
> >>
> >> Adam
> >>
> >>> On Nov 15, 2017, at 5:30 PM, Geoffrey Cox <[email protected]> wrote:
> >>>
> >>> Hey Jan,
> >>>
> >>> I've been trying to solve a similar problem from a different angle
> using
> >>> efficient and scalable replication via spiegel
> >>> <https://github.com/redgeoff/spiegel>. I'm super excited that you are
> >>> drafting this level of access, but my major concern is on performance.
> From
> >>> what I gather, if you combine all the db-per-user docs into a single DB
> >>> then you'll have a massive DB. I know CouchDB is good at sharding, but
> >>> isn't there a significant performance implication when a user's docs
> are
> >>> being pulled from multiple shards on different servers? What about the
> >>> added overhead of calculating cross-server views, etc...
> >>>
> >>> When I think about how big companies, e.g. Facebook, solve these types of
> >>> problems, I imagine that they create a denormalized DB per user. Among
> >>> other things, this design allows the set of data that a user needs to be
> >>> relatively small and live on fewer servers per user. Doesn't this lead to
> >>> better performance?
> >>>
> >>> Even if this new level of access doesn't solve the db-per-user case
> >>> entirely, it will still be a useful addition as it would allow for more
> >>> data to be shared and less need for a DB-per-role setup. So, I'm all for
> >>> it!
> >>>
> >>> I'll take a closer look at these notes when I have some time, but I
> just
> >>> wanted to get you my high-level thoughts now. I'm sorry if any of this
> has
> >>> been based on some wild assumptions :)
> >>>
> >>> Exciting stuff!
> >>>
> >>> Geoff
> >>>
> >>>> On Wed, Nov 15, 2017 at 1:35 PM Jan Lehnardt <[email protected]> wrote:
> >>>>
> >>>> Hi all,
> >>>>
> >>>> in the midst of handling the security stuff I had a moment of clarity
> how
> >>>> the often requested per document permissions could be implemented. We
> had
> >>>> discussed a potential approach extensively in the February Boston
> Developer
> >>>> Summit (notes here:
> >>>>
> https://lists.apache.org/thread.html/09a5686bca8049010b82796cc0fe99ef27aed4983a3f02fd6956259f@%3Cdev.couchdb.apache.org%3E
> >>>> )
> >>>>
> >>>> What was so alluring about this proposal was that it solves per doc
> access
> >>>> control and per-user-db in one go. E.g. it would be able to share a
> single
> >>>> database with multiple distrusting users, allow them to have their
> own set
> >>>> of views, and even independently use their share of a single database
> as a
> >>>> replication endpoint without interfering with any of the other users
> on
> >>>> that database.
> >>>>
> >>>> I gave it a shot. Essentially, we need to build new indexes:
> by-access-id
> >>>> and by-access-seq to make all that work. I’m just focussing on the
> core of
> >>>> this, trying to re-use the existing couch_mrview/couch_index
> machinery as
> >>>> much as possible. Strictly, for replication only by-access-seq would be
> >>>> required, but by-access-id is a little easier to do, so I’ve done that
> >>>> first, and I believe the results are encouraging.
> >>>>
> >>>> I’ve put a diff against master into a gist for your perusal:
> >>>>
> >>>> https://gist.github.com/janl/20b218a3f0eafbf963ee28780261f9fc
> >>>>
> >>>>
> >>>> The core bits are:
> >>>>
> >>>>
> >>>>
> https://gist.github.com/janl/20b218a3f0eafbf963ee28780261f9fc#file-by-access-id-diff-L189-L215
> >>>>
> >>>> and
> >>>>
> >>>>
> >>>>
> https://gist.github.com/janl/20b218a3f0eafbf963ee28780261f9fc#file-by-access-id-diff-L189-L215
> >>>>
> >>>> Here’s an example Doc:
> >>>>
> >>>> {
> >>>> "_id":"1fb94bf8c3d5a73745f3cc4f5f000a8d",
> >>>> "_rev":"4-bcbc975e61bdb80f3de1b87f6cad6a76",
> >>>> "_access":["b"]
> >>>> }
> >>>>
> >>>> It shows up for user b:
> >>>>
> >>>>
> >>>> curl b:[email protected]:15984/a/_all_docs
> >>>>
> >>>> {"total_rows":2,"offset":0,"rows":[
> >>>> {"id":"1fb94bf8c3d5a73745f3cc4f5f000a8d","key":["b","1fb94bf8c3d5a73745f3cc4f5f000a8d"],"value":"4-bcbc975e61bdb80f3de1b87f6cad6a76"}
> >>>> ]}
> >>>>
> >>>> But not for user c:
> >>>>
> >>>>
> >>>> curl c:[email protected]:15984/a/_all_docs
> >>>>
> >>>> {"total_rows":2,"offset":2,"rows":[
> >>>>
> >>>> ]}
> >>>>
> >>>>
> >>>> * * *
> >>>>
> >>>>
> >>>> I’d like to get some general design feedback on this approach to find
> out
> >>>> if it is worth pursuing further. See “Next Steps” way below for my
> thinking
> >>>> on how to get by-access-seq going.
> >>>>
> >>>> The rest of this email are my notes from reading the source and
> trying to
> >>>> explain my thinking as well as guide folks that might not be very
> familiar
> >>>> with the CouchDB sources to follow along what is happening.
> >>>>
> >>>> I’d especially like to get some feedback about this from some of the
> folks
> >>>> here who don’t spend their days in the main Erlang codebase :)
> >>>>
> >>>> Let me know what you think.
> >>>>
> >>>> Thanks!
> >>>> Jan
> >>>>
> >>>> * * *
> >>>>
> >>>> CouchDB Access Notes
> >>>>
> >>>> Background:
> >>>>
> https://lists.apache.org/thread.html/09a5686bca8049010b82796cc0fe99ef27aed4983a3f02fd6956259f@%3Cdev.couchdb.apache.org%3E
> >>>>
> >>>> # Overview
> >>>>
> >>>> To solve the problems with the db-per-user pattern, we want to
> introduce
> >>>> document level access control. The result should be a single CouchDB
> >>>> database that can be used by multiple mutually untrusting users while
> >>>> retaining CouchDB’s full semantics.
> >>>>
> >>>> // TODO: link to appendix: problems with db-per-user
> >>>>
> >>>> We decided on an approach to define access control in documents with a new
> >>>> property `_access` which is specified as an array of strings and arrays.
> >>>> Strings represent usernames and roles, sub-arrays are used as logical AND,
> >>>> elements in the top level array are used as logical OR. For example, an
> >>>> _access field with the value [['management', 'senior'], 'ceo-jane'] would
> >>>> allow everyone with the roles 'management' AND 'senior', OR the user
> >>>> 'ceo-jane', access to that doc, but not e.g. users with roles 'development',
> >>>> 'senior', nor user 'vp-jenn'.
> >>>>
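A minimal sketch of the AND/OR rule just described, in plain JavaScript. The function name `canAccess` and the `{ name, roles }` shape of the user context are assumptions made for illustration; they are not taken from the patch, and admin handling is left out.

```javascript
// Hypothetical helper: does this user context satisfy an _access value?
// Top-level entries are OR-ed; the entries of a sub-array are AND-ed.
function canAccess(access, userCtx) {
  if (!Array.isArray(access) || access.length === 0) {
    return false; // no usable _access list: treated as private in this sketch
  }
  // Usernames and roles share one namespace here: the name plus every role.
  const identities = new Set([userCtx.name].concat(userCtx.roles || []));
  return access.some(function (entry) {
    if (Array.isArray(entry)) {
      // Sub-array: every listed name/role must be held by the user.
      return entry.length > 0 && entry.every(function (e) {
        return identities.has(e);
      });
    }
    return identities.has(entry);
  });
}

const access = [['management', 'senior'], 'ceo-jane'];
console.log(canAccess(access, { name: 'bob', roles: ['management', 'senior'] }));
console.log(canAccess(access, { name: 'ceo-jane', roles: [] }));
console.log(canAccess(access, { name: 'ann', roles: ['development', 'senior'] }));
console.log(canAccess(access, { name: 'vp-jenn', roles: [] }));
```

The first two calls are the cases the example allows (both roles, or the named user); the last two are the ones it rejects.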
> >>>> To achieve main CouchDB semantics, we need to introduce new behaviour for
> >>>> the _all_docs and _changes endpoints. The plan is to special-case this
> >>>> based on the authenticated user context (userCtx, e.g. username and
> >>>> associated roles, after authentication).
> >>>>
> >>>> The existing by-id and by-seq indexes are not equipped to efficiently
> >>>> return results per user, so we are introducing two new indexes
> (either can
> >>>> be optionally configured, depending on the use-case and performance
> and
> >>>> storage needs): by-access-id and by-access-seq. In contrast with
> by-id and
> >>>> by-seq, these indexes are not stored in the main database file, but
> in a
> >>>> separate file, ideally managed by the existing couch_index
> infrastructure.
> >>>>
> >>>>
> >>>> # Development considerations
> >>>>
> >>>> This first spike is only concerned with getting by-access-id to work with
> >>>> minimal effort.
> >>>>
> >>>> To get started, let’s look at how _all_docs works today using the
> by-id
> >>>> index.
> >>>>
> >>>> ## The Anatomy of a Clustered _all_docs Request
> >>>>
> >>>> CouchDB’s clustering layer consists of three main modules: chttpd, fabric
> >>>> and rexi. chttpd’s job is to handle everything HTTP and route requests to
> >>>> the right place in the rest of the code. It’s an HTTP router, mapping URLs,
> >>>> request methods and options to handler functions that do the work the
> >>>> requests are specified to fulfil.
> >>>>
> >>>> fabric’s job is to distribute a single request from the outside to
> >>>> multiple nodes of the cluster. Some requests require only talking to
> the
> >>>> local node, but that’s less important for the moment. fabric includes
> >>>> fabric_rpc, a module that turns a request to the cluster into one or
> more
> >>>> requests to other nodes in the cluster.
> >>>>
> >>>> rexi’s job is to know about the cluster state: which nodes are in the
> >>>> cluster, which of them are active/reachable/failed, which shards live
> on
> >>>> which nodes. fabric uses rexi to know which nodes to contact for which
> >>>> shards.
> >>>>
> >>>> After a bit of indirection, we find ourselves at the first
> >>>> _all_docs-specific function in chttpd_db.erl: all_docs_view/4:
> >>>>
> >>>> ```
> >>>> all_docs_view(Req, Db, Keys, OP) ->
> >>>>  Args0 = couch_mrview_http:parse_params(Req, Keys),
> >>>>  Args1 = Args0#mrargs{view_type=map},
> >>>>  Args2 = couch_mrview_util:validate_args(Args1),
> >>>>  Args3 = set_namespace(OP, Args2),
> >>>>  Options = [{user_ctx, Req#httpd.user_ctx}],
> >>>>  Max = chttpd:chunked_response_buffer_size(),
> >>>>  VAcc = #vacc{db=Db, req=Req, threshold=Max},
> >>>>  {ok, Resp} = fabric:all_docs(Db, Options, fun
> >>>> couch_mrview_http:view_cb/2, VAcc, Args3),
> >>>>  {ok, Resp#vacc.resp}.
> >>>> ```
> >>>>
> >>>> The first five lines handle query options and request parameters or
> >>>> arguments. The next three lines are the bulk of the job: start a
> response,
> >>>> call fabric:all_docs/5 with a callback to handle rows. The last line
> >>>> returns the accumulator that is returned by fabric:all_docs/5.
> >>>>
> >>>> fabric:all_docs/5 is a thin wrapper around fabric_view_all_docs:go/5.
> >>>> Before we jump down, we notice that there is also a
> >>>> fabric_view_changes.erl, which we should remember for the next
> iteration
> >>>> when we implement by-access-seq.
> >>>>
> >>>> go/5 comes in two variants and we’ll ignore the second here for the
> >>>> moment, because it is a performance optimisation. The main work for
> go/5 is
> >>>> in the top third of the function. First it gets all shards for the
> current
> >>>> database from mem3, then it starts a fabric_rpc worker for each
> shard, and
> >>>> then waits for the results to come back by calling go/6 with all
> workers.
> >>>> The bottom two thirds are timeout and error handling.
> >>>>
> >>>> go/6 registers the handle_message/3 function as the callback for
> >>>> rexi_utils’ recv/6 (read “receive”) function.
> >>>>
> >>>> handle_message/3 comes in a number of variants to handle rexi errors,
> >>>> receiving metadata, receiving result rows and a notification
> “complete”
> >>>> about all rows having been sent.
> >>>>
> >>>> Our next level down is looking into fabric_rpc and how it handles
> all_docs
> >>>> requests. fabric_rpc/3 is again a short wrapper, this time around
> >>>> couch_mrview:query_all_docs/4 which is the node-local function that
> handles
> >>>> querying.
> >>>>
> >>>> couch_mrview includes a bunch of functions for map/reduce views. It seems like
> >>>> a natural place to make our distinction between a normal by-id request and a
> >>>> by-access-id request.
> >>>>
> >>>> I’m skipping a step here, but with a little printf-debugging, I’ve found
> >>>> out that the `Db` variable we get passed in includes the authenticated
> >>>> userCtx, including username and any roles. We can use couch_db:is_admin/1
> >>>> to get a boolean back for the distinction we are going to have to make:
> >>>>
> >>>> ```
> >>>> query_all_docs(Db, Args0, Callback, Acc) ->
> >>>>  case couch_db:is_admin(Db) of
> >>>>      true -> query_all_docs_admin(Db, Args0, Callback, Acc);
> >>>>      false -> query_all_docs_access(Db, Args0, Callback, Acc)
> >>>>  end.
> >>>> ```
> >>>>
> >>>> query_all_docs_admin/4 is the existing query_all_docs/4 function and
> we’re
> >>>> introducing query_all_docs_access/4, that we now have to fill out with
> >>>> querying our view.
> >>>>
> >>>> Before we can do that, we need to understand how views work.
> >>>>
> >>>> ## The Anatomy of a View Request
> >>>>
> >>>> Querying a view has three stages:
> >>>>
> >>>> 1. define the view
> >>>> 2. build the view index
> >>>> 3. query the view index
> >>>>
> >>>> A view definition is always in a design document. It can be one or more
> >>>> JavaScript map/reduce functions, Erlang map/reduce functions, or a mango
> >>>> index definition.
> >>>>
> >>>> // TODO: link all these view definition options.
> >>>>
> >>>> Building the view index is an implicit step in CouchDB. View indexes
> are
> >>>> refreshed at query time, but only if there were any changes in the
> database
> >>>> since the last query. If no refresh is needed, the view result is
> returned
> >>>> from the index directly.
> >>>>
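The refresh-at-query-time behaviour can be pictured with a small in-memory model. This is only a sketch under stated assumptions: the class `LazyViewIndex`, its fields, and the `{ updateSeq, changes }` database shape are hypothetical and do not correspond to CouchDB modules. A real index keeps its rows sorted by key in a B-tree; this model does not sort.

```javascript
// Hypothetical model of a view index that is only brought up to date on query.
class LazyViewIndex {
  constructor(db, mapFn) {
    this.db = db;               // { updateSeq: number, changes: [{ seq, doc }] }
    this.mapFn = mapFn;         // (doc, emit) => void
    this.indexedSeq = 0;        // last database sequence reflected in the index
    this.rowsByDoc = new Map(); // doc id -> rows emitted for that doc
    this.refreshCount = 0;      // how often an actual refresh ran
  }

  refresh() {
    if (this.indexedSeq === this.db.updateSeq) {
      return; // no changes since the last query: serve from the index as-is
    }
    this.refreshCount += 1;
    for (const change of this.db.changes) {
      if (change.seq <= this.indexedSeq) continue; // already indexed
      const rows = [];
      if (!change.doc._deleted) {
        // Re-run the map function; deleted docs contribute no rows.
        this.mapFn(change.doc, (key, value) => {
          rows.push({ id: change.doc._id, key: key, value: value });
        });
      }
      this.rowsByDoc.set(change.doc._id, rows);
    }
    this.indexedSeq = this.db.updateSeq;
  }

  query() {
    this.refresh();
    return Array.from(this.rowsByDoc.values()).flat();
  }
}

const db = { updateSeq: 1, changes: [{ seq: 1, doc: { _id: 'a', n: 1 } }] };
const index = new LazyViewIndex(db, (doc, emit) => emit(doc.n, null));
console.log(index.query().length, index.refreshCount); // first query builds the index
console.log(index.query().length, index.refreshCount); // second query reuses it
```

Note that the model drops rows for deleted documents, which is the view behaviour that becomes a snag for a _changes-style index later in these notes.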
> >>>> // TODO: explain query_server
> >>>>
> >>>> Querying indexes follows a similar path through chttpd, fabric, rexi,
> >>>> fabric_rpc down to the per-node handlers in couch_mrview. Just a few
> lines
> >>>> below couch_mrview:query_all_docs/4 we find query_view/5 which decides
> >>>> between map and reduce requests. We care about map-only for now.
> >>>> query_view/5 is preceded by query_view/6 which includes a call to
> >>>> couch_mrview_util:get_view/4 which looks like it is where we want to
> look
> >>>> next, as the map_fold/5 called by query_view/5 is about looping over
> rows.
> >>>> We hope we can re-use all that logic, and maybe get_view/4 lets us
> find out
> >>>> how we can have it return our new view.
> >>>>
> >>>> get_view/4 calls get_view_index_state/4 which in turn calls
> >>>> get_view_index_pid/4 that finally calls into
> couch_index_server:get_index/4
> >>>> which looks like it returns the index for our request. Let’s have a
> look.
> >>>>
> >>>> get_index/4 will dive into get_index/2 eventually and that looks indeed
> >>>> like where we need to look. In there, we look up the view index in an ETS
> >>>> table (an in-memory database), and if we can’t find it there, start a new one.
> >>>> Either way, a view index is returned. The lookup is by DbName and
> >>>> Sig(nature), an md5 hash over the `views` property in a design doc, that
> >>>> also corresponds to the *.view filename of the view index.
> >>>>
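To make the lookup-or-start step concrete, here is a rough Node.js sketch. It is an illustration only: CouchDB computes the signature in Erlang over its own internal representation, so hashing `JSON.stringify(views)` is a stand-in, and `viewSignature`, `getIndex` and the `openIndexes` map (playing the role of the ETS table) are hypothetical names.

```javascript
const crypto = require('crypto');

// Stand-in signature: md5 over a JSON rendering of the `views` property.
function viewSignature(views) {
  return crypto.createHash('md5').update(JSON.stringify(views || {})).digest('hex');
}

// Plays the role of the in-memory table of open indexes.
const openIndexes = new Map();

// Look the index up by (dbName, sig); start a new one if it is not there.
function getIndex(dbName, ddoc) {
  const sig = viewSignature(ddoc.views);
  const cacheKey = dbName + '/' + sig;
  let index = openIndexes.get(cacheKey);
  if (!index) {
    index = { dbName: dbName, sig: sig, fileName: sig + '.view' };
    openIndexes.set(cacheKey, index);
  }
  return index; // either way, an index is returned
}

const ddoc = { _id: '_design/example', views: { byType: { map: 'function (doc) { emit(doc.type) }' } } };
console.log(getIndex('a', ddoc) === getIndex('a', ddoc)); // same views: same index
```

The consequence that matters for the next section: any change to `views` produces a different signature and therefore a different index, so a built-in index has to come with a definition whose hash is stable.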
> >>>>
> >>>> ## Faking the index
> >>>>
> >>>> So how would we get this to return the index we want to query? We
> need to
> >>>> create an index definition that matches the design doc `views` hash.
> Hm.
> >>>>
> >>>> It is relatively easy to produce a map function that behaves like we
> want:
> >>>>
> >>>> function (doc) {
> >>>>  var _access = doc._access
> >>>>  // skip docs without a usable _access list
> >>>>  if (!_access) { return }
> >>>>  if (!isArray(_access) || _access.length === 0) { return }
> >>>>  // one row per user or role, keyed by [user_or_role, doc id]
> >>>>  _access.forEach(function (user_or_role) {
> >>>>   emit([user_or_role, doc._id], doc._rev)
> >>>>  })
> >>>> }
> >>>>
> >>>> At query time, we’d have to match the requesting username and roles
> >>>> against the first element in the key-array and return the results,
> while
> >>>> replacing the key-array with the second element (the doc _id). All
> this
> >>>> doesn’t sound too hard. Good.
> >>>>
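The query-time step could look roughly like the following sketch. Assumptions are flagged in the comments: `indexRows` is a hand-written stand-in for the rows the map function above would emit, `allDocsForUser` is a hypothetical name, only plain string entries (no AND sub-arrays) are handled, and de-duplicating a doc reachable through both a name and a role is my own addition.

```javascript
// Stand-in for the by-access-id index: key = [user_or_role, doc_id], value = rev.
const indexRows = [
  { id: 'doc1', key: ['b', 'doc1'], value: '4-abc' },
  { id: 'doc2', key: ['c', 'doc2'], value: '1-def' },
  { id: 'doc1', key: ['editors', 'doc1'], value: '4-abc' },
  { id: 'doc3', key: ['editors', 'doc3'], value: '2-ghi' }
];

// Hypothetical query step: keep rows whose first key element is the requesting
// user's name or one of their roles, then replace the key array with the doc id.
function allDocsForUser(rows, userCtx) {
  const identities = new Set([userCtx.name].concat(userCtx.roles || []));
  const seen = new Set();
  const result = [];
  rows.forEach(function (row) {
    const who = row.key[0];
    const docId = row.key[1];
    if (!identities.has(who)) return; // not visible to this user
    if (seen.has(docId)) return;      // assumption: report each doc once
    seen.add(docId);
    result.push({ id: docId, key: docId, value: row.value });
  });
  // Assumption: plain string ordering by id, not CouchDB's collation.
  result.sort(function (x, y) {
    return x.id < y.id ? -1 : x.id > y.id ? 1 : 0;
  });
  return result;
}

console.log(allDocsForUser(indexRows, { name: 'b', roles: ['editors'] }));
console.log(allDocsForUser(indexRows, { name: 'c', roles: [] }));
```

In a real index the per-user rows are contiguous because the user or role is the first key element, so each identity is a single key-range read rather than a scan over all rows as in this model.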
> >>>> One snag though: if we think ahead and try to see how we could
> implement
> >>>> by-access-changes we get stuck: a view does not include rows for
> deleted
> >>>> documents while _changes does. In addition, the update sequence for a
> >>>> document is not available in a map function. So a regular view can
> not be
> >>>> used here.
> >>>>
> >>>> The filtering of deleted docs from a view index happens in
> >>>> couch_mrview:map_fold/3. So if we could augment that for our internal
> view
> >>>> requests, that could get us a long way towards reusing the rest of the
> >>>> couch_mrview/couch_index machinery.
> >>>>
> >>>> Note to self: make sure view compaction doesn’t remove deleted docs. A
> >>>> cursory glance at couch_mrview_compactor:compact_view_btree/5 suggests no
> >>>> such thing, but we need to validate this, and if it doesn’t hold, change
> >>>> view compaction to keep deleted entries.
> >>>>
> >>>> * * *
> >>>>
> >>>> We’ll start giving this a try by forking things off in
> >>>> couch_mrview:query_all_docs/4 and pretending to call a view with a
> mocked
> >>>> ddoc:
> >>>>
> >>>> {
> >>>> "_id": "_design/_access",
> >>>> "language": "_access",
> >>>> "views": {} // if needed
> >>>> } // TODO see which other fields it needs
> >>>>
> >>>> We’ll try this road to see if we get to the point where we get a “view
> >>>> index not found” error, because we didn’t actually have a view index
> yet.
> >>>> We’ll then try and see if we can produce one. We could try the other
> way
> >>>> around too, building the index first and then trying to query, but the
> >>>> approach doesn’t make much of a difference.
> >>>>
> >>>> First demo working:
> >>>> https://gist.github.com/janl/20b218a3f0eafbf963ee28780261f9fc
> >>>>
> >>>>
> >>>> Next Steps:
> >>>> - make sure the startkey/endkey/descending argument handling is all
> >>>> correct and complete
> >>>> - add key un-munging, so the user/role prefix gets filtered out on
> reads
> >>>> - handle roles:
> >>>>  - instead of querying the _access view once, we need to issue a
> >>>> multi-query, probably via #mrargs.multi_get, read up on how that is used
> >>>> - then we could start thinking about by-access-seq:
> >>>>  - we need access to the update-seq in
> >>>> couch_access_native_proc:map_doc, might require view protocol
> upgrade, or
> >>>> we have a post-process function that tags on the update-seq, we’ll
> see.
> >>>>  - the admin/access split we’re doing in query_all_docs should
> probably
> >>>> happen in couch_db:changes_since/5
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> # More specification details
> >>>>
> >>>>
> >>>> Documents in databases with _access enabled are private/admin-only by
> >>>> default, and can be made public with the special role _public.
> >>>>
> >>>> TODO: shared id space or auto-prefix ids
> >>>>
> >>>>
> >>>>
> >>
> >
>
> --
> Professional Support for Apache CouchDB:
> https://neighbourhood.ie/couchdb-support/
>
>
