Thanks Adam,

we talked about limiting the number of roles that a user could have to I think 
10 to keep the multi-query complexity at bay. And I think we also talked about 
just keeping the individual segment update-seq’s around, but we didn’t speak 
about the size/complexity of the combined seq-id if I recall correctly.

If 2^n-1 when n <= 10 is acceptable for seq-id size, we’re on track. If not, 
that’s a TBD.
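
As a rough sanity check on that number, here is a minimal sketch of the arithmetic behind the 2^n-1 figure Adam mentions: the count of non-empty role combinations for n roles. The function names and the sample role list are made up for illustration and are not part of the gist.

```javascript
// Number of non-empty subsets of n roles: 2^n - 1.
function roleCombinationCount(n) {
  return 2 ** n - 1;
}

// Enumerate the non-empty combinations for a small, hypothetical role list.
// Each bit of `mask` decides whether the role at that position is included.
function roleCombinations(roles) {
  const combos = [];
  for (let mask = 1; mask < (1 << roles.length); mask++) {
    combos.push(roles.filter(function (_, i) { return (mask & (1 << i)) !== 0; }));
  }
  return combos;
}
```

With the proposed cap of 10 roles, roleCombinationCount(10) gives 1023, so that would be the upper bound on the number of ranges a merged seq might have to cover per shard, assuming every combination is actually in use.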

My notes say I wanna tackle roles next, but given your feedback, I think I’ll 
try and get the username-only version of by-access-id and by-access-seq working 
first. That’s a good enough milestone to see through, and maybe even ship, 
before diving into roles right away.

That said, there’s quite a bit of work left, so I’m in no hurry to figure out 
roles and add them before shipping.

* * *

I’ve updated the gist and _all_docs now has an un-munged key member that’s 
just the doc-id.

The next iteration of this will live in a branch & PR, so we can discuss 
details there.

Best
Jan
--


> On 16. Nov 2017, at 04:26, Adam Kocoloski <kocol...@apache.org> wrote:
> 
> Hi Jan,
> 
> I took a closer read and I do think you’re on the right path. I certainly 
> agree with reusing the secondary index machinery to create the extra internal 
> indexes.
> 
> On the by-access-seq index … did we ever discuss how to efficiently track and 
> report the last observed sequences from the various ranges of the index to 
> which a user has access? I suppose the single seq from each contributing 
> shard could change to an array of seqs, one from each range. I do worry about 
> the size of the merged sequence (I’m remembering the 2^n-1 possible role 
> combinations granting access for a user possessing n roles). I didn’t see 
> anything in the summit notes.
> 
> Adam
> 
>> On Nov 15, 2017, at 4:35 PM, Jan Lehnardt <j...@apache.org> wrote:
>> 
>> Hi all,
>> 
>> in the midst of handling the security stuff I had a moment of clarity about 
>> how the often requested per document permissions could be implemented. We had 
>> discussed a potential approach extensively in the February Boston Developer 
>> Summit (notes here: 
>> https://lists.apache.org/thread.html/09a5686bca8049010b82796cc0fe99ef27aed4983a3f02fd6956259f@%3Cdev.couchdb.apache.org%3E)
>> 
>> What was so alluring about this proposal was that it solves per doc access 
>> control and per-user-db in one go. E.g. it would be able to share a single 
>> database with multiple distrusting users, allow them to have their own set 
>> of views, and even independently use their share of a single database as a 
>> replication endpoint without interfering with any of the other users on that 
>> database.
>> 
>> I gave it a shot. Essentially, we need to build new indexes: by-access-id 
>> and by-access-seq to make all that work. I’m just focussing on the core of 
>> this, trying to re-use the existing couch_mrview/couch_index machinery as 
>> much as possible. Strictly, for replication only by-access-seq would be 
>> required, but by-access-id is a little easier to do, so I’ve done that 
>> first, and I believe the results are encouraging.
>> 
>> I’ve put a diff against master into a gist for your perusal:
>> 
>> https://gist.github.com/janl/20b218a3f0eafbf963ee28780261f9fc
>> 
>> 
>> The core bits are:
>> 
>> https://gist.github.com/janl/20b218a3f0eafbf963ee28780261f9fc#file-by-access-id-diff-L189-L215
>> 
>> and
>> 
>> https://gist.github.com/janl/20b218a3f0eafbf963ee28780261f9fc#file-by-access-id-diff-L189-L215
>> 
>> Here’s an example Doc:
>> 
>> {
>> "_id":"1fb94bf8c3d5a73745f3cc4f5f000a8d",
>> "_rev":"4-bcbc975e61bdb80f3de1b87f6cad6a76",
>> "_access":["b"]
>> }
>> 
>> It shows up for user b:
>> 
>> 
>> curl b:b@127.0.0.1:15984/a/_all_docs
>> 
>> {"total_rows":2,"offset":0,"rows":[
>> {"id":"1fb94bf8c3d5a73745f3cc4f5f000a8d","key":["b","1fb94bf8c3d5a73745f3cc4f5f000a8d"],"value":"4-bcbc975e61bdb80f3de1b87f6cad6a76"}
>> ]}
>> 
>> But not for user c:
>> 
>> 
>> curl c:c@127.0.0.1:15984/a/_all_docs
>> 
>> {"total_rows":2,"offset":2,"rows":[
>> 
>> ]}
>> 
>> 
>> * * *
>> 
>> 
>> I’d like to get some general design feedback on this approach to find out if 
>> it is worth pursuing further. See “Next Steps” way below for my thinking on 
>> how to get by-access-seq going.
>> 
>> The rest of this email is my notes from reading the source, trying to 
>> explain my thinking and to help folks who might not be very familiar 
>> with the CouchDB sources follow along with what is happening.
>> 
>> I’d especially like to get some feedback about this from some of the folks 
>> here who don’t spend their days in the main Erlang codebase :)
>> 
>> Let me know what you think.
>> 
>> Thanks!
>> Jan
>> 
>> * * *
>> 
>> CouchDB Access Notes
>> 
>> Background: 
>> https://lists.apache.org/thread.html/09a5686bca8049010b82796cc0fe99ef27aed4983a3f02fd6956259f@%3Cdev.couchdb.apache.org%3E
>> 
>> # Overview
>> 
>> To solve the problems with the db-per-user pattern, we want to introduce 
>> document level access control. The result should be a single CouchDB 
>> database that can be used by multiple mutually untrusting users while 
>> retaining CouchDB’s full semantics.
>> 
>> // TODO: link to appendix: problems with db-per-user
>> 
>> We decided on an approach to define access control in documents with a new 
>> property `_access`, which is specified as an array of strings and arrays. 
>> Strings represent usernames and roles, sub-arrays are used as logical AND, 
>> elements in the top-level array are used as logical OR. For example, an 
>> `_access` field with the value [['management', 'senior'], 'ceo-jane'] would 
>> allow everyone with the roles 'management' AND 'senior', OR the user 
>> 'ceo-jane', access to that doc, but not e.g. users with only the roles 
>> 'development' and 'senior', nor the user 'vp-jenn'.
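
The AND/OR rule above can be sketched as a small predicate. This is an illustration only, not code from the gist: the function name `hasAccess` is hypothetical, the `{name, roles}` shape is assumed to mirror CouchDB's userCtx, usernames and roles are assumed to share one namespace (as the "strings represent usernames and roles" wording suggests), and treating an empty sub-array as "no access" is my assumption.

```javascript
// Returns true if the user context satisfies a document's _access value.
// access:  array of strings and arrays, e.g. [['management', 'senior'], 'ceo-jane']
// userCtx: {name: String, roles: [String]} (assumed shape)
function hasAccess(access, userCtx) {
  if (!Array.isArray(access)) { return false; }
  // Everything the user "is": their username plus all of their roles.
  const held = new Set([userCtx.name].concat(userCtx.roles));
  // Top-level elements are combined with logical OR.
  return access.some(function (entry) {
    if (Array.isArray(entry)) {
      // A sub-array is a logical AND over its members.
      // Assumption: an empty sub-array grants nothing.
      return entry.length > 0 &&
        entry.every(function (e) { return held.has(e); });
    }
    // A plain string matches a username or a role.
    return held.has(entry);
  });
}
```

For the example value above, a user with roles 'management' and 'senior' and the user 'ceo-jane' both pass, while a user with roles 'development' and 'senior' and the user 'vp-jenn' do not.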
>> 
>> To achieve main CouchDB semantics, we need to introduce new behaviour for 
>> the _all_docs and _changes endpoints. The plan is to special-case this based 
>> on the authenticated user context (userCtx, e.g. username and associated 
>> roles, after authentication).
>> 
>> The existing by-id and by-seq indexes are not equipped to efficiently return 
>> results per user, so we are introducing two new indexes (either can be 
>> optionally configured, depending on the use-case and performance and storage 
>> needs): by-access-id and by-access-seq. In contrast with by-id and by-seq, 
>> these indexes are not stored in the main database file, but in a separate 
>> file, ideally managed by the existing couch_index infrastructure.
>> 
>> 
>> # Development considerations
>> 
>> This first spike is only concerned with getting by-access-id to work with 
>> minimal effort.
>> 
>> To get started, let’s look at how _all_docs works today using the by-id 
>> index.
>> 
>> ## The Anatomy of a Clustered _all_docs Request
>> 
>> CouchDB’s clustering layer consists of three main modules: chttpd, fabric 
>> and rexi. chttpd’s job is to handle everything HTTP and route requests to 
>> the right place in the rest of the code. It’s an HTTP router, mapping URLs, 
>> request methods and options to handler functions that do the work the 
>> requests are specified to fulfil.
>> 
>> fabric’s job is to distribute a single request from the outside to multiple 
>> nodes of the cluster. Some requests require only talking to the local node, 
>> but that’s less important for the moment. fabric includes fabric_rpc, a 
>> module that turns a request to the cluster into one or more requests to 
>> other nodes in the cluster.
>> 
>> rexi’s job is to know about the cluster state: which nodes are in the cluster, 
>> which of them are active/reachable/failed, which shards live on which nodes. 
>> fabric uses rexi to know which nodes to contact for which shards.
>> 
>> After a bit of indirection, we find ourselves at the first 
>> _all_docs-specific function in chttpd_db.erl: all_docs_view/4:
>> 
>> ```
>> all_docs_view(Req, Db, Keys, OP) ->
>>   Args0 = couch_mrview_http:parse_params(Req, Keys),
>>   Args1 = Args0#mrargs{view_type=map},
>>   Args2 = couch_mrview_util:validate_args(Args1),
>>   Args3 = set_namespace(OP, Args2),
>>   Options = [{user_ctx, Req#httpd.user_ctx}],
>>   Max = chttpd:chunked_response_buffer_size(),
>>   VAcc = #vacc{db=Db, req=Req, threshold=Max},
>>   {ok, Resp} = fabric:all_docs(Db, Options, fun couch_mrview_http:view_cb/2,
>>       VAcc, Args3),
>>   {ok, Resp#vacc.resp}.
>> ```
>> 
>> The first five lines handle query options and request parameters or 
>> arguments. The next three lines are the bulk of the job: start a response, 
>> call fabric:all_docs/5 with a callback to handle rows. The last line returns 
>> the accumulator that is returned by fabric:all_docs/5.
>> 
>> fabric:all_docs/5 is a thin wrapper around fabric_view_all_docs:go/5. Before 
>> we jump down, we notice that there is also a fabric_view_changes.erl, which 
>> we should remember for the next iteration when we implement by-access-seq.
>> 
>> go/5 comes in two variants and we’ll ignore the second here for the moment, 
>> because it is a performance optimisation. The main work for go/5 is in the 
>> top third of the function. First it gets all shards for the current database 
>> from mem3, then it starts a fabric_rpc worker for each shard, and then waits 
>> for the results to come back by calling go/6 with all workers. The bottom 
>> two thirds are timeout and error handling.
>> 
>> go/6 registers the handle_message/3 function as the callback for rexi_utils’ 
>> recv/6 (read “receive”) function.
>> 
>> handle_message/3 comes in a number of variants to handle rexi errors, 
>> receiving metadata, receiving result rows and a notification “complete” 
>> about all rows having been sent.
>> 
>> Our next level down is looking into fabric_rpc and how it handles all_docs 
>> requests. fabric_rpc/3 is again a short wrapper, this time around 
>> couch_mrview:query_all_docs/4 which is the node-local function that handles 
>> querying.
>> 
>> couch_mrview includes a bunch of functions for map/reduce views. It seems like a 
>> natural place to make our distinction between a normal by-id request and a 
>> by-access-id request.
>> 
>> I’m skipping a step here, but with a little printf-debugging, I’ve found out 
>> that the `Db` variable we get passed in, includes the authenticated userCtx 
>> including username and any roles.  We can use couch_db:is_admin/1 to get a 
>> boolean back for the distinction we are going to have to make:
>> 
>> ```
>> query_all_docs(Db, Args0, Callback, Acc) ->
>>   case couch_db:is_admin(Db) of
>>       true -> query_all_docs_admin(Db, Args0, Callback, Acc);
>>       false -> query_all_docs_access(Db, Args0, Callback, Acc)
>>   end.
>> ```
>> 
>> query_all_docs_admin/4 is the existing query_all_docs/4 function and we’re 
>> introducing query_all_docs_access/4, that we now have to fill out with 
>> querying our view.
>> 
>> Before we can do that, we need to understand how views work.
>> 
>> 
>> ## The Anatomy of a View Request
>> 
>> Querying a view has three stages:
>> 
>> 1. define the view
>> 2. build the view index
>> 3. query the view index
>> 
>> A view definition is always in a design document. It can be one or more 
>> JavaScript map/reduce functions, Erlang map/reduce functions, or a mango 
>> index definition.
>> 
>> // TODO: link all these view definition options.
>> 
>> Building the view index is an implicit step in CouchDB. View indexes are 
>> refreshed at query time, but only if there were any changes in the database 
>> since the last query. If no refresh is needed, the view result is returned 
>> from the index directly.
>> 
>> // TODO: explain query_server
>> 
>> Querying indexes follows a similar path through chttpd, fabric, rexi, 
>> fabric_rpc down to the per-node handlers in couch_mrview. Just a few lines 
>> below couch_mrview:query_all_docs/4 we find query_view/5 which decides 
>> between map and reduce requests. We care about map-only for now. 
>> query_view/5 is preceded by query_view/6 which includes a call to 
>> couch_mrview_util:get_view/4 which looks like it is where we want to look 
>> next, as the map_fold/5 called by query_view/5 is about looping over rows. 
>> We hope we can re-use all that logic, and maybe get_view/4 lets us find out 
>> how we can have it return our new view.
>> 
>> get_view/4 calls get_view_index_state/4 which in turn calls 
>> get_view_index_pid/4 that finally calls into couch_index_server:get_index/4 
>> which looks like it returns the index for our request. Let’s have a look.
>> 
>> get_index/4 will dive into get_index/2 eventually, and that indeed looks like 
>> where we need to look. In there, we look up the view index in an ETS table (an 
>> in-memory database) and, if it can’t be found there, start a new one. Either 
>> way, a view index is returned. The lookup is by DbName and Sig(nature), an 
>> md5 hash over the `views` property in a design doc, which also corresponds to 
>> the *.view filename of the view index.
>> 
>> 
>> ## Faking the index
>> 
>> So how would we get this to return the index we want to query? We need to 
>> create an index definition that matches the design doc `views` hash. Hm.
>> 
>> It is relatively easy to produce a map function that behaves like we want:
>> 
>> function (doc) {
>>   var _access = doc._access
>>   if (!_access) { return }
>>   if (!isArray(_access) || _access.length === 0) { return }
>>   _access.forEach(function (user_or_role) {
>>     emit([user_or_role, doc._id], doc._rev)
>>   })
>> }
>> 
>> At query time, we’d have to match the requesting username and roles against 
>> the first element in the key-array and return the results, while replacing 
>> the key-array with the second element (the doc _id). All this doesn’t sound 
>> too hard. Good.
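
The query-time step described above can be sketched as follows, for the username-only case. This is a simplified illustration under stated assumptions: `rowsForUser` is a hypothetical name, the rows are assumed to have the {id, key, value} shape that the map function emits, and a real implementation would presumably do a key-range read on the index rather than filter an in-memory array.

```javascript
// rows: view rows as emitted by the map function above, i.e.
//       {id: docId, key: [userOrRole, docId], value: rev}
// Returns only the rows for `username`, with the key "un-munged" back to
// the plain doc id so the result looks like a regular _all_docs row.
function rowsForUser(rows, username) {
  return rows
    .filter(function (row) { return row.key[0] === username; })
    .map(function (row) {
      return { id: row.id, key: row.key[1], value: row.value };
    });
}
```

Applied to the example doc with `"_access":["b"]`, user b would get one row keyed by the doc id and user c would get an empty list.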
>> 
>> One snag though: if we think ahead and try to see how we could implement 
>> by-access-changes we get stuck: a view does not include rows for deleted 
>> documents while _changes does. In addition, the update sequence for a 
>> document is not available in a map function. So a regular view can not be 
>> used here.
>> 
>> The filtering of deleted docs from a view index happens in 
>> couch_mrview:map_fold/3. So if we could augment that for our internal view 
>> requests, that could get us a long way towards reusing the rest of the 
>> couch_mrview/couch_index machinery.
>> 
>> Note to self: make sure view compaction doesn’t remove deleted docs. A 
>> cursory glance at couch_mrview_compactor:compact_view_btree/5 suggests it 
>> does no such thing, but we need to validate this and, if it doesn’t hold, 
>> change view compaction to keep deleted entries.
>> 
>> * * *
>> 
>> We’ll start giving this a try by forking things off in 
>> couch_mrview:query_all_docs/4 and pretending to call a view with a mocked 
>> ddoc:
>> 
>> {
>>   "_id": "_design/_access",
>>   "language": "_access",
>>   "views": {} // if needed
>> } // TODO see which other fields it needs
>> 
>> We’ll try this road to see if we get to the point where we get a “view index 
>> not found” error, because we didn’t actually have a view index yet. We’ll 
>> then try and see if we can produce one. We could try the other way around 
>> too, building the index first and then trying to query, but the approach 
>> doesn’t make much of a difference.
>> 
>> First demo working: 
>> https://gist.github.com/janl/20b218a3f0eafbf963ee28780261f9fc
>> 
>> 
>> Next Steps:
>> - make sure the startkey/endkey/descending argument handling is all correct 
>> and complete
>> - add key un-munging, so the user/role prefix gets filtered out on reads
>> - handle roles:
>>   - instead of querying the _access view once, we need to issue a 
>> multi-query, probably via #mrargs.multi_get, read up on how that is used
>> - then we could start thinking about by-access-seq:
>>   - we need access to the update-seq in couch_access_native_proc:map_doc, 
>> might require view protocol upgrade, or we have a post-process function that 
>> tags on the update-seq, we’ll see.
>>   - the admin/access split we’re doing in query_all_docs should probably 
>> happen in couch_db:changes_since/5
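
To make the "handle roles" step a bit more concrete, here is one way the results of such a multi-query could be combined. It is only a sketch: `mergeAccessRows` is a hypothetical name, it assumes one already un-munged result list per queried username or role, and it uses plain string comparison as a stand-in for whatever ordering the real _all_docs path applies.

```javascript
// resultsPerKey: one array of rows ({id, key, value}) per queried
//                username or role.
// A doc that is reachable via several of the user's roles shows up in
// several lists, so rows are de-duplicated by doc id before sorting.
function mergeAccessRows(resultsPerKey) {
  const byId = new Map();
  resultsPerKey.forEach(function (rows) {
    rows.forEach(function (row) {
      if (!byId.has(row.id)) { byId.set(row.id, row); }
    });
  });
  // Return rows ordered by doc id, as an _all_docs client would expect.
  return Array.from(byId.values()).sort(function (a, b) {
    return a.id < b.id ? -1 : (a.id > b.id ? 1 : 0);
  });
}
```

Note that this only covers plain usernames and roles joined by OR. The AND sub-arrays would need their own handling, which is where the 2^n-1 combinations come from.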
>> 
>> 
>> 
>> 
>> 
>> 
>> # More specification details
>> 
>> 
>> Documents in databases with _access enabled are private/admin-only by 
>> default, and can be made public with the special role _public.
>> 
>> TODO: shared id space or auto-prefix ids
>> 
>> 
> 

-- 
Professional Support for Apache CouchDB:
https://neighbourhood.ie/couchdb-support/
