Chris, thank you for expressing the problem in such succinct terms. My problem does appear to be one of RBAC versus ABAC.
Josh, thanks for your observations.

I will try to summarize my updated understanding of the issue based on your replies: At its core, Accumulo appears to encourage ABAC by mandating that the data be *classified* with visibility labels, and then building RBAC (if needed) on top of this by authorizing users to access certain classifications. I guess this model fits well for data that is amenable to mining for classifications (e.g., SSNs, email addresses, phone numbers). However, in my case, the data I am dealing with is homogeneous in nature, and the actual classification for visibility will be performed by examining the value rather than the nature of the data. Therefore, I can go with RBAC directly, as ABAC will be of little use for this type of system.

Do let me know if my observation above has any inaccuracies.

As a side note, it did help me a lot to think about visibilities as *data classifications* rather than visibilities, considering that there are so many similar-sounding terms in the Accumulo security model (authentication, permissions, authorization, ...).

Once again, thank you for your help.

Srikanth Viswanathan

On Mon, Feb 16, 2015 at 7:06 PM, Josh Elser <josh.el...@gmail.com> wrote:
> I think A1 is ultimately the right thing, as well.
>
> The problem is not that you don't know how to accurately label your data
> (which is the biggest problem in Accumulo as updating the visibility is very
> costly), it's that it's hard to be able to add your enrichment data after
> the fact.
>
> The reason that's hard, though, is because your enrichment client needs to
> act like a client -- have authorizations to read the original data. It seems
> reasonable to me to try to tackle the problem of ensuring the process that
> needs to enrich some data has the appropriate authorizations to read that
> data.
>
> Christopher wrote:
>>
>> I think part of your question pertains to the differences between ABAC
>> (attribute-based access controls) and RBAC (role-based access controls).
>>
>> In both A1 and A2, you're thinking in terms of RBAC. The only real
>> difference is whether you want to have one additional role, or
>> repurpose the existing ones. However, Accumulo's data visibilities are
>> more like ABAC. Of course, you can use whatever method works for you,
>> but the intent is more ABAC than RBAC.
>>
>> The main pitfall with RBAC is that roles and users change, and data is
>> complex and large and you don't want to re-write it when things change.
>> However, attributes are properties of the data itself, upon which you
>> can make access decisions. These attributes should be things that don't
>> change... they are inherent to the data (ideal).
>>
>> To think in terms of ABAC, the main question to ask is "What properties
>> of this data element will determine who can access it?". For example,
>> does it contain personal information or medical history? Does it contain
>> usernames and email addresses? What is it about this data that makes it
>> worth protecting? Does it need to be protected? I think that's mainly
>> what John Vines' talk was about (the differences between RBAC and ABAC).
>>
>> If RBAC is more appropriate for your data, I'd probably go with A1,
>> because it's easier to implement and maintain. The biggest drawback is
>> that you require additional storage space to store the additional role
>> in each visibility. Because of some internal optimizations, if you go
>> this route, I'd recommend making this role a prefix, rather than a
>> suffix ("SUPERUSER|(restOfVisibility)" vs. "(restOfVisibility)|SUPERUSER").
>>
>>
>> --
>> Christopher L Tubbs II
>> http://gravatar.com/ctubbsii
>>
>> On Mon, Feb 16, 2015 at 5:39 PM, Srikanth Viswanathan
>> <srikant...@gmail.com> wrote:
>>
>> Hello,
>>
>> I'm using Accumulo to store raw and value-added data and expose this
>> data to a small number of end users.
>> During ingestion, the system will
>> connect to Accumulo as a single Accumulo user called, say, "ingestor".
>> This user will first store data, and then later in the ingestion
>> pipeline read the same data back to add value and write the
>> value-added data back. End-users will connect as themselves (i.e.,
>> individual Accumulo accounts) to read the data.
>>
>> The questions I am facing are:
>> Q1. How to manage the read authorizations for the ingestor?
>> Q2. How to ensure data in Accumulo is never orphaned due to current
>> users lacking authorizations to read certain columns?
>>
>> It seems to me that I have two options, both of which will solve both
>> my problems above:
>> A1. Grant the ingestor a single authorization and store the data with
>> labels that allow the ingestor access via this label, e.g.,
>> "ingestor|(foo_end_user_group|bar_end_user_group)". By doing this, I
>> don't have to maintain special authorization logic for the ingestor,
>> and I can also fall back on it to read data that might otherwise be
>> orphaned.
>> A2. Store only the end user groups in the visibility labels
>> ("foo_end_user_group|bar_end_user_group"), and
>> force the ingestion user to obtain all group authorizations needed in
>> order to read the data. This will require special logic to update the
>> ingestor's authorizations when a new authorization is added to the
>> system.
>>
>> A1 seems simpler to me, but I heard John Vines discourage this in his
>> talk at the 2014 Accumulo Summit. Doesn't the user in either case see
>> the same set of data (i.e., "everything")? What then are the potential
>> pitfalls of A1 compared to A2?
>>
>> Thank you!
>>
>> Srikanth Viswanathan
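
For reference, the difference between A1 and A2 in the original question can be sketched with a toy model in Python. This is not Accumulo's API: the function names (can_read, label_a1, label_a2) and the extra group baz_end_user_group are made up for illustration, and the check assumes visibility expressions that use only labels, '|' and parentheses, as in the examples quoted above. Real Accumulo visibilities also allow '&', which this sketch rejects.

```python
# Toy model of the A1/A2 labelling options. NOT Accumulo's API.
# Assumption: visibility expressions use only labels, '|' and parentheses,
# as in the examples from this thread; '&' is rejected.

def can_read(visibility, authorizations):
    """Return True if any label in an OR-only expression is authorized."""
    if "&" in visibility:
        raise ValueError("this sketch handles OR-only expressions")
    stripped = visibility.replace("(", "").replace(")", "")
    labels = {term for term in stripped.split("|") if term}
    if not labels:
        return True  # an empty visibility is readable by everyone
    return bool(labels & set(authorizations))


def label_a1(groups):
    """A1: every label carries the ingestor role as a prefix."""
    return "ingestor|(" + "|".join(groups) + ")"


def label_a2(groups):
    """A2: labels name only the end-user groups."""
    return "|".join(groups)


known_groups = ["foo_end_user_group", "bar_end_user_group"]
a1_ingestor_auths = {"ingestor"}       # one authorization, never changes
a2_ingestor_auths = set(known_groups)  # must track every group in the system

# A hypothetical group introduced after the ingestor's authorizations
# were last updated.
new_group = "baz_end_user_group"

a1_ok = can_read(label_a1([new_group]), a1_ingestor_auths)
a2_ok = can_read(label_a2([new_group]), a2_ingestor_auths)
print("A1 ingestor can read new data:", a1_ok)
print("A2 ingestor can read new data:", a2_ok)
```

In this model the A2 ingestor cannot read the new entry until it is also granted baz_end_user_group, which is the extra bookkeeping A2 requires. A1 avoids that bookkeeping at the cost of a longer label on every entry, which matches the storage drawback Christopher mentions.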