[yanking away most of the cross-posts...]

An interesting cross-component project, Avik. Any plans to incubate it in Apache?
Cos

On Mon, Feb 25, 2013 at 11:46PM, Dey, Avik wrote:
> Project Rhino
>
> As the Apache Hadoop ecosystem extends into new markets and sees new use
> cases with security and compliance challenges, the benefits of processing
> sensitive and legally protected data with Hadoop must be coupled with
> protection for private information that limits performance impact. Project
> Rhino <https://github.com/intel-hadoop/project-rhino/> is our open source
> effort to enhance the existing data protection capabilities of the Hadoop
> ecosystem to address these challenges, and contribute the code back to
> Apache.
>
> The core of the Apache Hadoop ecosystem as it is commonly understood is:
>
> - Core: A set of shared libraries
> - HDFS: The Hadoop filesystem
> - MapReduce: Parallel computation framework
> - ZooKeeper: Configuration management and coordination
> - HBase: Column-oriented database on HDFS
> - Hive: Data warehouse on HDFS with SQL-like access
> - Pig: Higher-level programming language for Hadoop computations
> - Oozie: Orchestration and workflow management
> - Mahout: A library of machine learning and data mining algorithms
> - Flume: Collection and import of log and event data
> - Sqoop: Imports data from relational databases
>
> These components are all separate projects and therefore cross cutting
> concerns like authN, authZ, a consistent security policy framework,
> consistent authorization model and audit coverage are loosely coordinated.
> Some security features expected by our customers, such as encryption, are
> simply missing. Our aim is to take a full stack view and work with the
> individual projects toward consistent concepts and capabilities, filling gaps
> as we go.
>
> Our initial goals are:
>
> 1) Framework support for encryption and key management
>
> There is currently no framework support for encryption or key management. We
> will add this support into Hadoop Core and integrate it across the ecosystem.
>
> 2) A common authorization framework for the Hadoop ecosystem
>
> Each component currently has its own authorization engine. We will abstract
> the common functions into a reusable authorization framework with a
> consistent interface. Where appropriate we will either modify an existing
> engine to work within this framework, or we will plug in a common default
> engine. Therefore we also must normalize how security policy is expressed and
> applied by each component. Core, HDFS, ZooKeeper, and HBase currently support
> simple access control lists (ACLs) composed of users and groups. We see this
> as a good starting point. Where necessary we will modify components so they
> each offer equivalent functionality, and build support into others.
>
> 3) Token based authentication and single sign on
>
> Core, HDFS, ZooKeeper, and HBase currently support Kerberos authentication at
> the RPC layer, via SASL. However this does not provide valuable attributes
> such as group membership, classification level, organizational identity, or
> support for user defined attributes. Hadoop components must interrogate
> external resources for discovering these attributes and at scale this is
> problematic. There is also no consistent delegation model. HDFS has a simple
> delegation capability, and only Oozie can take limited advantage of it. We
> will implement a common token based authentication framework to decouple
> internal user and service authentication from external mechanisms used to
> support it (like Kerberos).
>
> 4) Extend HBase support for ACLs to the cell level
>
> Currently HBase supports setting access controls at the table or column
> family level. However, many use cases would benefit from the additional
> capability to do this on a per cell basis. In fact for many users dealing
> with sensitive information the ability to do this is crucial.
>
> 5) Improve audit logging
>
> Audit messages from various Hadoop components do not use a unified or even
> consistent format. This makes analysis of logs for verifying
> compliance or taking corrective action difficult. We will build a common
> audit logging facility as part of the common authorization framework work. We
> will also build a set of common audit log processing tools for transforming
> them to different industry standard formats, for supporting compliance
> verification, and for triggering responses to policy violations.
>
> Current JIRAs:
>
> As part of this ongoing effort we are contributing our work to-date against
> the JIRAs listed below. As you may appreciate, the goals for Project Rhino
> cover a number of different Apache projects, and the scope of work is
> significant and likely to only increase as we get additional community input.
> We also appreciate that there may be others in the Apache community that may
> be working on some of this or are interested in contributing to it. If so, we
> look forward to partnering with you in Apache to accelerate this effort so
> the Apache community can see the benefits from our collective efforts sooner.
> You can also find a more detailed version of this announcement at Project
> Rhino <https://github.com/intel-hadoop/project-rhino/>.
>
> Please feel free to reach out to us by commenting on the JIRAs below:
>
> HBASE-6222: Add per-KeyValue Security
> <https://issues.apache.org/jira/browse/hbase-6222>
>
> HADOOP-9331: Hadoop crypto codec framework and crypto codec implementations
> <https://issues.apache.org/jira/browse/hadoop-9331> and related sub-tasks
>
> MAPREDUCE-5025: Key Distribution and Management for supporting crypto codec
> in Map Reduce <https://issues.apache.org/jira/browse/mapreduce-5025> and
> related JIRAs
>
> HBASE-7544: Transparent table/CF encryption
> <https://issues.apache.org/jira/browse/hbase-7544>
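
To make goal 4 (cell-level ACLs, tracked in HBASE-6222) a bit more concrete, here is a rough sketch of how user/group ACLs might be resolved at table, column family, and cell scope. This is not HBase code and is not taken from the JIRA: the `Acl` and `CellAclStore` names, the "most specific ACL wins" rule, and the default-deny behaviour are all assumptions made purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Acl:
    """Hypothetical ACL: the users and groups allowed to read."""
    users: set = field(default_factory=set)
    groups: set = field(default_factory=set)

    def permits(self, user, user_groups):
        # Allowed if named directly or through any group membership.
        return user in self.users or bool(self.groups & set(user_groups))


class CellAclStore:
    """Toy in-memory store of ACLs at table, column family, or cell scope."""

    def __init__(self):
        self._acls = {}

    def set_acl(self, acl, table, family=None, row=None, qualifier=None):
        # The key length encodes the scope: 1 = table, 2 = family, 4 = cell.
        if row is not None and qualifier is not None:
            key = (table, family, row, qualifier)
        elif family is not None:
            key = (table, family)
        else:
            key = (table,)
        self._acls[key] = acl

    def can_read(self, user, user_groups, table, family, row, qualifier):
        # Assumed rule: the most specific ACL wins (cell, then family, then table).
        for key in ((table, family, row, qualifier), (table, family), (table,)):
            acl = self._acls.get(key)
            if acl is not None:
                return acl.permits(user, user_groups)
        return False  # Assumed default: deny when no ACL exists at any scope.


store = CellAclStore()
store.set_acl(Acl(groups={"analysts"}), "patients")                       # table scope
store.set_acl(Acl(users={"dr_lee"}), "patients", "info", "row1", "ssn")   # cell scope

print(store.can_read("amy", ["analysts"], "patients", "info", "row1", "name"))  # True
print(store.can_read("amy", ["analysts"], "patients", "info", "row1", "ssn"))   # False
print(store.can_read("dr_lee", [], "patients", "info", "row1", "ssn"))          # True
```

In this sketch a cell-level ACL overrides the table-level one rather than adding to it, which is why the analyst loses access to the `ssn` cell. Whether cell ACLs should override or union with coarser grants is a design question for the JIRA discussion, not something this example settles.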