Re: ANNOUNCEMENT: Project Rhino: Enhanced Data Protection for the Apache Hadoop Ecosystem

Konstantin Boudnik Mon, 25 Feb 2013 16:18:59 -0800

[yanking away most of the cross-posts...]

An interesting cross-component project, Avik. Any plans to incubate it in Apache?


Cos

On Mon, Feb 25, 2013 at 11:46PM, Dey, Avik wrote:
> Project Rhino
> 
> As the Apache Hadoop ecosystem extends into new markets and sees new use
> cases with security and compliance challenges, the benefits of processing
> sensitive and legally protected data with Hadoop must be coupled with
> protection for private information that limits performance impact. Project
> Rhino<https://github.com/intel-hadoop/project-rhino/> is our open source
> effort to enhance the existing data protection capabilities of the Hadoop
> ecosystem to address these challenges, and contribute the code back to
> Apache.
> 
> The core of the Apache Hadoop ecosystem as it is commonly understood is:
> 
> - Core: A set of shared libraries
> - HDFS: The Hadoop filesystem
> - MapReduce: Parallel computation framework
> - ZooKeeper: Configuration management and coordination
> - HBase: Column-oriented database on HDFS
> - Hive: Data warehouse on HDFS with SQL-like access
> - Pig: Higher-level programming language for Hadoop computations
> - Oozie: Orchestration and workflow management
> - Mahout: A library of machine learning and data mining algorithms
> - Flume: Collection and import of log and event data
> - Sqoop: Imports data from relational databases
> 
> These components are all separate projects, and therefore cross-cutting 
> concerns like authentication (authN), authorization (authZ), a consistent 
> security policy framework, a consistent authorization model, and audit 
> coverage are only loosely coordinated. 
> Some security features expected by our customers, such as encryption, are 
> simply missing. Our aim is to take a full stack view and work with the 
> individual projects toward consistent concepts and capabilities, filling gaps 
> as we go.
> 
> Our initial goals are:
> 
> 1) Framework support for encryption and key management
> 
> There is currently no framework support for encryption or key management. We 
> will add this support into Hadoop Core and integrate it across the ecosystem.
> 
> 2) A common authorization framework for the Hadoop ecosystem
> 
> Each component currently has its own authorization engine. We will abstract 
> the common functions into a reusable authorization framework with a 
> consistent interface. Where appropriate we will either modify an existing 
> engine to work within this framework, or we will plug in a common default 
> engine. Therefore we also must normalize how security policy is expressed and 
> applied by each component. Core, HDFS, ZooKeeper, and HBase currently support 
> simple access control lists (ACLs) composed of users and groups. We see this 
> as a good starting point. Where necessary we will modify components so they 
> each offer equivalent functionality, and build support into others.
> 
> 3) Token-based authentication and single sign-on
> 
> Core, HDFS, ZooKeeper, and HBase currently support Kerberos authentication at 
> the RPC layer, via SASL. However, this does not provide valuable attributes 
> such as group membership, classification level, organizational identity, or 
> support for user-defined attributes. Hadoop components must query external 
> resources to discover these attributes, and at scale this is problematic. 
> There is also no consistent delegation model. HDFS has a simple delegation 
> capability, and only Oozie can take limited advantage of it. We will 
> implement a common token-based authentication framework to decouple internal 
> user and service authentication from the external mechanisms used to support 
> it (like Kerberos).
> 
> 4) Extend HBase support for ACLs to the cell level
> 
> Currently HBase supports setting access controls at the table or column 
> family level. However, many use cases would benefit from the additional 
> capability to do this on a per-cell basis. In fact, for many users dealing 
> with sensitive information, the ability to do this is crucial.
> 
> 5) Improve audit logging
> 
> Audit messages from the various Hadoop components do not use a unified or 
> even consistent format. This makes it difficult to analyze logs for verifying 
> compliance or taking corrective action. We will build a common audit logging 
> facility as part of the common authorization framework work. We will also 
> build a set of common audit log processing tools for transforming audit logs 
> to different industry-standard formats, for supporting compliance 
> verification, and for triggering responses to policy violations.
> 
> Current JIRAs:
> 
> As part of this ongoing effort we are contributing our work to date against 
> the JIRAs listed below. As you may appreciate, the goals for Project Rhino 
> cover a number of different Apache projects; the scope of work is 
> significant and likely to only increase as we get additional community input. 
> We also appreciate that others in the Apache community may be working on 
> some of this or may be interested in contributing to it. If so, we look 
> forward to partnering with you in Apache to accelerate this effort so the 
> Apache community can see the benefits of our collective efforts sooner. 
> You can also find a more detailed version of this announcement at Project 
> Rhino<https://github.com/intel-hadoop/project-rhino/>.
> 
> Please feel free to reach out to us by commenting on the JIRAs below:
> 
> HBASE-6222: Add per-KeyValue 
> Security<https://issues.apache.org/jira/browse/hbase-6222>
> 
> HADOOP-9331: Hadoop crypto codec framework and crypto codec 
> implementations<https://issues.apache.org/jira/browse/hadoop-9331> and 
> related sub-tasks
> 
> MAPREDUCE-5025: Key Distribution and Management for supporting crypto codec 
> in Map Reduce<https://issues.apache.org/jira/browse/mapreduce-5025> and 
> related JIRAs
> 
> HBASE-7544: Transparent table/CF 
> encryption<https://issues.apache.org/jira/browse/hbase-7544>
>
