[ https://issues.apache.org/jira/browse/HADOOP-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13455698#comment-13455698 ]

Steve Loughran commented on HADOOP-8803:
----------------------------------------

Are you proposing that Hadoop (more precisely HDFS) would run in public cloud 
infrastructure without any kind of network-layer protection? Application-level 
security isn't enough to prevent things like DDoS attacks, and running exposed 
opens up your entire cluster's dataset to 0-day exploits. 

Irrespective of what is done with Kerberos, byte-level security, etc., I would 
never tell anyone to bring up a Hadoop cluster on public infrastructure without 
isolating its network by way of iptables, a VPN or whatever -with proxies or 
access restricted to in-cluster hosts in a DMZ-style setup.

Some cloud infrastructures do let you specify the network structure (VMware- and 
VBox-based systems included), and you can do the same with KVM-based systems if 
the tooling is right (specifically the network drivers in the host system). 
Isolation must happen at this level, not at the app layer, because you can never 
be 100% sure that you've fixed all the security bugs.

Oh, and EC2 bills you for all network traffic that gets past the routing rules 
you've declared, so if you do have a wide-open system you get to pay a lot for 
all the traffic you are rejecting.

I can see that limiting the access of a TT & its spawned task(s) to the subset 
of a fileset they are working with is a good goal -but consider that the stream 
of work sent to a TT means that even a compromised machine gets at more data 
over time. If it's ruthless, it could signal fake completion events to pull in 
data faster than the MR job would otherwise feed it, so more work gets sent its 
way and it collects more data than the other machines.

You also need to consider that the blocks inside the DN could be compromised: 
they'd all have to be encrypted by whatever was writing them, with the keys to 
decrypt them passed down to the tasks. 
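
To make that concrete, here is a minimal, purely illustrative sketch of the kind 
of client-side encryption that implies: the writer encrypts with AES before the 
bytes ever reach a DataNode, and only a holder of the per-job key (handed to the 
tasks out of band, e.g. through whatever credential mechanism the job uses) can 
read them back. It uses plain javax.crypto streams rather than any real HDFS 
API; the class and method names are mine, not Hadoop's, and not necessarily what 
this patch does.

import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.CipherOutputStream;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;
import java.io.*;
import java.security.SecureRandom;

/** Hypothetical sketch: encrypt data before it is written to a DataNode,
 *  so a compromised DN only ever holds ciphertext. */
public class EncryptedBlockWriter {

    /** Encrypt 'plain' into 'encrypted' with AES/CTR; the IV is written first. */
    static void encrypt(InputStream plain, OutputStream encrypted, SecretKey key)
            throws Exception {
        byte[] iv = new byte[16];
        new SecureRandom().nextBytes(iv);
        encrypted.write(iv);                                  // the IV is not secret
        Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
        c.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
        try (OutputStream out = new CipherOutputStream(encrypted, c)) {
            plain.transferTo(out);
        }
    }

    /** Decrypt on the task side, given the per-job key passed down to the task. */
    static InputStream decrypt(InputStream encrypted, SecretKey key) throws Exception {
        byte[] iv = encrypted.readNBytes(16);
        Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
        c.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(iv));
        return new CipherInputStream(encrypted, c);
    }

    public static void main(String[] args) throws Exception {
        SecretKey jobKey = KeyGenerator.getInstance("AES").generateKey();

        // Writer side: what lands on "disk" (the DN, in the real system) is ciphertext.
        ByteArrayOutputStream stored = new ByteArrayOutputStream();
        encrypt(new ByteArrayInputStream("block contents".getBytes()), stored, jobKey);

        // Task side: only a holder of jobKey can recover the plaintext.
        try (InputStream in = decrypt(new ByteArrayInputStream(stored.toByteArray()), jobKey)) {
            System.out.println(new String(in.readAllBytes()));
        }
    }
}

The jobKey here stands in for whatever the trusted job-submission side would 
generate and hand only to the tasks that need it -the "keys passed down to the 
tasks" part above.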

In a cloud infrastructure, the tactic you'd adopt for security relies on VM 
images -you'd roll the VM back to the previous image regularly, either every 59 
minutes (cost-effective, since you're billed by the instance-hour anyway) or 
after every job. You need to think about DN decommissioning here too, but it's a 
better story -it's the standard tactic for defending VMs in the DMZ from being 
compromised for any extended period of time.
                
> Make Hadoop run more securely in a public cloud environment
> -----------------------------------------------------------
>
>                 Key: HADOOP-8803
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8803
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs, ipc, security
>    Affects Versions: 0.20.204.0
>            Reporter: Xianqing Yu
>              Labels: hadoop
>   Original Estimate: 2m
>  Remaining Estimate: 2m
>
> I am a Ph.D. student at North Carolina State University. I am modifying 
> Hadoop's code (covering most parts of Hadoop, e.g. JobTracker, TaskTracker, 
> NameNode, DataNode) to achieve better security.
>  
> My major goal is to make Hadoop run more securely in a Cloud environment, 
> especially a public Cloud environment. In order to achieve that, I redesign 
> the current security mechanism to achieve the following properties:
> 1. Bring byte-level access control to Hadoop HDFS. In 0.20.204, HDFS access 
> control works at user or block granularity, e.g. the HDFS Delegation Token 
> only checks whether a file can be accessed by a certain user, and the Block 
> Token only proves which block or blocks can be accessed. I make Hadoop able 
> to do byte-granularity access control, so each accessing party, user or task 
> process, can only access the minimum bytes it needs.
> 2. I assume that in the public Cloud environment, only the NameNode, 
> secondary NameNode, and JobTracker can be trusted. A large number of 
> DataNodes and TaskTrackers may be compromised, because some of them may be 
> running in a less secure environment. So I re-design the security mechanism 
> to minimize the damage an attacker can do.
>  
> a. Re-design the Block Access Token to solve HDFS's widely shared-key 
> problem. In the original Block Access Token design, all of HDFS (NameNode 
> and DataNodes) share one master key to generate Block Access Tokens; if one 
> DataNode is compromised, the attacker can obtain the key and generate any 
> Block Access Token he or she wants.
>  
> b. Re-design the HDFS Delegation Token to do fine-grained access control for 
> TaskTrackers and Map-Reduce Task processes on HDFS. 
>  
> In Hadoop 0.20.204, all TaskTrackers can use their Kerberos credentials to 
> access any MapReduce files on HDFS. So they have the same privileges as the 
> JobTracker to read or write tokens, copy job files, etc. However, if one of 
> them is compromised, everything critical in the MapReduce directory (job 
> files, Delegation Tokens) is exposed to the attacker. I solve the problem by 
> making the JobTracker decide which TaskTracker can access which file in the 
> MapReduce directory on HDFS.
>  
> For a Task process, once it gets an HDFS Delegation Token, it can access 
> everything belonging to the job or user on HDFS. In my design, it can only 
> access the bytes it needs from HDFS.
>  
> There are some other security improvements, such as that a TaskTracker 
> cannot learn information like the blockID from the Block Token (because it 
> is encrypted in my scheme), and HDFS can optionally set up a secure channel 
> for sending data.
>  
> With those features, Hadoop can run much more securely in an uncertain 
> environment such as a public Cloud. I have already started testing my 
> prototype. I want to know whether the community is interested in my work. Is 
> it valuable work to contribute to production Hadoop?
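
As a purely illustrative aside on point (a) in the quoted description: one 
standard way to avoid a single widely shared token key is for the trusted party 
(the NameNode here) to keep the master key to itself and hand each DataNode only 
a key derived from it, so a compromised DataNode can forge or verify tokens only 
for itself. The sketch below shows that derivation with plain JDK HMAC 
primitives; the class and method names are invented for the example and are not 
Hadoop's, nor necessarily the design this issue proposes.

import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Arrays;

/** Illustrative only: per-DataNode token keys derived from a master key
 *  that never leaves the trusted NameNode. */
public class PerDataNodeTokenKeys {

    private final byte[] masterKey = new byte[32];   // held only by the NameNode

    public PerDataNodeTokenKeys() {
        new SecureRandom().nextBytes(masterKey);
    }

    /** keyForDataNode = HMAC(masterKey, dataNodeId); given to that DataNode only. */
    public byte[] keyForDataNode(String dataNodeId) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(masterKey, "HmacSHA256"));
        return mac.doFinal(dataNodeId.getBytes(StandardCharsets.UTF_8));
    }

    /** A block token bound to one DataNode: HMAC over the token fields with that DN's key. */
    public static byte[] signToken(byte[] dataNodeKey, String user, long blockId) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(dataNodeKey, "HmacSHA256"));
        return mac.doFinal((user + ":" + blockId).getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws Exception {
        PerDataNodeTokenKeys nameNode = new PerDataNodeTokenKeys();
        byte[] dn1Key = nameNode.keyForDataNode("dn1");
        byte[] dn2Key = nameNode.keyForDataNode("dn2");

        // A token the NameNode issued for a block served by dn1:
        byte[] token = signToken(dn1Key, "alice", 42L);

        // dn1 can verify it with its own key; a stolen dn2 key is useless for it.
        System.out.println("dn1 accepts: " + Arrays.equals(token, signToken(dn1Key, "alice", 42L)));
        System.out.println("dn2 accepts: " + Arrays.equals(token, signToken(dn2Key, "alice", 42L)));
    }
}

With per-node keys, stealing one DataNode's key still exposes the blocks that 
node serves, but it no longer lets the attacker mint tokens that other 
DataNodes will accept.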

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
