[ https://issues.apache.org/jira/browse/YARN-7468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16246626#comment-16246626 ]
Clay B. commented on YARN-7468:
-------------------------------

For the driving use case, I run secure clusters (secured on the inside to keep data from leaking back out); think of them as a drop box where users can build models with restricted data. (Or, my favorite analogy is a [glovebox|https://en.wikipedia.org/wiki/File:Vacuum_Dry_Box.jpg] -- things can go in, but once in they may be tainted and can't come out except by very special decontamination.) As such, I need to ensure that, network-wise, the cluster is reachable from/to the local HDFS clusters, HBase, databases, etc. Yet only users permissioned for data-ingest jobs should reach out and pull data. We can vet, for example, Oozie jobs to ensure they do only what we expect, but how do we keep a user from reaching out to the same HBase or HDFS (when they otherwise have access) and storing data there (or how do we allow a user to push reports to a simple service)? Ideally, I'd have all the external endpoints secured to disallow this cluster from talking back except for very fine-grained allowances -- but it's a big world and I can't. So, I'd like a way to set up firewall-rule equivalents with some help from YARN on the secure cluster.

The process I have in mind looks like the following workflow (a rough sketch of what the rules might look like follows step 2C):

1A. We would set up iptables rules statically beforehand to ensure traffic for the various YARN-agreed-upon cgroup contexts, bridge devices or network namespaces[1] could only flow where we want; we'd do this via out-of-band configuration management -- no need for YARN to do this setup.

1B. A user interactively logging onto a machine would be placed into a default cgroup/network namespace so they are strictly limited. They would only be permitted to talk to the local YARN RM, HDFS namenodes, datanodes and Oozie for job submission. (This would prevent outbound scp and allow them only to submit a job or view logs.) This too would be configured via our out-of-band configuration management.

2. Then, when a user submits a job, YARN would set up the OS control (cgroup, network namespace or bridge interface) for those processes to match the user's name, a queue or some other deterministic handle. (We would use that handle in our configuration-managed matching iptables rules, which would be pre-configured.)

2A. An ingest user for a particular database would be permissioned to reach out to a remote database, to the local HDFS to write data, and to the necessary YARN ports. (All external YARN jobs should get strict review, but even if we did not strictly review, connections could only flow to that one remote location -- that one database and what that one role account could read -- likely data from only one database.)

2B. A role account or human account running ETL and ad-hoc intra-cluster jobs would not be allowed to talk off the cluster. (Jobs could be arbitrary and unreviewed -- but host-based network control -- a software firewall -- would limit that one user; yea!)

2C. An egress user responsible for writing scrubbed data back out (e.g. reports) could reach out to a specific remote service endpoint to publish data, to the local HDFS and to YARN. (All jobs should again get strict review, but the network controls would ensure data leakage from this account was limited to that one service and what that one role account could read on HDFS.)
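To make this concrete, here is a minimal sketch of one possible realization, assuming YARN tags each container's processes with a per-user {{net_cls}} classid that iptables can match on (network namespaces or bridges would work analogously). All classids, usernames and addresses below are hypothetical, the service ports are just the stock Hadoop/Oozie defaults, and the classid values are something our configuration management and YARN would have to agree on beforehand:

{code}
# -- Step 2: at container launch, YARN (or the container-executor) would place
# the container into a per-user net_cls cgroup whose classid is derived
# deterministically from the user/queue. Values here are illustrative only.
mkdir -p /sys/fs/cgroup/net_cls/hadoop-yarn/ingest_user
echo 0x00100001 > /sys/fs/cgroup/net_cls/hadoop-yarn/ingest_user/net_cls.classid
echo "$CONTAINER_PID" > /sys/fs/cgroup/net_cls/hadoop-yarn/ingest_user/cgroup.procs

# -- Steps 1A/1B: static, configuration-managed rules. Interactive logins may
# only reach the local RM, NN, secure DN and Oozie (8032/8020/1004/11000 are
# the usual defaults; 10.0.0.0/24 stands in for the cluster subnet).
iptables -A OUTPUT -m owner --uid-owner some_human -d 10.0.0.0/24 -p tcp \
  -m multiport --dports 8032,8020,1004,11000 -j ACCEPT
iptables -A OUTPUT -m owner --uid-owner some_human -j REJECT

# -- Step 2A: the ingest role account's containers may additionally reach the
# one remote database (192.0.2.10 here) plus the cluster itself.
iptables -A OUTPUT -m cgroup --cgroup 0x00100001 -d 192.0.2.10 -p tcp --dport 5432 -j ACCEPT
iptables -A OUTPUT -m cgroup --cgroup 0x00100001 -d 10.0.0.0/24 -j ACCEPT
iptables -A OUTPUT -m cgroup --cgroup 0x00100001 -j REJECT

# -- Step 2B: ETL/ad-hoc accounts stay strictly intra-cluster.
iptables -A OUTPUT -m cgroup --cgroup 0x00100002 ! -d 10.0.0.0/24 -j REJECT

# -- Step 3B below: audit which classid (hence which user/queue) is hammering
# a remote service, so the offending job can be throttled rather than killed.
iptables -A OUTPUT -d 198.51.100.5 -j LOG --log-prefix "yarn-flow: "
{code}

The only YARN-side requirement in this sketch is the deterministic classid; everything else stays in the operator's out-of-band configuration management.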
3. Other uses could also build on this technique:

3A. YARN already uses cgroups for traffic shaping, using {{tc}} to shape a container's traffic; see the JIRAs around YARN-2140.

3B. In general, we could audit which traffic comes from which users and act only on the bad flows, or bill back for network usage. Today, if a pathological application reaches out to a service and knocks it down, I only know the machines involved and have to correlate {{netstat}} output to see which user that is (or hope I have a strong correlation)[2]. With OS-level network control, I can ask the host-based firewall to log which users/devices (namespaces, bridges, etc.) are talking to that service's IP (e.g. the LOG rule in the sketch above), learn who is running the pathological job, and throttle it as opposed to killing it.

This is not a request for full-scale software-defined-networking integration into YARN. For example, I suspect many YARN operators would not have the organizational support or man-power to integrate something like the [Cloud Native Computing Foundation's Container Network Interface|https://github.com/containernetworking/cni/blob/master/SPEC.md] via [Project Calico|https://www.projectcalico.org/]. The hope is that this brings the "policy-driven network security" aspect of those projects within reach of those who operate their own YARN clusters and the underlying OS.

[1]: http://blog.oddbit.com/2014/08/11/four-ways-to-connect-a-docker/
[2]: In all fairness, I could use [{{tcpspy}}|https://directory.fsf.org/wiki/Tcpspy] and have it record the PID of processes today too.

> Provide means for container network policy control
> --------------------------------------------------
>
>                 Key: YARN-7468
>                 URL: https://issues.apache.org/jira/browse/YARN-7468
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>            Reporter: Clay B.
>            Priority: Minor
>
> To prevent data exfiltration from a YARN cluster, it would be very helpful to have "firewall" rules able to map to a user/queue's containers.