[jira] [Commented] (SPARK-5158) Allow for keytab-based HDFS security in Standalone mode
[ https://issues.apache.org/jira/browse/SPARK-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16516856#comment-16516856 ]

Jelmer Kuperus commented on SPARK-5158:
---

I ended up with the following workaround, which at first glance seems to work:

1. Create a `.java.login.config` file in the home directory of the Spark user with the following contents:
{noformat}
com.sun.security.jgss.krb5.initiate {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  useTicketCache="true"
  ticketCache="/tmp/krb5cc_0"
  keyTab="/path/to/my.keytab"
  principal="u...@foo.com";
};
{noformat}
2. Put a krb5.conf file in /etc/krb5.conf.
3. Place your Hadoop configuration in /etc/hadoop/conf and in `core-site.xml` set:
* fs.defaultFS to webhdfs://your_hostname:14000/webhdfs/v1
* hadoop.security.authentication to kerberos
* hadoop.security.authorization to true
4. Make sure the Hadoop config is on the classpath of Spark, e.g. the process should have something like this in it:
{noformat}
-cp /etc/spark/:/usr/share/spark/jars/*:/etc/hadoop/conf/
{noformat}

> Allow for keytab-based HDFS security in Standalone mode
> ---
>
> Key: SPARK-5158
> URL: https://issues.apache.org/jira/browse/SPARK-5158
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Reporter: Patrick Wendell
> Assignee: Matthew Cheah
> Priority: Critical
>
> There have been a handful of patches for allowing access to Kerberized HDFS
> clusters in standalone mode. The main reason we haven't accepted these
> patches has been that they rely on insecure distribution of token files from
> the driver to the other components.
>
> As a simpler solution, I wonder if we should just provide a way to have the
> Spark driver and executors independently log in and acquire credentials using
> a keytab. This would work for users who have dedicated, single-tenant
> Spark clusters (i.e. they are willing to have a keytab on every machine
> running Spark for their application). It wouldn't address all possible
> deployment scenarios, but if it's simple I think it's worth considering.
> This would also work for Spark Streaming jobs, which often run on dedicated
> hardware since they are long-running services.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
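The JAAS portion of the workaround above can be sketched as a small script (a sketch only: the principal and keytab path are placeholders, since the real principal was elided in the comment; the classpath fragment is reproduced as a reminder, not executed):

```shell
# Step 1: write the JAAS login config the JVM picks up from
# ~/.java.login.config. Principal and keytab path are placeholders.
cat > "${HOME}/.java.login.config" <<'EOF'
com.sun.security.jgss.krb5.initiate {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  useTicketCache="true"
  ticketCache="/tmp/krb5cc_0"
  keyTab="/path/to/my.keytab"
  principal="user@EXAMPLE.COM";
};
EOF

# Step 4 (for reference): the Hadoop conf dir must end up on the Spark
# daemon's classpath, e.g.
#   -cp /etc/spark/:/usr/share/spark/jars/*:/etc/hadoop/conf/
echo "wrote ${HOME}/.java.login.config"
```

Steps 2 and 3 (krb5.conf and core-site.xml) are cluster-specific and omitted here.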
[jira] [Commented] (SPARK-5158) Allow for keytab-based HDFS security in Standalone mode
[ https://issues.apache.org/jira/browse/SPARK-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15955564#comment-15955564 ]

Apache Spark commented on SPARK-5158:
---

User 'themodernlife' has created a pull request for this issue: https://github.com/apache/spark/pull/17530
[jira] [Commented] (SPARK-5158) Allow for keytab-based HDFS security in Standalone mode
[ https://issues.apache.org/jira/browse/SPARK-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15955528#comment-15955528 ]

Ian Hummel commented on SPARK-5158:
---

At Bloomberg we've been working on a solution to this issue so we can access kerberized HDFS clusters from standalone Spark installations we run on our internal cloud infrastructure.

Folks who are interested can try out a patch at https://github.com/themodernlife/spark/tree/spark-5158. It extends standalone mode to support the configuration related to {{--principal}} and {{--keytab}}. The main changes are:
- refactor {{ConfigurableCredentialManager}} and the related {{CredentialProviders}} so that they are no longer tied to YARN
- set up credential renewal/updating from within the {{StandaloneSchedulerBackend}}
- ensure executors/drivers are able to find initial tokens for contacting HDFS and renew them at regular intervals

The implementation does basically the same thing as the YARN backend. The keytab is copied to the driver/executors through an environment variable in the {{ApplicationDescription}}. I might be wrong, but I'm assuming a proper {{spark.authenticate}} setup would ensure it's encrypted over the wire (can anyone confirm?). Credentials on the executors and the driver (cluster mode) are written to disk as whatever user the Spark daemon runs as. Open to suggestions on whether it's worth tightening that up.

Would appreciate any feedback from the community.
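Against a build of that branch, submission would presumably look like the following (a hedged sketch: the master URL, principal, keytab path, and job class are placeholders; the {{--principal}}/{{--keytab}} flags mirror the ones YARN mode already accepts, which the patch extends to standalone):

```shell
# Hypothetical invocation against the patched standalone build.
# All values below are placeholders, not taken from the patch itself.
spark-submit \
  --master spark://master-host:7077 \
  --principal user@EXAMPLE.COM \
  --keytab /path/to/user.keytab \
  --conf spark.authenticate=true \
  --class com.example.MyJob \
  my-job.jar
```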
[jira] [Commented] (SPARK-5158) Allow for keytab-based HDFS security in Standalone mode
[ https://issues.apache.org/jira/browse/SPARK-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15800298#comment-15800298 ]

Ruslan Dautkhanov commented on SPARK-5158:
---

I think one reason for that could be that one user can submit multiple Spark jobs with different --principal and --keytab parameters, and they would run under those credentials at the same time.
[jira] [Commented] (SPARK-5158) Allow for keytab-based HDFS security in Standalone mode
[ https://issues.apache.org/jira/browse/SPARK-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15595524#comment-15595524 ]

Ian Hummel commented on SPARK-5158:
---

I'm running into this now and have done some digging. My setup is:
- Small, dedicated standalone Spark cluster
-- using spark.authenticate.secret
-- spark-env.sh sets HADOOP_CONF_DIR correctly on each node
-- core-site.xml has
--- hadoop.security.authentication = kerberos
--- hadoop.security.authorization = true
- Kerberized HDFS cluster

Reading and writing to HDFS in local mode works fine, provided I have run {{kinit}} beforehand. Running a distributed job via the standalone cluster does not, seemingly because clients connecting to standalone clusters don't attempt to fetch/forward HDFS delegation tokens.

What I had hoped would work is ssh'ing onto each standalone worker node individually and running kinit out-of-process before submitting my job. I figured that since the executors are launched as my unix user, they would inherit my Kerberos context and be able to talk to HDFS, just as they can in local mode. I verified with a debugger that the {{UserGroupInformation}} in the worker JVMs correctly picks up the fact that the user the process is running as can access the Kerberos ticket cache. But it still doesn't work.

The reason is that the executor process ({{CoarseGrainedExecutorBackend}}) does something like this:
{code}
SparkHadoopUtil.get.runAsSparkUser { () =>
  ...
  env.rpcEnv.setupEndpoint("Executor", new CoarseGrainedExecutorBackend(
    env.rpcEnv, driverUrl, executorId, hostname, cores, userClassPath, env))
  ...
}
{code}

{{runAsSparkUser}} does this:
{code}
def runAsSparkUser(func: () => Unit) {
  val user = Utils.getCurrentUserName()
  logDebug("running as user: " + user)
  val ugi = UserGroupInformation.createRemoteUser(user)
  transferCredentials(UserGroupInformation.getCurrentUser(), ugi)
  ugi.doAs(new PrivilegedExceptionAction[Unit] {
    def run: Unit = func()
  })
}
{code}

{{createRemoteUser}} does this:
{code}
public static UserGroupInformation createRemoteUser(String user) {
  if (user == null || user.isEmpty()) {
    throw new IllegalArgumentException("Null user");
  }
  Subject subject = new Subject();
  subject.getPrincipals().add(new User(user));
  UserGroupInformation result = new UserGroupInformation(subject);
  result.setAuthenticationMethod(AuthenticationMethod.SIMPLE);
  return result;
}
{code}

So effectively, if we had an HDFS delegation token we would have copied it over in {{transferCredentials}}, but since there is no way for the client to include one when the task is submitted over the wire, we are creating a _blank_ UGI from scratch and losing the Kerberos context. Subsequent calls to HDFS are attempted with "simple" authentication and everything fails.

One workaround is to obtain an HDFS delegation token out of band, store it in a file, make it available on all worker nodes, and then ensure executors are launched with {{HADOOP_TOKEN_FILE_LOCATION}} set. To be more specific:

On the client machine:
- ensure {{core-site.xml}}, {{hdfs-site.xml}} and {{yarn-site.xml}} are configured properly
- ensure {{HADOOP_CONF_DIR}} is set
- run {{spark-submit --class org.apache.hadoop.hdfs.tools.DelegationTokenFetcher "" --renewer null /nfs/path/to/TOKEN}}

On the worker machines:
- ensure {{/nfs/path/to/TOKEN}} is readable

On the client machine:
- submit the job adding {{--conf "spark.executorEnv.HADOOP_TOKEN_FILE_LOCATION=/nfs/path/to/TOKEN"}}

There are obviously issues with this in terms of expiration, renewal, etc.; just wanted to mention it for the record.
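Put together, the out-of-band token workaround above looks something like this (a sketch only: it needs a reachable kerberized HDFS and a valid ticket, the job class and jar are hypothetical, and the NFS path is the one from the comment):

```shell
# On the client: fetch an HDFS delegation token into a file on shared
# storage (requires a valid Kerberos ticket, i.e. run kinit first).
spark-submit --class org.apache.hadoop.hdfs.tools.DelegationTokenFetcher \
  "" --renewer null /nfs/path/to/TOKEN

# On each worker: verify the token file is readable by the daemon user.
ls -l /nfs/path/to/TOKEN

# Back on the client: point executors at the token file when submitting.
# com.example.MyJob and my-job.jar are placeholders.
spark-submit \
  --conf "spark.executorEnv.HADOOP_TOKEN_FILE_LOCATION=/nfs/path/to/TOKEN" \
  --class com.example.MyJob my-job.jar
```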
Another workaround is a build of Spark which simply comments out the {{runAsSparkUser}} call. In that case users can have a cron job run kinit in the background (using a keytab), and the spawned executor will use the inherited Kerberos context to talk to HDFS.

It seems like {{CoarseGrainedExecutorBackend}} is also used by Mesos, and I noticed SPARK-12909. If security doesn't even work for Mesos or Standalone, why do we even try the {{runAsSparkUser}} call? There honestly seems to be no reason for it: proxy users are not useful outside of a kerberized context (right?). There is no real secured user identity when running as a standalone cluster (or, from what I can tell, when running under Mesos), only that which comes from whatever unix user the workers are running as. As it stands, we actually _deescalate_ that user's privileges (by wiping the Kerberos context). Shouldn't we just keep them as they are? That would make it a lot easier for standalone clusters to interact with a kerberized HDFS.

I know this ticket is more about forwarding keytabs to the executors, but the scenario outlined above also gets to that use case. Thoughts?
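The cron-based kinit workaround mentioned above could be as simple as the following crontab entry (keytab path, principal, and the 8-hour interval are placeholders; this assumes the ticket lifetime comfortably exceeds the refresh interval):

```shell
# Hypothetical crontab entry for the executor's unix user: refresh the
# ticket cache from a keytab every 8 hours so spawned executors inherit
# a live Kerberos context.
0 */8 * * * kinit -kt /path/to/my.keytab user@EXAMPLE.COM
```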
[jira] [Commented] (SPARK-5158) Allow for keytab-based HDFS security in Standalone mode
[ https://issues.apache.org/jira/browse/SPARK-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149309#comment-15149309 ]

Henry Saputra commented on SPARK-5158:
---

All, the PRs for this issue are closed. This PR: https://github.com/apache/spark/pull/265 was closed claiming a more recent PR was being worked on, which I assume is this one: https://github.com/apache/spark/pull/4106, but that one was also closed due to inactivity.

Looking at the issues filed and closed as duplicates of this one, there is a need and interest in getting standalone mode to access secured HDFS, given the active user's keytab is already available on the machines that run Spark.
[jira] [Commented] (SPARK-5158) Allow for keytab-based HDFS security in Standalone mode
[ https://issues.apache.org/jira/browse/SPARK-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14282951#comment-14282951 ]

Apache Spark commented on SPARK-5158:
---

User 'mccheah' has created a pull request for this issue: https://github.com/apache/spark/pull/4106