[jira] [Commented] (SPARK-5158) Allow for keytab-based HDFS security in Standalone mode

2018-06-19 Thread Jelmer Kuperus (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16516856#comment-16516856 ]

Jelmer Kuperus commented on SPARK-5158:
---

I ended up with the following workaround, which at first glance seems to work:

1. Create a `.java.login.config` file in the home directory of the user 
running Spark, with the following contents:


{noformat}
com.sun.security.jgss.krb5.initiate {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  useTicketCache=true
  ticketCache="/tmp/krb5cc_0"
  keyTab="/path/to/my.keytab"
  principal="u...@foo.com";
};
{noformat}

2. Put a krb5.conf file at /etc/krb5.conf.

3. Place your Hadoop configuration in /etc/hadoop/conf, and in `core-site.xml` 
set (a sketch of the resulting file is below): 
 * fs.defaultFS to webhdfs://your_hostname:14000/webhdfs/v1
 * hadoop.security.authentication to kerberos
 * hadoop.security.authorization to true
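
For reference, a sketch of the resulting `core-site.xml` (hostname and port 
are placeholders for your environment):

{noformat}
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>webhdfs://your_hostname:14000/webhdfs/v1</value>
  </property>
  <property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
  </property>
  <property>
    <name>hadoop.security.authorization</name>
    <value>true</value>
  </property>
</configuration>
{noformat}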

4. Make sure the Hadoop config is on the classpath of Spark. E.g. the 
process should have something like this in it:
{noformat}
-cp /etc/spark/:/usr/share/spark/jars/*:/etc/hadoop/conf/{noformat}
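
If you don't control the launch command directly, one way to get it there (a 
sketch; I haven't verified this variant) is via the extraClassPath settings 
in `spark-defaults.conf`:

{noformat}
spark.driver.extraClassPath   /etc/hadoop/conf
spark.executor.extraClassPath /etc/hadoop/conf
{noformat}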
 

 

> Allow for keytab-based HDFS security in Standalone mode
> ---
>
> Key: SPARK-5158
> URL: https://issues.apache.org/jira/browse/SPARK-5158
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Matthew Cheah
>Priority: Critical
>
> There have been a handful of patches for allowing access to Kerberized HDFS 
> clusters in standalone mode. The main reason we haven't accepted these 
> patches has been that they rely on insecure distribution of token files from 
> the driver to the other components.
> As a simpler solution, I wonder if we should just provide a way to have the 
> Spark driver and executors independently log in and acquire credentials using 
> a keytab. This would work for users who have dedicated, single-tenant 
> Spark clusters (i.e. they are willing to have a keytab on every machine 
> running Spark for their application). It wouldn't address all possible 
> deployment scenarios, but if it's simple I think it's worth considering.
> This would also work for Spark streaming jobs, which often run on dedicated 
> hardware since they are long-running services.






[jira] [Commented] (SPARK-5158) Allow for keytab-based HDFS security in Standalone mode

2017-04-04 Thread Apache Spark (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15955564#comment-15955564 ]

Apache Spark commented on SPARK-5158:
-

User 'themodernlife' has created a pull request for this issue:
https://github.com/apache/spark/pull/17530







[jira] [Commented] (SPARK-5158) Allow for keytab-based HDFS security in Standalone mode

2017-04-04 Thread Ian Hummel (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15955528#comment-15955528 ]

Ian Hummel commented on SPARK-5158:
---

At Bloomberg we've been working on a solution to this issue so we can access 
kerberized HDFS clusters from standalone Spark installations we run on our 
internal cloud infrastructure.

Folks who are interested can try out a patch at 
https://github.com/themodernlife/spark/tree/spark-5158.  It extends standalone 
mode to support configuration related to {{--principal}} and {{--keytab}}.

The main changes are:
- Refactor {{ConfigurableCredentialManager}} and related 
{{CredentialProviders}} so that they are no longer tied to YARN
- Set up credential renewal/updating from within the 
{{StandaloneSchedulerBackend}} (a rough sketch of the general pattern follows 
this list)
- Ensure executors/drivers are able to find initial tokens for contacting HDFS 
and renew them at regular intervals
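
To be clear, the following is not code from the patch, just a minimal sketch 
of the keytab login/renewal pattern involved, written against the stock 
Hadoop UGI API (the object name and renewal interval are made up):

{code}
import java.util.concurrent.{Executors, TimeUnit}

import org.apache.hadoop.security.UserGroupInformation

// Hypothetical helper: log in from a keytab once, then re-login periodically
// so a long-running driver/executor keeps a valid TGT.
object KeytabRelogin {
  def start(principal: String, keytab: String): Unit = {
    UserGroupInformation.loginUserFromKeytab(principal, keytab)
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    scheduler.scheduleAtFixedRate(new Runnable {
      // No-op while the TGT is still fresh; re-logins from the keytab otherwise.
      def run(): Unit =
        UserGroupInformation.getLoginUser.checkTGTAndReloginFromKeytab()
    }, 1, 1, TimeUnit.HOURS)
  }
}
{code}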

The implementation does basically the same thing as the YARN backend.  The 
keytab is copied to the driver/executors through an environment variable in 
the {{ApplicationDescription}}.  I might be wrong, but I'm assuming a proper 
{{spark.authenticate}} setup would ensure it's encrypted over the wire (can 
anyone confirm?).  Credentials on the executors and the driver (cluster mode) 
are written to disk as whatever user the Spark daemon runs as.  Open to 
suggestions on whether it's worth tightening that up.

Would appreciate any feedback from the community.







[jira] [Commented] (SPARK-5158) Allow for keytab-based HDFS security in Standalone mode

2017-01-04 Thread Ruslan Dautkhanov (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15800298#comment-15800298 ]

Ruslan Dautkhanov commented on SPARK-5158:
--

I think one reason for that could be that one user can submit multiple Spark 
jobs with different --principal and --keytab parameters, and they would all 
run under those credentials at the same time.







[jira] [Commented] (SPARK-5158) Allow for keytab-based HDFS security in Standalone mode

2016-10-21 Thread Ian Hummel (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15595524#comment-15595524 ]

Ian Hummel commented on SPARK-5158:
---

I'm running into this now and have done some digging.  My setup is

- Small, dedicated standalone Spark cluster
-- using spark.authenticate.secret
-- spark-env.sh sets HADOOP_CONF_DIR correctly on each node
-- core-site.xml has
--- hadoop.security.authentication = kerberos
--- hadoop.security.authorization = true
- Kerberized HDFS cluster

Reading and writing to HDFS in local mode works fine, provided I have run 
{{kinit}} beforehand.  Running a distributed job via the standalone cluster 
does not, seemingly because clients connecting to standalone clusters don't 
attempt to fetch/forward HDFS delegation tokens.

What I had hoped would work is ssh'ing onto each standalone worker node 
individually and running kinit out-of-process before submitting my job.  I 
figured that since the executors are launched as my unix user, they would 
inherit my kerberos context and be able to talk to HDFS, just as they can in 
local mode. 

I verified with a debugger that the {{UserGroupInformation}} in the worker JVMs 
correctly picks up the fact that the user the process is running as can access 
the kerberos ticket cache.

But it still doesn't work.

The reason is that the executor process ({{CoarseGrainedExecutorBackend}}) does 
something like this:

{code}
SparkHadoopUtil.get.runAsSparkUser { () =>
  ...
  env.rpcEnv.setupEndpoint("Executor", new CoarseGrainedExecutorBackend(
    env.rpcEnv, driverUrl, executorId, hostname, cores, userClassPath, env))
  ...
}
{code}

{{runAsSparkUser}} does this:

{code}
def runAsSparkUser(func: () => Unit) {
  val user = Utils.getCurrentUserName()
  logDebug("running as user: " + user)
  val ugi = UserGroupInformation.createRemoteUser(user)
  transferCredentials(UserGroupInformation.getCurrentUser(), ugi)
  ugi.doAs(new PrivilegedExceptionAction[Unit] {
    def run: Unit = func()
  })
}
{code}

{{createRemoteUser}} does this:

{code}
public static UserGroupInformation createRemoteUser(String user) {
  if (user == null || user.isEmpty()) {
    throw new IllegalArgumentException("Null user");
  }
  Subject subject = new Subject();
  subject.getPrincipals().add(new User(user));
  UserGroupInformation result = new UserGroupInformation(subject);
  result.setAuthenticationMethod(AuthenticationMethod.SIMPLE);
  return result;
}
{code}

So effectively, if we had an HDFS delegation token, we would have copied it 
over in {{transferCredentials}}, but since there is no way for the client to 
include one when the task is submitted over the wire, we are creating a 
_blank_ UGI from scratch and losing the Kerberos context.  Subsequent calls to 
HDFS are attempted with "simple" authentication and everything fails.
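
To make that concrete, here is a minimal sketch (plain Hadoop UGI API, not 
Spark code; it assumes hadoop.security.authentication=kerberos and a fresh 
{{kinit}}) of how {{createRemoteUser}} drops the Kerberos context that 
{{getCurrentUser}} picks up from the ticket cache:

{code}
import org.apache.hadoop.security.UserGroupInformation

// Backed by the ticket cache: reports KERBEROS.
val current = UserGroupInformation.getCurrentUser()
println(current.getAuthenticationMethod)

// What runAsSparkUser effectively does: a fresh Subject with no Kerberos
// credentials, so downstream HDFS calls fall back to "simple" auth.
val remote = UserGroupInformation.createRemoteUser(current.getShortUserName())
println(remote.getAuthenticationMethod)  // SIMPLE
{code}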


One workaround is to obtain an HDFS delegation token out of band, store it in 
a file, make it available on all worker nodes, and then ensure executors are 
launched with {{HADOOP_TOKEN_FILE_LOCATION}} set.  To be more 
specific:

On the client machine:
- ensure {{core-site.xml}}, {{hdfs-site.xml}} and {{yarn-site.xml}} are 
configured properly
- ensure {{HADOOP_CONF_DIR}} is set
- run {{spark-submit --class 
org.apache.hadoop.hdfs.tools.DelegationTokenFetcher "" --renewer null 
/nfs/path/to/TOKEN}}

On worker machines:
- ensure {{/nfs/path/to/TOKEN}} is readable

On the client machine:
- submit the job, adding {{--conf 
"spark.executorEnv.HADOOP_TOKEN_FILE_LOCATION=/nfs/path/to/TOKEN"}}

There are obviously issues with this in terms of expiration, renewal, etc... 
just wanted to mention it for the record.


Another workaround is a build of Spark which simply comments out the 
{{runAsSparkUser}} call.  In this case users can have a cron job run kinit in 
the background (using a keytab), and the spawned executors will use the 
inherited kerberos context to talk to HDFS.
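
The cron entry would be the usual kinit-from-keytab pattern, something like 
this (keytab path and principal are placeholders):

{noformat}
# refresh the ticket cache from the keytab every 6 hours
0 */6 * * * kinit -kt /path/to/my.keytab myprincipal@MY.REALM
{noformat}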

It seems like {{CoarseGrainedExecutorBackend}} is also used by Mesos, and I 
noticed SPARK-12909. If security doesn't even work for Mesos or Standalone, 
why do we even try the {{runAsSparkUser}} call?  It honestly seems like there 
is no reason for it... proxy users are not useful outside of a kerberized 
context (right?).  There is no real secured user identity when running as a 
standalone cluster (or, from what I can tell, when running under Mesos), only 
that which comes from whatever unix user the workers are running as.

As it stands, we actually _deescalate_ that user's privileges (by wiping the 
kerberos context).  Shouldn't we just keep them as they are?  That would make 
it a lot easier for standalone clusters to interact with a kerberized HDFS.

I know this ticket is more about forwarding keytabs to the executors, but the 
scenario outlined above also gets to that use case.

Thoughts?


[jira] [Commented] (SPARK-5158) Allow for keytab-based HDFS security in Standalone mode

2016-02-16 Thread Henry Saputra (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149309#comment-15149309 ]

Henry Saputra commented on SPARK-5158:
--

All, the PRs for this issue are closed.

This PR: 
https://github.com/apache/spark/pull/265 

was closed claiming there is a more recent PR being worked on, which I assume 
is this one:

https://github.com/apache/spark/pull/4106

but that one was also closed, due to inactivity.

Looking at the issues filed that were closed as duplicates of this one, there 
is a need and interest in getting standalone mode to access secured HDFS, 
given that the active user's keytab is already available on the machines that 
run Spark.







[jira] [Commented] (SPARK-5158) Allow for keytab-based HDFS security in Standalone mode

2015-01-19 Thread Apache Spark (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14282951#comment-14282951 ]

Apache Spark commented on SPARK-5158:
-

User 'mccheah' has created a pull request for this issue:
https://github.com/apache/spark/pull/4106



