[ https://issues.apache.org/jira/browse/SPARK-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15955528#comment-15955528 ]

Ian Hummel edited comment on SPARK-5158 at 4/4/17 6:07 PM:
-----------------------------------------------------------

At Bloomberg we've been working on a solution to this issue so that we can 
access Kerberized HDFS clusters from the standalone Spark installations we run 
on our internal cloud infrastructure.

Folks who are interested can try out a patch at 
https://github.com/themodernlife/spark/tree/spark-5158.  It extends standalone 
mode to support the {{--principal}} and {{--keytab}} options.
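
For reference, a submission against a standalone master would look something 
like this ({{--principal}} and {{--keytab}} are the standard spark-submit 
flags; the host name, principal, paths and class below are placeholders):

{code}
# Hypothetical invocation, assuming the patch is applied.
spark-submit \
  --master spark://master.example.com:7077 \
  --deploy-mode cluster \
  --principal spark-user@EXAMPLE.COM \
  --keytab /etc/security/keytabs/spark-user.keytab \
  --class com.example.MyApp \
  my-app.jar
{code}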

The main changes are:
- Refactor {{ConfigurableCredentialManager}} and the related 
{{CredentialProviders}} so that they are no longer tied to YARN
- Set up credential renewal/updating from within the 
{{StandaloneSchedulerBackend}} (see the sketch below)
- Ensure executors/drivers can find the initial tokens for contacting HDFS 
and renew them at regular intervals
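
To give a feel for the renewal piece, here's a minimal sketch of the kind of 
loop involved, built only on stock Hadoop security APIs; the object and method 
names are mine, not the actual classes in the patch:

{code}
import java.util.concurrent.{Executors, TimeUnit}

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.security.{Credentials, UserGroupInformation}

object CredentialRenewalSketch {
  def startRenewal(principal: String, keytab: String, intervalSec: Long): Unit = {
    // Log in once from the keytab; subsequent HDFS calls run as this user.
    UserGroupInformation.loginUserFromKeytab(principal, keytab)

    val scheduler = Executors.newSingleThreadScheduledExecutor()
    scheduler.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = {
        val ugi = UserGroupInformation.getLoginUser
        // Re-acquire the TGT from the keytab when it's close to expiring.
        ugi.checkTGTAndReloginFromKeytab()
        // Fetch fresh HDFS delegation tokens and attach them to the login
        // user so in-flight FileSystem instances keep working.
        val creds = new Credentials()
        FileSystem.get(new Configuration()).addDelegationTokens(principal, creds)
        ugi.addCredentials(creds)
      }
    }, intervalSec, intervalSec, TimeUnit.SECONDS)
  }
}
{code}

The branch wires this kind of logic into the scheduler backend and pushes 
updated tokens out to the executors, but the UGI/token mechanics are the same.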

The implementation does basically the same thing as the YARN backend.  The 
keytab is copied to the driver/executors through an environment variable in 
the {{ApplicationDescription}} (sketched below).  I might be wrong, but I'm 
assuming a proper {{spark.authenticate}} setup would ensure it's encrypted 
over the wire (can anyone confirm?).  Credentials on the executors and the 
driver (in cluster mode) are written to disk as whatever user the Spark 
daemon runs as.  Open to suggestions on whether it's worth tightening that up.
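
For the curious, the transport amounts to something like the following; the 
variable name and helpers are illustrative, not the exact code on the branch:

{code}
import java.nio.file.attribute.PosixFilePermissions
import java.nio.file.{Files, Paths}
import java.util.Base64

object KeytabTransportSketch {
  // Driver side: base64-encode the keytab so it can travel as a plain
  // string in the ApplicationDescription's environment map.
  def encodeKeytab(path: String): (String, String) =
    "SPARK_KEYTAB_B64" ->
      Base64.getEncoder.encodeToString(Files.readAllBytes(Paths.get(path)))

  // Executor side: decode the variable back into a file before the keytab
  // login; restrict permissions since it lands on local disk as the
  // daemon user.
  def materializeKeytab(targetPath: String): Unit = {
    val p = Paths.get(targetPath)
    Files.write(p, Base64.getDecoder.decode(sys.env("SPARK_KEYTAB_B64")))
    Files.setPosixFilePermissions(p, PosixFilePermissions.fromString("rw-------"))
  }
}
{code}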

Would appreciate any feedback from the community.


> Allow for keytab-based HDFS security in Standalone mode
> -------------------------------------------------------
>
>                 Key: SPARK-5158
>                 URL: https://issues.apache.org/jira/browse/SPARK-5158
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>            Reporter: Patrick Wendell
>            Assignee: Matthew Cheah
>            Priority: Critical
>
> There have been a handful of patches for allowing access to Kerberized HDFS 
> clusters in standalone mode. The main reason we haven't accepted these 
> patches has been that they rely on insecure distribution of token files from 
> the driver to the other components.
> As a simpler solution, I wonder if we should just provide a way to have the 
> Spark driver and executors independently log in and acquire credentials using 
> a keytab. This would work for users who have dedicated, single-tenant Spark 
> clusters (i.e. they are willing to have a keytab on every machine running 
> Spark for their application). It wouldn't address all possible deployment 
> scenarios, but if it's simple I think it's worth considering.
> This would also work for Spark streaming jobs, which often run on dedicated 
> hardware since they are long-running services.


