[ https://issues.apache.org/jira/browse/SPARK-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15955528#comment-15955528 ]
Ian Hummel edited comment on SPARK-5158 at 4/4/17 6:07 PM:
-----------------------------------------------------------

At Bloomberg we've been working on a solution to this issue so we can access kerberized HDFS clusters from standalone Spark installations running on our internal cloud infrastructure. Anyone interested can try out a patch at https://github.com/themodernlife/spark/tree/spark-5158. It extends standalone mode to support configuration via {{--principal}} and {{--keytab}}.

The main changes are:
- Refactor {{ConfigurableCredentialManager}} and the related {{CredentialProviders}} so they are no longer tied to YARN
- Set up credential renewal/updating from within the {{StandaloneSchedulerBackend}}
- Ensure executors and drivers can find the initial tokens for contacting HDFS and renew them at regular intervals

The implementation does essentially the same thing as the YARN backend. The keytab is copied to the driver/executors through an environment variable in the {{ApplicationDescription}}. I might be wrong, but I'm assuming a proper {{spark.authenticate}} setup would ensure it's encrypted over the wire (can anyone confirm?). Credentials on the executors and the driver (cluster mode) are written to disk as whatever user the Spark daemon runs as. Open to suggestions on whether it's worth tightening that up.

Would appreciate any feedback from the community.

> Allow for keytab-based HDFS security in Standalone mode
> -------------------------------------------------------
>
>                 Key: SPARK-5158
>                 URL: https://issues.apache.org/jira/browse/SPARK-5158
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>            Reporter: Patrick Wendell
>            Assignee: Matthew Cheah
>            Priority: Critical
>
> There have been a handful of patches for allowing access to Kerberized HDFS
> clusters in standalone mode. The main reason we haven't accepted these
> patches is that they rely on insecure distribution of token files from
> the driver to the other components.
> As a simpler solution, I wonder if we should just provide a way to have the
> Spark driver and executors independently log in and acquire credentials using
> a keytab. This would work for users who have dedicated, single-tenant
> Spark clusters (i.e. they are willing to have a keytab on every machine
> running Spark for their application). It wouldn't address all possible
> deployment scenarios, but if it's simple I think it's worth considering.
> This would also work for Spark streaming jobs, which often run on dedicated
> hardware since they are long-running services.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
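For context on the "independently log in and acquire credentials using a keytab" idea, the underlying mechanism is Hadoop's {{UserGroupInformation}} API: each JVM logs in from a locally available keytab and periodically re-logs in before the TGT expires. A minimal sketch follows; the command-line handling, renewal interval, and object name are illustrative, not code from the patch or from Spark itself:

```scala
import java.util.concurrent.{Executors, TimeUnit}
import org.apache.hadoop.security.UserGroupInformation

// Hypothetical sketch: each driver/executor JVM logs in from its own keytab
// and keeps the ticket fresh, as the proposal above suggests.
object KeytabLoginSketch {
  def main(args: Array[String]): Unit = {
    val principal  = args(0) // e.g. "spark/host@EXAMPLE.COM"
    val keytabPath = args(1) // path to a keytab present on this machine

    // Initial login: gives this JVM Kerberos credentials for HDFS access.
    UserGroupInformation.loginUserFromKeytab(principal, keytabPath)

    // Periodic re-login so long-running processes (e.g. streaming jobs)
    // keep a valid TGT; checkTGTAndReloginFromKeytab is a no-op while the
    // current ticket is still fresh.
    val renewer = Executors.newSingleThreadScheduledExecutor()
    renewer.scheduleAtFixedRate(new Runnable {
      def run(): Unit =
        UserGroupInformation.getLoginUser.checkTGTAndReloginFromKeytab()
    }, 1, 1, TimeUnit.HOURS)
  }
}
```

This requires the keytab to be present on every machine, which is exactly the single-tenant trade-off described above; the linked patch instead ships the keytab via the {{ApplicationDescription}}.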