[GitHub] spark issue #17530: [SPARK-5158] Access kerberized HDFS from Spark standalon...
Github user themodernlife commented on the issue: https://github.com/apache/spark/pull/17530 There is a spot in HadoopFSCredentialProvider where it looks for a Hadoop config key related to YARN to set the token renewer. In getTokenRenewer it calls Master.getMasterPrincipal(conf), which needs some YARN configuration set for things to succeed. Right now the PR doesn't set that, so it needs to be set under the user's HADOOP_CONF even though it has no real effect. That probably should be changed. Didn't have a chance to dig into the ticket you linked to but will try to have a look and compare notes. If anything comes to mind will comment there. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
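For reference, the workaround described above, satisfying Master.getMasterPrincipal by setting the YARN principal key under the user's HADOOP_CONF even though no ResourceManager is ever contacted, might look like this (the principal value is purely illustrative):

```xml
<!-- yarn-site.xml in the user's HADOOP_CONF_DIR; only read so that
     getTokenRenewer can resolve a renewer name, not to contact YARN -->
<configuration>
  <property>
    <name>yarn.resourcemanager.principal</name>
    <value>rm/_HOST@EXAMPLE.COM</value>
  </property>
</configuration>
```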
[GitHub] spark issue #17530: [SPARK-5158] Access kerberized HDFS from Spark standalon...
Github user themodernlife commented on the issue: https://github.com/apache/spark/pull/17530 BTW, not trying to give you the hard sell, and I appreciate the help rounding out the requirements from the core committers' POV.
[GitHub] spark issue #17530: [SPARK-5158] Access kerberized HDFS from Spark standalon...
Github user themodernlife commented on the issue: https://github.com/apache/spark/pull/17530 That would work for cluster mode, but in client mode the driver on the submitting node still needs the keytab, unfortunately. Standalone clusters are best viewed as distributed single-user programs, so I think the real mistake is not bringing them into a secure environment, but bringing them into a secure environment and trying to use them in a multi-tenant/multi-user fashion. I can see the concern that this feature might give someone who brings standalone clusters into a kerberized environment a false sense of security. What about disabling it unless something like `spark.standalone.single-user` is set to true?
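A minimal sketch of the opt-in gate floated above. Note that `spark.standalone.single-user` is only a proposed key, not an actual Spark configuration, and the helper below is illustrative rather than Spark's implementation:

```scala
// Hypothetical gate: refuse keytab/delegation-token support on standalone
// unless the operator has explicitly declared the cluster single-user.
object TokenSupportGate {
  private val Flag = "spark.standalone.single-user" // proposed key, not a real Spark config

  def tokenSupportEnabled(conf: Map[String, String]): Boolean =
    conf.getOrElse(Flag, "false").toBoolean

  def validate(conf: Map[String, String], keytabConfigured: Boolean): Unit = {
    if (keytabConfigured && !tokenSupportEnabled(conf)) {
      throw new IllegalArgumentException(
        s"Keytab support on standalone requires $Flag=true")
    }
  }
}
```

Failing fast at submission time like this would make the single-user assumption explicit instead of silently handing out tokens.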
[GitHub] spark issue #17530: [SPARK-5158] Access kerberized HDFS from Spark standalon...
Github user themodernlife commented on the issue: https://github.com/apache/spark/pull/17530 Said another way, people need another layer to use Spark standalone in secured environments anyway.
[GitHub] spark issue #17530: [SPARK-5158] Access kerberized HDFS from Spark standalon...
Github user themodernlife commented on the issue: https://github.com/apache/spark/pull/17530 To me it's basically the same as users including S3 credentials when submitting to spark standalone. Kerberos just requires more machinery. It might be a little harder to get at the spark conf entries of another user's job, but still possible since everything runs as the same unix user and shares the cluster secret.
[GitHub] spark issue #17530: [SPARK-5158] Access kerberized HDFS from Spark standalon...
Github user themodernlife commented on the issue: https://github.com/apache/spark/pull/17530 That's right, but you still need a separate out-of-band process refreshing with the KDC. My thinking is: why not have Spark do that on your behalf?
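The "have Spark do it on your behalf" idea boils down to a scheduled re-login/token-fetch loop inside the driver. A self-contained sketch of that loop (the class name and callback are illustrative; in the PR the real work is done by the credential manager, and the callback would use Hadoop's UGI APIs to re-authenticate against the KDC):

```scala
import java.util.concurrent.{Executors, ScheduledExecutorService, TimeUnit}

// Illustrative renewal loop: re-obtain credentials on a fixed schedule,
// well inside their expiry window, so no out-of-band cron/kinit is needed.
class CredentialRenewer(renewIntervalMs: Long)(obtainTokens: () => Unit) {
  private val scheduler: ScheduledExecutorService =
    Executors.newSingleThreadScheduledExecutor()

  def start(): Unit = {
    scheduler.scheduleAtFixedRate(
      new Runnable { override def run(): Unit = obtainTokens() },
      0L, renewIntervalMs, TimeUnit.MILLISECONDS)
  }

  def stop(): Unit = scheduler.shutdownNow()
}
```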
[GitHub] spark issue #17530: [SPARK-5158] Access kerberized HDFS from Spark standalon...
Github user themodernlife commented on the issue: https://github.com/apache/spark/pull/17530 In our setup each user gets their own standalone cluster. Users cannot submit jobs to each other's clusters. By providing a keytab on cluster creation and having Spark manage renewal on behalf of the user, we can support long running jobs with less headache.
[GitHub] spark issue #17530: [SPARK-5158] Access kerberized HDFS from Spark standalon...
Github user themodernlife commented on the issue: https://github.com/apache/spark/pull/17530 Hi @vanzin, spark standalone isn't really multi-user in any sense, since the executors for all jobs run as whatever user the worker daemon was started as. That shouldn't preclude standalone clusters from communicating with secured resources. Happy to add some additional documentation on this very point to the PR. Any other thoughts? Thanks,
[GitHub] spark pull request #17530: [SPARK-5158] Access kerberized HDFS from Spark st...
GitHub user themodernlife opened a pull request: https://github.com/apache/spark/pull/17530 [SPARK-5158] Access kerberized HDFS from Spark standalone ## What changes were proposed in this pull request? - Refactor `ConfigurableCredentialManager` and related `CredentialProviders` so that they are no longer tied to YARN - Set up credential renewal/updating from within the `StandaloneSchedulerBackend` - Ensure executors/drivers are able to find initial tokens for contacting HDFS and renew them at regular intervals The implementation does basically the same thing as the YARN backend. The keytab is copied to driver/executors through an environment variable in the `ApplicationDescription`. ## How was this patch tested? https://github.com/themodernlife/spark-standalone-kerberos contains a docker-compose environment with a KDC and a kerberized HDFS mini-cluster. The README contains instructions for running the integration test script to see credential refresh/updating occur. Credentials are set to update every 2 minutes or so.
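The keytab hand-off described above can be sketched as a simple encode/decode through the environment map carried in the application description. The variable name `SPARK_KEYTAB_B64` and the helper below are made up for illustration and are not necessarily what the PR uses:

```scala
import java.util.Base64

// Illustrative keytab hand-off: the submitter serialises the keytab into the
// ApplicationDescription's environment map; the executor side decodes it back
// to bytes before writing it to local disk and logging in from it.
object KeytabEnv {
  private val Key = "SPARK_KEYTAB_B64" // hypothetical variable name

  def encode(env: Map[String, String], keytab: Array[Byte]): Map[String, String] =
    env + (Key -> Base64.getEncoder.encodeToString(keytab))

  def decode(env: Map[String, String]): Option[Array[Byte]] =
    env.get(Key).map(s => Base64.getDecoder.decode(s))
}
```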
You can merge this pull request into a Git repository by running: $ git pull https://github.com/themodernlife/spark spark-5158 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/17530.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17530 commit 62a6e20179dd63703d18de9784c8b3770077e968 Author: Ian Hummel <ihum...@bloomberg.net> Date: 2017-02-24T21:29:43Z WIP commit accfe0cebc645ed2b99aaded7629b93b56fcb7ea Author: Ian Hummel <ihum...@bloomberg.net> Date: 2017-02-24T21:35:24Z Add license header that somehow got removed commit b8559b5895c81c871b1db00b75f038082b2dd4fb Author: Ian Hummel <ihum...@bloomberg.net> Date: 2017-02-24T21:46:18Z Fixup tests commit 539cc6cf630e9429e7131e755d8e9fa12479cd0c Author: Ian Hummel <ihum...@bloomberg.net> Date: 2017-02-26T01:01:12Z WIP commit 3f76281094493d63b6364fe38612e56f437c6a7c Author: Ian Hummel <ihum...@bloomberg.net> Date: 2017-02-27T21:26:48Z Push delegation token out to ExecutorRunner commit 25e7639af248bba4f648d13f5dc76a4fe8bfca34 Author: Ian Hummel <ihum...@bloomberg.net> Date: 2017-02-28T21:21:10Z More wip... probably borked commit 847f6044d2fd0bf1af52d3d7c5d618c8e537e916 Author: Ian Hummel <ihum...@bloomberg.net> Date: 2017-03-02T16:48:45Z Untested... 
make cluster mode work with standalone commit 4689a55402f193199faf2dc2e2c6c4c904e34bf0 Author: Ian Hummel <ihum...@bloomberg.net> Date: 2017-03-07T16:35:51Z Hadoop FileInputFormat is hardcoded to request delegation tokens with renewer = yarn.resourcemanager.principal commit 3e85aa5bfbaee2760d9eb3559d23546508b463d9 Author: Ian Hummel <ihum...@bloomberg.net> Date: 2017-03-07T20:59:21Z Still need to sort out a few things, but overall much smaller patch-set commit f743e6b207b7f71034fe617a402f54e0121b13a2 Author: Ian Hummel <ihum...@bloomberg.net> Date: 2017-03-08T17:06:14Z WIP commit 31c91dcec25718052ae5c775bfe1b41359e8840f Author: Ian Hummel <ihum...@bloomberg.net> Date: 2017-03-08T18:14:48Z WIP commit 19644195af14c9b8a451609157b9d47f7251ced4 Author: Ian Hummel <ihum...@bloomberg.net> Date: 2017-03-08T22:15:41Z Still something isn't working commit b5bacf31e00243073e7311b768a13aec51c6b9db Author: Ian Hummel <ihum...@bloomberg.net> Date: 2017-03-15T15:56:41Z Merge master commit 83f05014659e08a4cd8c9703941c98aaaba9eb31 Author: Ian Hummel <ihum...@bloomberg.net> Date: 2017-03-15T20:28:56Z Actually use credential updater commit 917b077ca1e05a9bb44bcb91c33ed64a1d1c364c Author: Ian Hummel <ihum...@bloomberg.net> Date: 2017-04-04T14:38:18Z Change order of configuration setting so that everything works commit a4c22a92496271935a769313b09da1b8ae88107a Author: Ian Hummel <ihum...@bloomberg.net> Date: 2017-04-04T14:39:44Z Merge branch 'master' into spark-5158 * master: (164 commits) [SPARK-20198][SQL] Remove the inconsistency in table/function name conventions in SparkSession.Catalog APIs [SPARK-20190][APP-ID] applications//jobs' in rest api,status should be [running|s… [SPARK-19825][R][ML] spark.ml R API for FPGrowth [SPARK-20067][SQL] Unify and Clean Up Desc Commands Using Catalog Interface [SPARK-10364][SQL] Support Parquet logical type TIMESTAMP_MILLIS [SPARK-19408][SQL] filter estimation on two columns of same table [SPARK-20145] Fix range case insensitive bug in SQL
[SPARK-20194] Add support for partition pruning to in-memory catalog [SPARK-19641][SQL] JSON schema inference in DROPMALFORMED mode produces incorrect schema for non-array/object JSONs [SPARK-19969][ML] Imputer doc
[GitHub] spark issue #16563: [SPARK-17568][CORE][DEPLOY] Add spark-submit option to o...
Github user themodernlife commented on the issue: https://github.com/apache/spark/pull/16563 Ok, thanks.
[GitHub] spark pull request #16563: [SPARK-17568][CORE][DEPLOY] Add spark-submit opti...
Github user themodernlife closed the pull request at: https://github.com/apache/spark/pull/16563
[GitHub] spark pull request #16563: [SPARK-17568][CORE][DEPLOY] Add spark-submit opti...
GitHub user themodernlife opened a pull request: https://github.com/apache/spark/pull/16563 [SPARK-17568][CORE][DEPLOY] Add spark-submit option to override ivy settings used to resolve packages/artifacts Backports #15119 to the 2.1 branch. Is it possible to include this in Spark 2.1.1? @BryanCutler @vanzin FYI I'm currently doing some more testing on this, but wanted to get a PR made in any case. Thanks! You can merge this pull request into a Git repository by running: $ git pull https://github.com/themodernlife/spark backport-spark-17568 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16563.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16563
[GitHub] spark issue #15119: [SPARK-17568][CORE][DEPLOY] Add spark-submit option to o...
Github user themodernlife commented on the issue: https://github.com/apache/spark/pull/15119 @BryanCutler - #1 I think you're right... the naming of that key is unfortunate, `spark.jars.ivyUserDir` or something would have been better... it affects the `defaultIvyUserDir` property of the Ivy settings - #2 sounds right to me
[GitHub] spark pull request #15119: [SPARK-17568][CORE][DEPLOY] Add spark-submit opti...
Github user themodernlife commented on a diff in the pull request: https://github.com/apache/spark/pull/15119#discussion_r92817182 --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala --- @@ -291,8 +292,12 @@ object SparkSubmit { } else { Nil } + +val ivySettings = Option(args.ivySettingsFile).map(SparkSubmitUtils.loadIvySettings).getOrElse( --- End diff -- @vanzin I thought this would be useful too, but it turns out I haven't missed it at all using this patch for the last few weeks. It's no big deal to add any "one off" repositories to an xml file if you need that kind of customization.
[GitHub] spark issue #15119: [SPARK-17568][CORE][DEPLOY] Add spark-submit option to o...
Github user themodernlife commented on the issue: https://github.com/apache/spark/pull/15119 FYI I tried this out in our environment - Firewall (no access to maven central) - Custom ivysettings.xml to point to our internal Artifactory Everything worked just as I'd expected. Hope to see this merged!
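For anyone reproducing that setup, a minimal ivysettings.xml along these lines routes all resolution through a single internal mirror; the hostname and repository path are placeholders, not the actual environment described above:

```xml
<ivysettings>
  <settings defaultResolver="internal"/>
  <resolvers>
    <!-- m2compatible lets Ivy resolve from a Maven-layout repository -->
    <ibiblio name="internal" m2compatible="true"
             root="https://artifactory.example.com/artifactory/maven-virtual/"/>
  </resolvers>
</ivysettings>
```

This is the kind of file the spark-submit option added by this PR would point at.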
[GitHub] spark pull request: [SPARK-3595] Respect configured OutputCommitte...
Github user themodernlife commented on a diff in the pull request: https://github.com/apache/spark/pull/2450#discussion_r17808274 --- Diff: core/src/test/scala/org/apache/spark/rdd/PairRDDFunctionsSuite.scala --- @@ -478,6 +482,15 @@ class PairRDDFunctionsSuite extends FunSuite with SharedSparkContext { pairs.saveAsNewAPIHadoopFile[ConfigTestFormat]("ignored") } + test("saveAsHadoopFile should respect configured output committers") { +val pairs = sc.parallelize(Array((new Integer(1), new Integer(1)))) +val conf = new JobConf(sc.hadoopConfiguration) +conf.setOutputCommitter(classOf[FakeOutputCommitter]) +pairs.saveAsHadoopFile("ignored", pairs.keyClass, pairs.valueClass, classOf[FakeOutputFormat], conf) +val ran = sys.props.remove("mapred.committer.ran") --- End diff -- Agreed, this part's ugly but it seemed like the least invasive way. I also thought about maybe using a ThreadLocal but didn't get too far.
[GitHub] spark pull request: [SPARK-3595] Respect configured OutputCommitte...
GitHub user themodernlife opened a pull request: https://github.com/apache/spark/pull/2450 [SPARK-3595] Respect configured OutputCommitters when calling saveAsHadoopFile Addresses the issue in https://issues.apache.org/jira/browse/SPARK-3595, namely saveAsHadoopFile hardcoding the OutputCommitter. This is not ideal when running Spark jobs that write to S3, especially when running them from an EMR cluster where the default OutputCommitter is a DirectOutputCommitter. You can merge this pull request into a Git repository by running: $ git pull https://github.com/themodernlife/spark spark-3595 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2450.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2450 commit 8b6be94801ca33bca32aa574b1a8f6a76760869d Author: Ian Hummel i...@themodernlife.net Date: 2014-09-15T15:38:59Z Add ability to specify OutputCommitter, espcially useful when writing to an S3 bucket from an EMR cluster commit 4359664b1d557d55b0579023df809542386d5b8c Author: Ian Hummel i...@themodernlife.net Date: 2014-09-18T20:18:57Z Add an example showing usage commit a11d9f3806e6a8d06d13417af9f27bfd3795334b Author: Ian Hummel i...@themodernlife.net Date: 2014-09-18T20:52:17Z Fix formatting
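The core of the fix is to consult the JobConf instead of hardcoding FileOutputCommitter. Stripped of the Hadoop dependencies, the lookup reduces to the sketch below; the key name matches Hadoop's `mapred.output.committer.class` (the property JobConf.setOutputCommitter writes), but the helper itself is illustrative, not Spark's code:

```scala
// Illustrative: resolve the committer class from configuration, falling back
// to Hadoop's default only when the user configured nothing.
object OutputCommitterResolver {
  val DefaultCommitter = "org.apache.hadoop.mapred.FileOutputCommitter"

  def committerClassName(conf: Map[String, String]): String =
    conf.getOrElse("mapred.output.committer.class", DefaultCommitter)
}
```

With this in place, an EMR cluster that sets a DirectOutputCommitter in its site configuration would see that class honored by saveAsHadoopFile rather than silently overridden.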