[GitHub] spark issue #17530: [SPARK-5158] Access kerberized HDFS from Spark standalon...

2017-04-14 Thread themodernlife
Github user themodernlife commented on the issue:

https://github.com/apache/spark/pull/17530
  
There is a spot in HadoopFSCredentialProvider where it looks for a Hadoop 
config key related to YARN in order to set the token renewer. 

In getTokenRenewer it calls Master.getMasterPrincipal(conf), which needs 
some YARN configuration set for things to succeed. 

Right now the PR doesn't set that, so it has to be set under the user's 
HADOOP_CONF even though it has no real effect there. That should probably be changed. 
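
For anyone hitting this in the meantime, a minimal sketch of the workaround 
(the principal value is a placeholder; standalone never actually renews 
through YARN, the entry just has to exist for getTokenRenewer to succeed):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.FileSystem
    import org.apache.hadoop.security.Credentials

    // getTokenRenewer consults the YARN config, so a dummy principal must be
    // present under HADOOP_CONF even though standalone never renews via YARN.
    val hadoopConf = new Configuration()
    hadoopConf.set("yarn.resourcemanager.principal", "rm/_HOST@EXAMPLE.COM") // placeholder

    // Request HDFS delegation tokens; the renewer string is derived from the
    // config entry set above.
    val credentials = new Credentials()
    FileSystem.get(hadoopConf).addDelegationTokens("rm/_HOST@EXAMPLE.COM", credentials)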

I haven't had a chance to dig into the ticket you linked, but I'll try to 
take a look and compare notes. If anything comes to mind I'll comment there. 





[GitHub] spark issue #17530: [SPARK-5158] Access kerberized HDFS from Spark standalon...

2017-04-05 Thread themodernlife
Github user themodernlife commented on the issue:

https://github.com/apache/spark/pull/17530
  
BTW, not trying to give you the hard sell, and I appreciate the help rounding 
out the requirements from the core committers' POV.





[GitHub] spark issue #17530: [SPARK-5158] Access kerberized HDFS from Spark standalon...

2017-04-05 Thread themodernlife
Github user themodernlife commented on the issue:

https://github.com/apache/spark/pull/17530
  
That would work for cluster mode, but in client mode the driver on the 
submitting node still needs the keytab, unfortunately.  

Standalone clusters are best viewed as distributed single-user programs, so 
I think the real mistake isn't bringing them into a secure environment per se; 
it's bringing them into a secure environment and then trying to use them in a 
multi-tenant/multi-user fashion.

I can see the concern that this feature might give someone who brings 
standalone clusters into a kerberized environment a false sense of security.  
What about disabling it unless something like `spark.standalone.single-user` is 
set to true?
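
A rough sketch of such a gate (the config key is just the one proposed 
above, not an existing Spark setting; `spark.yarn.keytab` is borrowed from the 
YARN backend purely for illustration):

    import org.apache.spark.{SparkConf, SparkException}

    object StandaloneKerberosGate {
      // Refuse keytab-based login unless the operator has explicitly
      // acknowledged the single-user security model of standalone mode.
      def validate(conf: SparkConf): Unit = {
        val singleUser = conf.getBoolean("spark.standalone.single-user", false)
        if (conf.contains("spark.yarn.keytab") && !singleUser) {
          throw new SparkException(
            "Keytab support on standalone requires spark.standalone.single-user=true, " +
              "since all applications share the worker's OS user and cluster secret.")
        }
      }
    }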





[GitHub] spark issue #17530: [SPARK-5158] Access kerberized HDFS from Spark standalon...

2017-04-05 Thread themodernlife
Github user themodernlife commented on the issue:

https://github.com/apache/spark/pull/17530
  
Said another way: people need another layer on top of Spark standalone to 
use it in secured environments anyway. 





[GitHub] spark issue #17530: [SPARK-5158] Access kerberized HDFS from Spark standalon...

2017-04-05 Thread themodernlife
Github user themodernlife commented on the issue:

https://github.com/apache/spark/pull/17530
  
To me it's basically the same as users including S3 credentials when 
submitting to Spark standalone; Kerberos just requires more machinery. It might 
be a little harder to get at the Spark conf entries of another user's job, but 
it's still possible, since everything runs as the same Unix user and shares the 
cluster secret.





[GitHub] spark issue #17530: [SPARK-5158] Access kerberized HDFS from Spark standalon...

2017-04-05 Thread themodernlife
Github user themodernlife commented on the issue:

https://github.com/apache/spark/pull/17530
  
That's right, but you still need a separate out-of-band process refreshing 
credentials with the KDC. My thinking is: why not have Spark do that on your behalf?
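
As a sketch of what that could look like inside Spark (the principal and 
keytab path are placeholders; the PR wires this into the scheduler backend 
rather than a freestanding loop):

    import java.util.concurrent.{Executors, TimeUnit}
    import org.apache.hadoop.security.UserGroupInformation

    // Log in once from the keytab, then refresh periodically.
    // checkTGTAndReloginFromKeytab is a no-op until the TGT nears expiry.
    val ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
      "alice@EXAMPLE.COM", "/etc/security/keytabs/alice.keytab")

    val scheduler = Executors.newSingleThreadScheduledExecutor()
    scheduler.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = ugi.checkTGTAndReloginFromKeytab()
    }, 1, 1, TimeUnit.HOURS)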





[GitHub] spark issue #17530: [SPARK-5158] Access kerberized HDFS from Spark standalon...

2017-04-05 Thread themodernlife
Github user themodernlife commented on the issue:

https://github.com/apache/spark/pull/17530
  
In our setup each user gets their own standalone cluster. Users cannot 
submit jobs to each other's clusters. By providing a keytab on cluster creation 
and having Spark manage renewal on behalf of the user, we can support 
long-running jobs with less headache.
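
For context, submission would look something like this (assuming the PR 
reuses the existing `--principal`/`--keytab` spark-submit flags from the YARN 
backend; the master URL, class and jar are placeholders):

    $ spark-submit \
        --master spark://cluster-for-alice:7077 \
        --principal alice@EXAMPLE.COM \
        --keytab /etc/security/keytabs/alice.keytab \
        --class com.example.LongRunningJob job.jar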





[GitHub] spark issue #17530: [SPARK-5158] Access kerberized HDFS from Spark standalon...

2017-04-05 Thread themodernlife
Github user themodernlife commented on the issue:

https://github.com/apache/spark/pull/17530
  
Hi @vanzin, Spark standalone isn't really multi-user in any sense, since the 
executors for all jobs run as whatever user the worker daemon was started as. 
That shouldn't preclude standalone clusters from communicating with secured 
resources. 

Happy to add some additional documentation on this very point to the PR. 

Any other thoughts?

Thanks,





[GitHub] spark pull request #17530: [SPARK-5158] Access kerberized HDFS from Spark st...

2017-04-04 Thread themodernlife
GitHub user themodernlife opened a pull request:

https://github.com/apache/spark/pull/17530

[SPARK-5158] Access kerberized HDFS from Spark standalone

## What changes were proposed in this pull request?

- Refactor `ConfigurableCredentialManager` and related 
`CredentialProviders` so that they are no longer tied to YARN
- Set up credential renewal/updating from within the 
`StandaloneSchedulerBackend`
- Ensure executors/drivers are able to find initial tokens for contacting 
HDFS and renew them at regular intervals

The implementation does basically the same thing as the YARN backend. The 
keytab is copied to driver/executors through an environment variable in the 
`ApplicationDescription`.
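
To illustrate the mechanism (this is not the PR's exact code), Hadoop 
`Credentials` can be serialized to bytes, shipped through an environment 
variable, and rehydrated on the other side:

    import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}
    import java.util.Base64
    import org.apache.hadoop.security.Credentials

    // Driver side: serialize tokens and Base64-encode them for an env var.
    def encode(creds: Credentials): String = {
      val baos = new ByteArrayOutputStream()
      creds.writeTokenStorageToStream(new DataOutputStream(baos))
      Base64.getEncoder.encodeToString(baos.toByteArray)
    }

    // Executor side: decode the env var back into Credentials.
    def decode(encoded: String): Credentials = {
      val creds = new Credentials()
      creds.readTokenStorageStream(
        new DataInputStream(new ByteArrayInputStream(Base64.getDecoder.decode(encoded))))
      creds
    }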

## How was this patch tested?

https://github.com/themodernlife/spark-standalone-kerberos contains a 
docker-compose environment with a KDC and a Kerberized HDFS mini-cluster. The 
README contains instructions for running the integration test script to see 
credential refresh/updating occur. Credentials are set to update every 2 
minutes or so.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/themodernlife/spark spark-5158

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17530.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17530


commit 62a6e20179dd63703d18de9784c8b3770077e968
Author: Ian Hummel <ihum...@bloomberg.net>
Date:   2017-02-24T21:29:43Z

WIP

commit accfe0cebc645ed2b99aaded7629b93b56fcb7ea
Author: Ian Hummel <ihum...@bloomberg.net>
Date:   2017-02-24T21:35:24Z

Add license header that somehow got removed

commit b8559b5895c81c871b1db00b75f038082b2dd4fb
Author: Ian Hummel <ihum...@bloomberg.net>
Date:   2017-02-24T21:46:18Z

Fixup tests

commit 539cc6cf630e9429e7131e755d8e9fa12479cd0c
Author: Ian Hummel <ihum...@bloomberg.net>
Date:   2017-02-26T01:01:12Z

WIP

commit 3f76281094493d63b6364fe38612e56f437c6a7c
Author: Ian Hummel <ihum...@bloomberg.net>
Date:   2017-02-27T21:26:48Z

Push delegation token out to ExecutorRunner

commit 25e7639af248bba4f648d13f5dc76a4fe8bfca34
Author: Ian Hummel <ihum...@bloomberg.net>
Date:   2017-02-28T21:21:10Z

More wip... probably borked

commit 847f6044d2fd0bf1af52d3d7c5d618c8e537e916
Author: Ian Hummel <ihum...@bloomberg.net>
Date:   2017-03-02T16:48:45Z

Untested... make cluster mode work with standalone

commit 4689a55402f193199faf2dc2e2c6c4c904e34bf0
Author: Ian Hummel <ihum...@bloomberg.net>
Date:   2017-03-07T16:35:51Z

Hadoop FileInputFormat is hardcoded to request delegation tokens with 
renewer = yarn.resourcemanager.principal

commit 3e85aa5bfbaee2760d9eb3559d23546508b463d9
Author: Ian Hummel <ihum...@bloomberg.net>
Date:   2017-03-07T20:59:21Z

Still need to sort out a few things, but overall much smaller patch-set

commit f743e6b207b7f71034fe617a402f54e0121b13a2
Author: Ian Hummel <ihum...@bloomberg.net>
Date:   2017-03-08T17:06:14Z

WIP

commit 31c91dcec25718052ae5c775bfe1b41359e8840f
Author: Ian Hummel <ihum...@bloomberg.net>
Date:   2017-03-08T18:14:48Z

WIP

commit 19644195af14c9b8a451609157b9d47f7251ced4
Author: Ian Hummel <ihum...@bloomberg.net>
Date:   2017-03-08T22:15:41Z

Still something isn't working

commit b5bacf31e00243073e7311b768a13aec51c6b9db
Author: Ian Hummel <ihum...@bloomberg.net>
Date:   2017-03-15T15:56:41Z

Merge master

commit 83f05014659e08a4cd8c9703941c98aaaba9eb31
Author: Ian Hummel <ihum...@bloomberg.net>
Date:   2017-03-15T20:28:56Z

Actually use credential updater

commit 917b077ca1e05a9bb44bcb91c33ed64a1d1c364c
Author: Ian Hummel <ihum...@bloomberg.net>
Date:   2017-04-04T14:38:18Z

Change order of configuration setting so that everything works

commit a4c22a92496271935a769313b09da1b8ae88107a
Author: Ian Hummel <ihum...@bloomberg.net>
Date:   2017-04-04T14:39:44Z

Merge branch 'master' into spark-5158

* master: (164 commits)
  [SPARK-20198][SQL] Remove the inconsistency in table/function name 
conventions in SparkSession.Catalog APIs
  [SPARK-20190][APP-ID] applications//jobs' in rest api,status should be 
[running|s…
  [SPARK-19825][R][ML] spark.ml R API for FPGrowth
  [SPARK-20067][SQL] Unify and Clean Up Desc Commands Using Catalog 
Interface
  [SPARK-10364][SQL] Support Parquet logical type TIMESTAMP_MILLIS
  [SPARK-19408][SQL] filter estimation on two columns of same table
  [SPARK-20145] Fix range case insensitive bug in SQL
  [SPARK-20194] Add support for partition pruning to in-memory catalog
  [SPARK-19641][SQL] JSON schema inference in DROPMALFORMED mode produces 
incorrect schema for non-array/object JSONs
  [SPARK-19969][ML] Imputer doc

[GitHub] spark issue #16563: [SPARK-17568][CORE][DEPLOY] Add spark-submit option to o...

2017-01-13 Thread themodernlife
Github user themodernlife commented on the issue:

https://github.com/apache/spark/pull/16563
  
Ok, thanks.





[GitHub] spark pull request #16563: [SPARK-17568][CORE][DEPLOY] Add spark-submit opti...

2017-01-13 Thread themodernlife
Github user themodernlife closed the pull request at:

https://github.com/apache/spark/pull/16563





[GitHub] spark pull request #16563: [SPARK-17568][CORE][DEPLOY] Add spark-submit opti...

2017-01-12 Thread themodernlife
GitHub user themodernlife opened a pull request:

https://github.com/apache/spark/pull/16563

[SPARK-17568][CORE][DEPLOY] Add spark-submit option to override ivy 
settings used to resolve packages/artifacts

Backports #15119 to the 2.1 branch. Is it possible to include this in 
Spark 2.1.1?

@BryanCutler @vanzin FYI

I'm currently doing some more testing on this, but wanted to get a PR made 
in any case.  Thanks!

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/themodernlife/spark backport-spark-17568

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16563.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16563









[GitHub] spark issue #15119: [SPARK-17568][CORE][DEPLOY] Add spark-submit option to o...

2017-01-05 Thread themodernlife
Github user themodernlife commented on the issue:

https://github.com/apache/spark/pull/15119
  
@BryanCutler 

- #1 I think you're right... the naming of that key is unfortunate; 
`spark.jars.ivyUserDir` or something would have been better. It affects the 
`defaultIvyUserDir` property of the Ivy settings (see the snippet below).
- #2 sounds right to me
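
For reference, a tiny sketch of the Ivy property in question (the directory 
path is a placeholder):

    import java.io.File
    import org.apache.ivy.core.settings.IvySettings

    val settings = new IvySettings()
    // This is the directory the key discussed above ultimately controls.
    settings.setDefaultIvyUserDir(new File("/custom/ivy/dir"))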







[GitHub] spark pull request #15119: [SPARK-17568][CORE][DEPLOY] Add spark-submit opti...

2016-12-16 Thread themodernlife
Github user themodernlife commented on a diff in the pull request:

https://github.com/apache/spark/pull/15119#discussion_r92817182
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ---
@@ -291,8 +292,12 @@ object SparkSubmit {
   } else {
 Nil
   }
+
+    val ivySettings = Option(args.ivySettingsFile).map(SparkSubmitUtils.loadIvySettings).getOrElse(
--- End diff --

@vanzin I thought this would be useful too, but it turns out I haven't 
missed it at all using this patch for the last few weeks.  It's no big deal to 
add any "one off" repositories to an xml file if you need that kind of 
customization.





[GitHub] spark issue #15119: [SPARK-17568][CORE][DEPLOY] Add spark-submit option to o...

2016-11-08 Thread themodernlife
Github user themodernlife commented on the issue:

https://github.com/apache/spark/pull/15119
  
FYI I tried this out in our environment
- Firewall (no access to maven central)
- Custom ivysettings.xml to point to our internal Artifactory

Everything worked just as I'd expected.  Hope to see this merged!
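
For anyone setting up something similar, a minimal ivysettings.xml along 
these lines worked for us (the resolver name and URL are placeholders for your 
internal repository):

    <ivysettings>
      <settings defaultResolver="internal"/>
      <resolvers>
        <ibiblio name="internal" m2compatible="true"
                 root="https://artifactory.example.com/maven"/>
      </resolvers>
    </ivysettings>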





[GitHub] spark pull request: [SPARK-3595] Respect configured OutputCommitte...

2014-09-19 Thread themodernlife
Github user themodernlife commented on a diff in the pull request:

https://github.com/apache/spark/pull/2450#discussion_r17808274
  
--- Diff: 
core/src/test/scala/org/apache/spark/rdd/PairRDDFunctionsSuite.scala ---
@@ -478,6 +482,15 @@ class PairRDDFunctionsSuite extends FunSuite with SharedSparkContext {
     pairs.saveAsNewAPIHadoopFile[ConfigTestFormat]("ignored")
   }
 
+  test("saveAsHadoopFile should respect configured output committers") {
+    val pairs = sc.parallelize(Array((new Integer(1), new Integer(1))))
+    val conf = new JobConf(sc.hadoopConfiguration)
+    conf.setOutputCommitter(classOf[FakeOutputCommitter])
+    pairs.saveAsHadoopFile("ignored", pairs.keyClass, pairs.valueClass,
+      classOf[FakeOutputFormat], conf)
+    val ran = sys.props.remove("mapred.committer.ran")
--- End diff --

Agreed, this part's ugly but it seemed like the least invasive way.  I also 
thought about maybe using a ThreadLocal but didn't get too far.





[GitHub] spark pull request: [SPARK-3595] Respect configured OutputCommitte...

2014-09-18 Thread themodernlife
GitHub user themodernlife opened a pull request:

https://github.com/apache/spark/pull/2450

[SPARK-3595] Respect configured OutputCommitters when calling 
saveAsHadoopFile

Addresses the issue in https://issues.apache.org/jira/browse/SPARK-3595, 
namely saveAsHadoopFile hardcoding the OutputCommitter.  This is not ideal when 
running Spark jobs that write to S3, especially when running them from an EMR 
cluster where the default OutputCommitter is a DirectOutputCommitter.
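
A short sketch of the usage this enables (the committer here is the stock 
FileOutputCommitter as a stand-in; on EMR you would configure 
DirectOutputCommitter instead, and the paths/types are placeholders):

    import org.apache.hadoop.io.{IntWritable, Text}
    import org.apache.hadoop.mapred.{FileOutputCommitter, JobConf, TextOutputFormat}
    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._ // pair-RDD implicits

    def save(sc: SparkContext): Unit = {
      val pairs = sc.parallelize(Seq((new IntWritable(1), new Text("a"))))
      val conf = new JobConf(sc.hadoopConfiguration)
      // With this patch, the committer configured on the JobConf is the one
      // Spark uses, instead of the previously hardcoded default.
      conf.setOutputCommitter(classOf[FileOutputCommitter])
      pairs.saveAsHadoopFile("/tmp/out", classOf[IntWritable], classOf[Text],
        classOf[TextOutputFormat[IntWritable, Text]], conf)
    }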

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/themodernlife/spark spark-3595

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2450.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2450


commit 8b6be94801ca33bca32aa574b1a8f6a76760869d
Author: Ian Hummel i...@themodernlife.net
Date:   2014-09-15T15:38:59Z

Add ability to specify OutputCommitter, especially useful when writing to an 
S3 bucket from an EMR cluster

commit 4359664b1d557d55b0579023df809542386d5b8c
Author: Ian Hummel i...@themodernlife.net
Date:   2014-09-18T20:18:57Z

Add an example showing usage

commit a11d9f3806e6a8d06d13417af9f27bfd3795334b
Author: Ian Hummel i...@themodernlife.net
Date:   2014-09-18T20:52:17Z

Fix formatting



