[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/4292#issuecomment-73537266 Sorry for my delay; I was out last week. I would have to agree with @JoshRosen's last comment. If they have already created the RDD, then I wouldn't expect any changes to the Hadoop configuration to apply. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...
Github user harishreedharan commented on the pull request: https://github.com/apache/spark/pull/4292#issuecomment-72710717 I think @JoshRosen is right. I don't think we need to worry about the change in the conf after the RDD has been defined. That makes sense.
[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4292#issuecomment-72572163 LGTM and seems very straightforward.
[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...
Github user harishreedharan commented on the pull request: https://github.com/apache/spark/pull/4292#issuecomment-72554774 @pwendell - This is a small enough patch and relatively low risk. It would be great to merge this into 1.3.
[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/4292
[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/4292#issuecomment-72602613 LGTM, too, so I'm going to merge this into `master` (1.3.0). Thanks!
[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/4292#discussion_r23985702
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
    @@ -820,7 +822,10 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
           kClass: Class[K],
           vClass: Class[V]): RDD[(K, V)] = {
         assertNotStopped()
    -    new NewHadoopRDD(this, fClass, kClass, vClass, conf)
    +    // Add necessary security credentials to the JobConf. Required to access secure HDFS.
    +    val jconf = new JobConf(conf)
    +    SparkHadoopUtil.get.addCredentials(jconf)
--- End diff --
Yep, looks like `addCredentials` is implemented as a no-op, so this should be fine.
[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/4292#issuecomment-72603547 For example:
```scala
15/02/02 22:49:12 INFO SparkILoop: Created spark context..
Spark context available as sc.

scala> import org.apache.hadoop.mapred.{FileInputFormat, InputFormat, JobConf, SequenceFileInputFormat, TextInputFormat}
import org.apache.hadoop.mapred.{FileInputFormat, InputFormat, JobConf, SequenceFileInputFormat, TextInputFormat}

scala> import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.conf.Configuration

scala> val conf = new Configuration()
conf: org.apache.hadoop.conf.Configuration = Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml

scala> val jobConf = new JobConf(conf)
jobConf: org.apache.hadoop.mapred.JobConf = Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml

scala> jobConf.getInt("myInt", 0)
res3: Int = 0

scala> conf.setInt("myInt", 1)

scala> jobConf.getInt("myInt", 0)
res5: Int = 0
```
[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4292#issuecomment-72604443 @JoshRosen that's a great point, and it could cause a behavior regression that would be really hard for users to diagnose. @tgravescs, what about deferring the injection of the credentials until just before the conf is broadcast?
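A rough sketch of what "deferring the injection" could look like, written without Hadoop or Spark on the classpath. Everything here is an assumption made for illustration: `SketchConf` is a hypothetical stand-in for a Hadoop `Configuration`, `DeferredConfRdd` is a hypothetical stand-in for an RDD, and the keys are made up. It is not how Spark implements this; it only shows that copying the conf lazily lets mutations made after the RDD is defined, but before first use, remain visible.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a Hadoop Configuration (not the real class).
final class SketchConf(initial: Map[String, String] = Map.empty) {
  private val settings = mutable.Map(initial.toSeq: _*)
  def set(key: String, value: String): Unit = settings.update(key, value)
  def get(key: String): Option[String] = settings.get(key)
  // Analogue of a copy constructor: an independent snapshot of the current settings.
  def snapshot(): SketchConf = new SketchConf(settings.toMap)
}

// Hypothetical stand-in for an RDD that holds a reference to the shared conf
// and only copies it (and injects credentials) when the conf is first needed.
final class DeferredConfRdd(shared: SketchConf, addCredentials: SketchConf => Unit) {
  lazy val effectiveConf: SketchConf = {
    val copy = shared.snapshot() // taken at first use, not at definition time
    addCredentials(copy)         // credentials go into the copy only
    copy
  }
}

object DeferredInjectionDemo {
  def main(args: Array[String]): Unit = {
    val shared = new SketchConf()
    val rdd = new DeferredConfRdd(shared, c => c.set("sketch.token", "t0"))

    // The user mutates the shared conf after the "RDD" has been defined.
    shared.set("sketch.user.key", "set-after-definition")

    println(rdd.effectiveConf.get("sketch.user.key")) // Some(set-after-definition)
    println(rdd.effectiveConf.get("sketch.token"))    // Some(t0)
    println(shared.get("sketch.token"))               // None
  }
}
```

The trade-off in this sketch is that mutations made after the first access to `effectiveConf` are still not seen, so deferral narrows the window rather than closing it.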
[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4292#issuecomment-72604619 We could also just leave it as-is and then do something like that if we find this is encountered by users.
[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/4292#issuecomment-72605117 I found some [previous discussion](https://issues.apache.org/jira/browse/SPARK-2546?focusedCommentId=14160842&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14160842) of this issue. I'd say that expecting `sc.hadoopConfiguration` to be mutated by users after it's already been used to define RDDs isn't something that we can / should realistically hope to support, because there are just too many ways that it could break (e.g. defensive copying, serialization, etc.) and because it runs counter to user expectations around other types of Spark configurations (e.g. user modifications to SparkConf after creating SparkContext will not take effect).
[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/4292#issuecomment-72603238 Ugh, I just realized that this might potentially regress behavior for some weird corner-cases that arise due to our shared mutable `hadoopConfiguration`. A common use-case for `sc.hadoopConfiguration` is to pass credentials for S3 filesystems. The problem that crops up is when a user has already defined a bunch of RDDs and then mutates the configuration to pass credentials: in this case, I think this patch will break those user programs because modifications to the `hadoopConfiguration` won't be reflected in the `JobConf`.
[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...
Github user lianhuiwang commented on a diff in the pull request: https://github.com/apache/spark/pull/4292#discussion_r2348
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
    @@ -820,7 +822,10 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
           kClass: Class[K],
           vClass: Class[V]): RDD[(K, V)] = {
         assertNotStopped()
    -    new NewHadoopRDD(this, fClass, kClass, vClass, conf)
    +    // Add necessary security credentials to the JobConf. Required to access secure HDFS.
    +    val jconf = new JobConf(conf)
    +    SparkHadoopUtil.get.addCredentials(jconf)
--- End diff --
If the mode is not YARN, `SparkHadoopUtil.addCredentials` does not do anything, so this change does not resolve the problem in non-YARN mode.
[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...
Github user harishreedharan commented on a diff in the pull request: https://github.com/apache/spark/pull/4292#discussion_r23893987
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
    @@ -820,7 +822,10 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
           kClass: Class[K],
           vClass: Class[V]): RDD[(K, V)] = {
         assertNotStopped()
    -    new NewHadoopRDD(this, fClass, kClass, vClass, conf)
    +    // Add necessary security credentials to the JobConf. Required to access secure HDFS.
    +    val jconf = new JobConf(conf)
    +    SparkHadoopUtil.get.addCredentials(jconf)
--- End diff --
Since security is supported only in YARN mode, this should be fine.
[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...
GitHub user tgravescs opened a pull request: https://github.com/apache/spark/pull/4292
[SPARK-3778] newAPIHadoopRDD doesn't properly pass credentials for secure hdfs
This was previously https://github.com/apache/spark/pull/2676.
https://issues.apache.org/jira/browse/SPARK-3778
This affects anyone trying to access secure HDFS with something like:
    val lines = {
      val hconf = new Configuration()
      hconf.set("mapred.input.dir", mydir)
      hconf.set("textinputformat.record.delimiter", "\003432\n")
      sc.newAPIHadoopRDD(hconf, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
    }
You can merge this pull request into a Git repository by running:
    $ git pull https://github.com/tgravescs/spark SPARK-3788
Alternatively you can review and apply these changes as the patch at:
    https://github.com/apache/spark/pull/4292.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
    This closes #4292
commit cf3b45337a1fb1da6492779709b2bf213bccbb16
Author: Thomas Graves <tgra...@apache.org>
Date: 2014-10-06T14:53:29Z
    newAPIHadoopRDD doesn't properly pass credentials for secure hdfs on yarn
[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/2676#issuecomment-72223705 I'll try to bring it up to date today. I'm out all next week though, so if you find issues someone else might need to take it over.
[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4292#issuecomment-72233657 [Test build #26408 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26408/consoleFull) for PR 4292 at commit [`cf3b453`](https://github.com/apache/spark/commit/cf3b45337a1fb1da6492779709b2bf213bccbb16).
* This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/4292#issuecomment-72233593 @JoshRosen, you had looked at this before; mind taking another look?
[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/4292#issuecomment-72248045 +1
[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/2676#issuecomment-72233527 For whatever reason this pull request didn't update, so I filed a new one: https://github.com/apache/spark/pull/4292 It's rebased, and I made the comment change suggested by @JoshRosen.
[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...
Github user tgravescs closed the pull request at: https://github.com/apache/spark/pull/2676
[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4292#issuecomment-72244501 [Test build #26408 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26408/consoleFull) for PR 4292 at commit [`cf3b453`](https://github.com/apache/spark/commit/cf3b45337a1fb1da6492779709b2bf213bccbb16).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4292#issuecomment-72244512 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26408/
[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...
Github user harishreedharan commented on the pull request: https://github.com/apache/spark/pull/4292#issuecomment-72304336 +1
[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/2676#issuecomment-72133280 @tgravescs mind bringing it up to date?
[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/2676#issuecomment-72133319 I bumped the severity per @harishreedharan's comment.
[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/2676#issuecomment-72133263 Hey guys - sorry, don't block on my comment. If you all think this looks good, just merge it.
[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/2676#discussion_r21382056
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
    @@ -641,6 +641,7 @@ class SparkContext(config: SparkConf) extends Logging {
           kClass: Class[K],
           vClass: Class[V],
           conf: Configuration = hadoopConfiguration): RDD[(K, V)] = {
    +    // mapreduce.Job (NewHadoopJob) merges any credentials for you.
--- End diff --
Sure, I can change the comment.
[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/2676#issuecomment-65814785 I was waiting for clarification from @pwendell on my question about his comment.
[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2676#issuecomment-65724296 /bump; what's the status on this PR? It looks like this is small and probably pretty close to being merged.
[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/2676#discussion_r21345797
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
    @@ -641,6 +641,7 @@ class SparkContext(config: SparkConf) extends Logging {
           kClass: Class[K],
           vClass: Class[V],
           conf: Configuration = hadoopConfiguration): RDD[(K, V)] = {
    +    // mapreduce.Job (NewHadoopJob) merges any credentials for you.
--- End diff --
How about this: The call to `new NewHadoopJob` automatically adds security credentials to `conf`, so we don't need to explicitly add them ourselves.
[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/2676#discussion_r19823688
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
    @@ -641,6 +641,7 @@ class SparkContext(config: SparkConf) extends Logging {
           kClass: Class[K],
           vClass: Class[V],
           conf: Configuration = hadoopConfiguration): RDD[(K, V)] = {
    +    // mapreduce.Job (NewHadoopJob) merges any credentials for you.
--- End diff --
This was just saying that the call to `new NewHadoopJob` adds the credentials to the conf passed in for you, so we don't need an explicit call to `SparkHadoopUtil.get.addCredentials(jconf)`. I can try to rephrase.
[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/2676#discussion_r19825969
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
    @@ -661,7 +662,10 @@ class SparkContext(config: SparkConf) extends Logging {
           fClass: Class[F],
           kClass: Class[K],
           vClass: Class[V]): RDD[(K, V)] = {
    -    new NewHadoopRDD(this, fClass, kClass, vClass, conf)
    +    // Add necessary security credentials to the JobConf. Required to access secure HDFS.
    +    val jconf = new JobConf(conf)
    +    SparkHadoopUtil.get.addCredentials(jconf)
    +    new NewHadoopRDD(this, fClass, kClass, vClass, jconf)
--- End diff --
Sorry, I don't see what you are referring to in `hadoopFile`. Do you mean `hadoopRDD`, which already takes a JobConf? We need a JobConf for the `addCredentials` routine to work. I could look at adding the credentials another way, but the nice thing about creating another JobConf is that if the user passes in a JobConf instance, creating a new one handles transferring any credentials from that one as well.
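The copy-then-add pattern argued for here can be sketched in plain Scala, without Hadoop on the classpath. This is only an illustration under stated assumptions: `SketchJobConf` is a hypothetical stand-in for `JobConf`, `withUserCredentials` is a hypothetical analogue of copying the conf and then calling `addCredentials` on the copy, and the token names are invented. It shows the two properties the comment relies on: credentials already present on the caller's conf carry over to the copy, and the caller's conf itself is left untouched.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a JobConf: settings plus a credential store.
final class SketchJobConf(
    initialSettings: Map[String, String] = Map.empty,
    initialCredentials: Map[String, String] = Map.empty) {
  private val settings = mutable.Map(initialSettings.toSeq: _*)
  private val credentials = mutable.Map(initialCredentials.toSeq: _*)

  def set(key: String, value: String): Unit = settings.update(key, value)
  def get(key: String): Option[String] = settings.get(key)
  def addCredential(alias: String, token: String): Unit = credentials.update(alias, token)
  def credentialAliases: Set[String] = credentials.keySet.toSet

  // Analogue of constructing a new conf from an existing one:
  // both settings and credentials are carried over into an independent copy.
  def copy(): SketchJobConf = new SketchJobConf(settings.toMap, credentials.toMap)
}

object CopyThenAddDemo {
  // Hypothetical analogue of "copy the conf, then add the current user's
  // credentials to the copy" so the caller's conf is never mutated.
  def withUserCredentials(conf: SketchJobConf, userTokens: Map[String, String]): SketchJobConf = {
    val copied = conf.copy()
    userTokens.foreach { case (alias, token) => copied.addCredential(alias, token) }
    copied
  }

  def main(args: Array[String]): Unit = {
    val callerConf = new SketchJobConf()
    callerConf.addCredential("caller-token", "c1")

    val jconf = withUserCredentials(callerConf, Map("hdfs-token" -> "h1"))

    println(jconf.credentialAliases.toList.sorted.mkString(", "))      // caller-token, hdfs-token
    println(callerConf.credentialAliases.toList.sorted.mkString(", ")) // caller-token
  }
}
```

Copying rather than mutating keeps the method free of side effects on the argument, at the cost that later changes to the caller's conf are not seen by the copy.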
[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/2676#issuecomment-61484885 ping @pwendell @mateiz
[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...
Github user pwendell commented on a diff in the pull request: https://github.com/apache/spark/pull/2676#discussion_r19773894
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
    @@ -641,6 +641,7 @@ class SparkContext(config: SparkConf) extends Logging {
           kClass: Class[K],
           vClass: Class[V],
           conf: Configuration = hadoopConfiguration): RDD[(K, V)] = {
    +    // mapreduce.Job (NewHadoopJob) merges any credentials for you.
--- End diff --
This comment isn't very clear to me. What is it trying to say?
[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2676#discussion_r19773950

    --- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
    @@ -661,7 +662,10 @@ class SparkContext(config: SparkConf) extends Logging {
           fClass: Class[F],
           kClass: Class[K],
           vClass: Class[V]): RDD[(K, V)] = {
    -    new NewHadoopRDD(this, fClass, kClass, vClass, conf)
    +    // Add necessary security credentials to the JobConf. Required to access secure HDFS.
    +    val jconf = new JobConf(conf)
    +    SparkHadoopUtil.get.addCredentials(jconf)
    +    new NewHadoopRDD(this, fClass, kClass, vClass, jconf)
    --- End diff --

    In `hadoopFile` we actually mutate the existing conf. Should we modify this one or that one to be more consistent? It does seem more sensible here to duplicate it and then modify the copy.
[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...
GitHub user tgravescs opened a pull request:

    https://github.com/apache/spark/pull/2676

    [SPARK-3778] newAPIHadoopRDD doesn't properly pass credentials for secure hdfs

    https://issues.apache.org/jira/browse/SPARK-3778

    This affects anyone trying to access secure HDFS with something like:

        val lines = {
          val hconf = new Configuration()
          hconf.set("mapred.input.dir", mydir)
          hconf.set("textinputformat.record.delimiter", "\003432\n")
          sc.newAPIHadoopRDD(hconf, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
        }

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tgravescs/spark SPARK-3778

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2676.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #2676

commit c3d6b83332b1ba370bff837d7be09ffd30243262
Author: Thomas Graves tgra...@apache.org
Date: 2014-10-06T14:53:29Z

    newAPIHadoopRDD doesn't properly pass credentials for secure hdfs on yarn
[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2676#issuecomment-58030636 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21331/consoleFull) for PR 2676 at commit [`c3d6b83`](https://github.com/apache/spark/commit/c3d6b83332b1ba370bff837d7be09ffd30243262). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2676#issuecomment-58042286 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21331/consoleFull) for PR 2676 at commit [`c3d6b83`](https://github.com/apache/spark/commit/c3d6b83332b1ba370bff837d7be09ffd30243262). * This patch **passes** unit tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2676#issuecomment-58042300 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21331/ Test PASSed.
[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/2676#issuecomment-58055621 LGTM.