[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...

2015-02-09 Thread tgravescs
Github user tgravescs commented on the pull request:

https://github.com/apache/spark/pull/4292#issuecomment-73537266
  
Sorry for my delay, I was out last week.

I would have to agree with @JoshRosen's last comment. If they have already 
created the RDD then I wouldn't expect any changes to the Hadoop configuration 
to apply.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...

2015-02-03 Thread harishreedharan
Github user harishreedharan commented on the pull request:

https://github.com/apache/spark/pull/4292#issuecomment-72710717
  
I think @JoshRosen is right. I don't think we need to worry about the 
change in the conf after the RDD has been defined. That makes sense.





[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...

2015-02-02 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/4292#issuecomment-72572163
  
LGTM and seems very straightforward.





[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...

2015-02-02 Thread harishreedharan
Github user harishreedharan commented on the pull request:

https://github.com/apache/spark/pull/4292#issuecomment-72554774
  
@pwendell - This is a small enough patch and relatively low risk. It 
would be great to merge this into 1.3.





[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...

2015-02-02 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/4292





[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...

2015-02-02 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/4292#issuecomment-72602613
  
LGTM, too, so I'm going to merge this into `master` (1.3.0).  Thanks!





[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...

2015-02-02 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/4292#discussion_r23985702
  
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -820,7 +822,10 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
   kClass: Class[K],
   vClass: Class[V]): RDD[(K, V)] = {
 assertNotStopped()
-new NewHadoopRDD(this, fClass, kClass, vClass, conf)
+// Add necessary security credentials to the JobConf. Required to access secure HDFS.
+val jconf = new JobConf(conf)
+SparkHadoopUtil.get.addCredentials(jconf)
--- End diff --

Yep, looks like `addCredentials` is implemented as a no-op outside of YARN 
mode, so this should be fine.





[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...

2015-02-02 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/4292#issuecomment-72603547
  
For example:

```scala
15/02/02 22:49:12 INFO SparkILoop: Created spark context..
Spark context available as sc.

scala> import org.apache.hadoop.mapred.{FileInputFormat, InputFormat, JobConf, SequenceFileInputFormat, TextInputFormat}
import org.apache.hadoop.mapred.{FileInputFormat, InputFormat, JobConf, SequenceFileInputFormat, TextInputFormat}

scala> import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.conf.Configuration

scala> val conf = new Configuration()
conf: org.apache.hadoop.conf.Configuration = Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml

scala> val jobConf = new JobConf(conf)
jobConf: org.apache.hadoop.mapred.JobConf = Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml

scala> jobConf.getInt("myInt", 0)
res3: Int = 0

scala> conf.setInt("myInt", 1)

scala> jobConf.getInt("myInt", 0)
res5: Int = 0
```





[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...

2015-02-02 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/4292#issuecomment-72604443
  
@JoshRosen that's a great point; it could cause a behavior regression that 
would be really hard for users to diagnose. @tgravescs, what about deferring 
the injection of the credentials until just before the conf is broadcast?
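
To make the trade-off concrete, here is a rough sketch that models the two options with a plain mutable map standing in for the shared `hadoopConfiguration`. Nothing here is Spark or Hadoop API: `ConfTimingSketch`, `EagerSource`, `DeferredSource`, and the `"credentials"` marker key are all made up for illustration, and the real patch works on a `JobConf` rather than a map.

```scala
import scala.collection.mutable

// Hypothetical model only: a mutable map stands in for hadoopConfiguration.
object ConfTimingSketch {
  type Conf = mutable.Map[String, String]

  // Stand-in for addCredentials: stamps a marker entry onto a copy of the conf.
  private def withCredentials(conf: Map[String, String]): Map[String, String] =
    conf + ("credentials" -> "attached")

  // Eager: snapshot the conf when the source is defined, similar in spirit to
  // calling `new JobConf(conf)` up front. Later edits to `shared` are not seen.
  final class EagerSource(shared: Conf) {
    private val snapshot: Map[String, String] = withCredentials(shared.toMap)
    def effectiveConf: Map[String, String] = snapshot
  }

  // Deferred: keep a reference and copy only when the conf is needed (the
  // "just before broadcast" idea). Edits made in between are picked up.
  final class DeferredSource(shared: Conf) {
    def effectiveConf: Map[String, String] = withCredentials(shared.toMap)
  }

  def main(args: Array[String]): Unit = {
    val shared: Conf = mutable.Map("fs.defaultFS" -> "hdfs://example")
    val eager = new EagerSource(shared)
    val deferred = new DeferredSource(shared)

    // The user mutates the shared conf after both sources were defined.
    shared("fs.s3n.awsAccessKeyId") = "EXAMPLE_KEY"

    println(eager.effectiveConf.contains("fs.s3n.awsAccessKeyId"))
    println(deferred.effectiveConf.contains("fs.s3n.awsAccessKeyId"))
  }
}
```

In this toy model the eager variant misses the late edit and the deferred variant sees it, which is the behavior difference under discussion. Whether deferring is practical in Spark depends on details this sketch does not capture.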





[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...

2015-02-02 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/4292#issuecomment-72604619
  
We could also just leave it as-is and then do something like that if we 
find this is encountered by users.





[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...

2015-02-02 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/4292#issuecomment-72605117
  
I found some [previous discussion](https://issues.apache.org/jira/browse/SPARK-2546?focusedCommentId=14160842&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14160842) of this issue.

I'd say that expecting `sc.hadoopConfiguration` to be mutated by users 
after it's already been used to define RDDs isn't something that we can / 
should realistically hope to support, because there are just too many ways 
that it could break (e.g. defensive copying, serialization, etc.) and because it 
runs counter to user expectations around other types of Spark configurations 
(e.g. user modifications to `SparkConf` after creating a `SparkContext` will not take 
effect).





[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...

2015-02-02 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/4292#issuecomment-72603238
  
Ugh, I just realized that this might potentially regress behavior for some 
weird corner-cases that arise due to our shared mutable `hadoopConfiguration`.  
A common use-case for `sc.hadoopConfiguration` is to pass credentials for S3 
filesystems.  The problem that crops up is when a user has already defined a 
bunch of RDDs and then mutates the configuration to pass credentials: in this 
case, I think this patch will break those user programs because modifications 
to the `hadoopConfiguration` won't be reflected in the `JobConf`.





[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...

2015-01-31 Thread lianhuiwang
Github user lianhuiwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/4292#discussion_r2348
  
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -820,7 +822,10 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
   kClass: Class[K],
   vClass: Class[V]): RDD[(K, V)] = {
 assertNotStopped()
-new NewHadoopRDD(this, fClass, kClass, vClass, conf)
+// Add necessary security credentials to the JobConf. Required to access secure HDFS.
+val jconf = new JobConf(conf)
+SparkHadoopUtil.get.addCredentials(jconf)
--- End diff --

If the mode is not YARN, `SparkHadoopUtil.addCredentials` does not do 
anything, so this change does not resolve the issue in non-YARN mode.





[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...

2015-01-31 Thread harishreedharan
Github user harishreedharan commented on a diff in the pull request:

https://github.com/apache/spark/pull/4292#discussion_r23893987
  
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -820,7 +822,10 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
   kClass: Class[K],
   vClass: Class[V]): RDD[(K, V)] = {
 assertNotStopped()
-new NewHadoopRDD(this, fClass, kClass, vClass, conf)
+// Add necessary security credentials to the JobConf. Required to access secure HDFS.
+val jconf = new JobConf(conf)
+SparkHadoopUtil.get.addCredentials(jconf)
--- End diff --

Since security is supported only in Yarn mode, this should be fine. 
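
For readers following along, the shape of this arrangement can be sketched as a hook whose default does nothing and whose YARN-specific variant adds tokens. This is only a toy model under my own assumptions: `CredentialHookSketch`, `ToyJobConf`, `BaseUtil`, and `YarnLikeUtil` are hypothetical names, not Spark's actual classes, and the token strings are placeholders.

```scala
import scala.collection.mutable

// Hypothetical model of a deploy-mode-specific hook; not Spark's real classes.
object CredentialHookSketch {
  // Toy job configuration: just a bag of credential tokens.
  final class ToyJobConf {
    val tokens: mutable.Set[String] = mutable.Set.empty
  }

  // Default behaviour: nothing to add, as in the non-YARN case described above.
  class BaseUtil {
    def addCredentials(conf: ToyJobConf): Unit = ()
  }

  // YARN-like behaviour: copy the current user's tokens onto the conf.
  class YarnLikeUtil(userTokens: Set[String]) extends BaseUtil {
    override def addCredentials(conf: ToyJobConf): Unit =
      conf.tokens ++= userTokens
  }

  // The call site is identical regardless of which mode is active.
  def prepare(util: BaseUtil): ToyJobConf = {
    val conf = new ToyJobConf
    util.addCredentials(conf)
    conf
  }

  def main(args: Array[String]): Unit = {
    println(prepare(new BaseUtil).tokens.isEmpty)
    println(prepare(new YarnLikeUtil(Set("hdfs-delegation-token"))).tokens)
  }
}
```

The point of the sketch is that the unconditional call is harmless where the hook is a no-op, so the call site does not need a mode check.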





[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...

2015-01-30 Thread tgravescs
GitHub user tgravescs opened a pull request:

https://github.com/apache/spark/pull/4292

[SPARK-3778] newAPIHadoopRDD doesn't properly pass credentials for secure 
hdfs

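This was previously https://github.com/apache/spark/pull/2676

https://issues.apache.org/jira/browse/SPARK-3778

This affects anyone trying to access secure HDFS with something like:
// Imports assumed here: newAPIHadoopRDD needs the new-API (mapreduce) input format.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val lines = {
  val hconf = new Configuration()
  hconf.set("mapred.input.dir", "mydir")
  hconf.set("textinputformat.record.delimiter", "\003432\n")
  sc.newAPIHadoopRDD(hconf, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
}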

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tgravescs/spark SPARK-3788

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4292.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4292


commit cf3b45337a1fb1da6492779709b2bf213bccbb16
Author: Thomas Graves tgra...@apache.org
Date:   2014-10-06T14:53:29Z

newAPIHadoopRDD doesn't properly pass credentials for secure hdfs on yarn







[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...

2015-01-30 Thread tgravescs
Github user tgravescs commented on the pull request:

https://github.com/apache/spark/pull/2676#issuecomment-72223705
  
I'll try to bring it up to date today.  I'm out all next week though so if 
you find issues someone else might need to take it over.





[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...

2015-01-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4292#issuecomment-72233657
  
[Test build #26408 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26408/consoleFull) for PR 4292 at commit [`cf3b453`](https://github.com/apache/spark/commit/cf3b45337a1fb1da6492779709b2bf213bccbb16).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...

2015-01-30 Thread tgravescs
Github user tgravescs commented on the pull request:

https://github.com/apache/spark/pull/4292#issuecomment-72233593
  
@JoshRosen, you had looked at this before; mind taking another look?





[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...

2015-01-30 Thread vanzin
Github user vanzin commented on the pull request:

https://github.com/apache/spark/pull/4292#issuecomment-72248045
  
+1





[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...

2015-01-30 Thread tgravescs
Github user tgravescs commented on the pull request:

https://github.com/apache/spark/pull/2676#issuecomment-72233527
  
For whatever reason this pull request didn't update. Filed a new one:
https://github.com/apache/spark/pull/4292

It's rebased and I made the comment change suggested by @JoshRosen.





[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...

2015-01-30 Thread tgravescs
Github user tgravescs closed the pull request at:

https://github.com/apache/spark/pull/2676





[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...

2015-01-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4292#issuecomment-72244501
  
[Test build #26408 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26408/consoleFull) for PR 4292 at commit [`cf3b453`](https://github.com/apache/spark/commit/cf3b45337a1fb1da6492779709b2bf213bccbb16).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...

2015-01-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4292#issuecomment-72244512
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26408/





[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...

2015-01-30 Thread harishreedharan
Github user harishreedharan commented on the pull request:

https://github.com/apache/spark/pull/4292#issuecomment-72304336
  
+1





[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...

2015-01-29 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/2676#issuecomment-72133280
  
@tgravescs mind bringing it up to date?





[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...

2015-01-29 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/2676#issuecomment-72133319
  
I bumped the severity per @harishreedharan's comment.





[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...

2015-01-29 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/2676#issuecomment-72133263
  
Hey guys - sorry don't block on my comment. If you all think this looks 
good, just merge it.





[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...

2014-12-05 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/2676#discussion_r21382056
  
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -641,6 +641,7 @@ class SparkContext(config: SparkConf) extends Logging {
   kClass: Class[K],
   vClass: Class[V],
   conf: Configuration = hadoopConfiguration): RDD[(K, V)] = {
+// mapreduce.Job (NewHadoopJob) merges any credentials for you.
--- End diff --

sure I can change the comment





[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...

2014-12-05 Thread tgravescs
Github user tgravescs commented on the pull request:

https://github.com/apache/spark/pull/2676#issuecomment-65814785
  
I was waiting for clarification from @pwendell on my question about his 
comment.





[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...

2014-12-04 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/2676#issuecomment-65724296
  
/bump; what's the status on this PR?  It looks like this is small and 
probably pretty close to being merged.





[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...

2014-12-04 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/2676#discussion_r21345797
  
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -641,6 +641,7 @@ class SparkContext(config: SparkConf) extends Logging {
   kClass: Class[K],
   vClass: Class[V],
   conf: Configuration = hadoopConfiguration): RDD[(K, V)] = {
+// mapreduce.Job (NewHadoopJob) merges any credentials for you.
--- End diff --

How about this:

 The call to `new NewHadoopJob` automatically adds security credentials 
to `conf`, so we don't need to explicitly add them ourselves


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...

2014-11-04 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/2676#discussion_r19823688
  
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -641,6 +641,7 @@ class SparkContext(config: SparkConf) extends Logging {
   kClass: Class[K],
   vClass: Class[V],
   conf: Configuration = hadoopConfiguration): RDD[(K, V)] = {
+// mapreduce.Job (NewHadoopJob) merges any credentials for you.
--- End diff --

This was just saying that the call to new NewHadoopJob adds the 
credentials to the conf passed in for you, so we don't need an explicit call to 
SparkHadoopUtil.get.addCredentials(jconf).

I can try to rephrase






[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...

2014-11-04 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/2676#discussion_r19825969
  
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -661,7 +662,10 @@ class SparkContext(config: SparkConf) extends Logging {
   fClass: Class[F],
   kClass: Class[K],
   vClass: Class[V]): RDD[(K, V)] = {
-new NewHadoopRDD(this, fClass, kClass, vClass, conf)
+// Add necessary security credentials to the JobConf. Required to access secure HDFS.
+val jconf = new JobConf(conf)
+SparkHadoopUtil.get.addCredentials(jconf)
+new NewHadoopRDD(this, fClass, kClass, vClass, jconf)
--- End diff --

Sorry, I don't see what you are referring to in `hadoopFile`. Do you mean 
`hadoopRDD`, which already takes a `JobConf`?

We need a `JobConf` for the `addCredentials` routine to work. I could look at 
adding the credentials another way, but the nice thing about creating another 
`JobConf` is that if the user passes in a `JobConf` instance, creating a new one 
handles transferring any credentials from that one as well.
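
As a loose illustration of that last point, the sketch below models the behavior with toy classes instead of the Hadoop ones. Everything in it is hypothetical (`ConfCopySketch`, `ToyConf`, `ToyJobConf`, `wrap`, and the token strings); it only mirrors the described behavior, where wrapping a conf that is already job-level keeps its credentials and wrapping a plain conf starts with none.

```scala
// Hypothetical model of "wrap the user's conf in a new job-level conf".
object ConfCopySketch {
  // Toy counterpart of a plain Configuration: key/value properties only.
  class ToyConf(val props: Map[String, String])

  // Toy counterpart of a JobConf: properties plus credentials. Building it
  // from another ToyJobConf carries that instance's credentials over;
  // building it from a plain ToyConf starts from `extra` alone.
  class ToyJobConf(base: ToyConf, extra: Set[String] = Set.empty)
      extends ToyConf(base.props) {
    val credentials: Set[String] = base match {
      case job: ToyJobConf => job.credentials ++ extra
      case _               => extra
    }
  }

  // Always wrap, whatever the caller passed in, then add cluster credentials.
  def wrap(userConf: ToyConf, clusterCreds: Set[String]): ToyJobConf =
    new ToyJobConf(userConf, clusterCreds)

  def main(args: Array[String]): Unit = {
    val plain = new ToyConf(Map("mapred.input.dir" -> "mydir"))
    val userJob = new ToyJobConf(plain, Set("user-token"))
    println(wrap(plain, Set("cluster-token")).credentials)
    println(wrap(userJob, Set("cluster-token")).credentials)
  }
}
```

In this model the caller never has to check which kind of conf the user supplied, which is the convenience being argued for.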





[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...

2014-11-03 Thread tgravescs
Github user tgravescs commented on the pull request:

https://github.com/apache/spark/pull/2676#issuecomment-61484885
  
ping @pwendell @mateiz 





[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...

2014-11-03 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/2676#discussion_r19773894
  
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -641,6 +641,7 @@ class SparkContext(config: SparkConf) extends Logging {
   kClass: Class[K],
   vClass: Class[V],
   conf: Configuration = hadoopConfiguration): RDD[(K, V)] = {
+// mapreduce.Job (NewHadoopJob) merges any credentials for you.
--- End diff --

This comment isn't very clear to me. What is it trying to say?





[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...

2014-11-03 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/2676#discussion_r19773950
  
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -661,7 +662,10 @@ class SparkContext(config: SparkConf) extends Logging {
   fClass: Class[F],
   kClass: Class[K],
   vClass: Class[V]): RDD[(K, V)] = {
-new NewHadoopRDD(this, fClass, kClass, vClass, conf)
+// Add necessary security credentials to the JobConf. Required to access secure HDFS.
+val jconf = new JobConf(conf)
+SparkHadoopUtil.get.addCredentials(jconf)
+new NewHadoopRDD(this, fClass, kClass, vClass, jconf)
--- End diff --

In `hadoopFile` we actually mutate the existing conf. Should we modify this 
one or that one to be more consistent? It does seem more sensible here to 
duplicate it and then modify the copy.
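
The trade-off can be sketched with Hadoop's `Configuration` copy constructor. This is an illustrative example only, assuming Hadoop is on the classpath; `ConfCopySketch`, `withExtraSetting`, and the `example.key` property are hypothetical names, not part of Spark.

```scala
import org.apache.hadoop.conf.Configuration

object ConfCopySketch {
  // Copies the caller's conf before changing it, so the caller's
  // object is left exactly as it was passed in.
  def withExtraSetting(conf: Configuration): Configuration = {
    val copy = new Configuration(conf)
    copy.set("example.key", "set-by-callee")
    copy
  }

  def main(args: Array[String]): Unit = {
    val userConf = new Configuration(false) // false: skip default resources
    userConf.set("example.key", "set-by-user")

    val modified = withExtraSetting(userConf)
    assert(modified.get("example.key") == "set-by-callee")
    assert(userConf.get("example.key") == "set-by-user")
  }
}
```

If the helper mutated `conf` in place instead, the second assertion would fail, because the caller would observe the callee's value.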





[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...

2014-10-06 Thread tgravescs
GitHub user tgravescs opened a pull request:

https://github.com/apache/spark/pull/2676

[SPARK-3778] newAPIHadoopRDD doesn't properly pass credentials for secure hdfs

https://issues.apache.org/jira/browse/SPARK-3778

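This affects anyone trying to access secure HDFS with something like:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val lines = {
  val hconf = new Configuration()
  // mydir is the input directory on the secure cluster
  hconf.set("mapred.input.dir", mydir)
  hconf.set("textinputformat.record.delimiter", "\003432\n")
  sc.newAPIHadoopRDD(hconf, classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text])
}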

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tgravescs/spark SPARK-3778

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2676.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2676


commit c3d6b83332b1ba370bff837d7be09ffd30243262
Author: Thomas Graves tgra...@apache.org
Date:   2014-10-06T14:53:29Z

newAPIHadoopRDD doesn't properly pass credentials for secure hdfs on yarn







[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...

2014-10-06 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2676#issuecomment-58030636
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21331/consoleFull) for PR 2676 at commit [`c3d6b83`](https://github.com/apache/spark/commit/c3d6b83332b1ba370bff837d7be09ffd30243262).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...

2014-10-06 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2676#issuecomment-58042286
  
  [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21331/consoleFull) for PR 2676 at commit [`c3d6b83`](https://github.com/apache/spark/commit/c3d6b83332b1ba370bff837d7be09ffd30243262).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...

2014-10-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2676#issuecomment-58042300
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21331/
Test PASSed.





[GitHub] spark pull request: [SPARK-3778] newAPIHadoopRDD doesn't properly ...

2014-10-06 Thread vanzin
Github user vanzin commented on the pull request:

https://github.com/apache/spark/pull/2676#issuecomment-58055621
  
LGTM.

