[GitHub] spark issue #14049: [SPARK-16369][MLlib] tallSkinnyQR of RowMatrix should aw...

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14049
  
**[Test build #61932 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61932/consoleFull)**
 for PR 14049 at commit 
[`6705a38`](https://github.com/apache/spark/commit/6705a3861483ded60a1659b9045c111f06e1e0e5).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14096: [SPARK-16425][R] `describe()` should consider numeric/st...

2016-07-07 Thread shivaram
Github user shivaram commented on the issue:

https://github.com/apache/spark/pull/14096
  
Yeah we should just call it with empty columns (instead of all the columns) 
and let the Scala side do the appropriate thing. 
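For reference, a minimal spark-shell sketch of the idea, using the standard `Dataset.describe(cols: String*)` varargs API (the `people.json` file is the example data shipped with Spark):

```scala
scala> val df = spark.read.json("examples/src/main/resources/people.json")

scala> df.describe().show()      // no columns passed: the JVM side picks the suitable columns itself
scala> df.describe("age").show() // an explicit column list still works as before
```

So the R wrapper can forward an empty column list and rely on the Scala implementation to decide which columns are aggregatable.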





[GitHub] spark issue #14093: SPARK-16420: Ensure compression streams are closed.

2016-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14093
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14093: SPARK-16420: Ensure compression streams are closed.

2016-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14093
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61926/
Test PASSed.





[GitHub] spark issue #14093: SPARK-16420: Ensure compression streams are closed.

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14093
  
**[Test build #61926 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61926/consoleFull)**
 for PR 14093 at commit 
[`601f934`](https://github.com/apache/spark/commit/601f934372922b3b68424d3ef5a3cc81fd0f4500).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14096: [SPARK-16425][R] `describe()` should consider numeric/st...

2016-07-07 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/14096
  
FYI, here is the result of the Scala example.
```scala
scala> import org.apache.spark.sql.functions.lit
scala> val df = spark.read.json("examples/src/main/resources/people.json")
scala> df.withColumn("boolean", lit(true)).show()
+----+-------+-------+
| age|   name|boolean|
+----+-------+-------+
|null|Michael|   true|
|  30|   Andy|   true|
|  19| Justin|   true|
+----+-------+-------+

scala> df.withColumn("boolean", lit(true)).describe().show()
+-------+------------------+
|summary|               age|
+-------+------------------+
|  count|                 2|
|   mean|              24.5|
| stddev|7.7781745930520225|
|    min|                19|
|    max|                30|
+-------+------------------+
```





[GitHub] spark issue #14095: [SPARK-16429][SQL] Include `StringType` columns in Scala...

2016-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14095
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61929/
Test FAILed.





[GitHub] spark issue #14095: [SPARK-16429][SQL] Include `StringType` columns in Scala...

2016-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14095
  
Merged build finished. Test FAILed.





[GitHub] spark issue #14095: [SPARK-16429][SQL] Include `StringType` columns in Scala...

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14095
  
**[Test build #61929 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61929/consoleFull)**
 for PR 14095 at commit 
[`df2edd7`](https://github.com/apache/spark/commit/df2edd730216e659dbcebdcbda61dd67fbcf8d55).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14096: [SPARK-16425][R] `describe()` should consider numeric/st...

2016-07-07 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/14096
  
I mean `colList <- as.list(c(columns(x)))`. We should not do this.





[GitHub] spark pull request #13984: [SPARK-16310][SPARKR] R na.string-like default fo...

2016-07-07 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/13984





[GitHub] spark issue #14096: [SPARK-16425][R] `describe()` should consider numeric/st...

2016-07-07 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/14096
  
Oh, I see your point.
The difference comes from the `all column retrieval` step in SparkR.
We can make this consistent with Scala/Python by removing the `all column 
retrieval`.
That would be simpler!






[GitHub] spark issue #14082: [SPARK-16381][SQL][SparkR] Update SQL examples and progr...

2016-07-07 Thread shivaram
Github user shivaram commented on the issue:

https://github.com/apache/spark/pull/14082
  
I'll take a look at this today. Also cc @felixcheung 





[GitHub] spark issue #14096: [SPARK-16425][R] `describe()` should consider numeric/st...

2016-07-07 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/14096
  
Here:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L1922





[GitHub] spark issue #14096: [SPARK-16425][R] `describe()` should consider numeric/st...

2016-07-07 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/14096
  
Currently, Scala/Python already do column-type checking for this.
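For illustration, a rough sketch of the kind of type check the JVM side applies (simplified; not the actual `Dataset.scala` code):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{NumericType, StringType}

// Keep only the columns that describe()/summary() can aggregate:
// numeric columns and (per SPARK-16429) string columns.
def aggregatableColumnNames(df: DataFrame): Seq[String] =
  df.schema.fields.toSeq.collect {
    case f if f.dataType.isInstanceOf[NumericType] || f.dataType == StringType => f.name
  }
```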





[GitHub] spark issue #13984: [SPARK-16310][SPARKR] R na.string-like default for csv s...

2016-07-07 Thread shivaram
Github user shivaram commented on the issue:

https://github.com/apache/spark/pull/13984
  
LGTM. Merging this to master and branch-2.0





[GitHub] spark pull request #13984: [SPARK-16310][SPARKR] R na.string-like default fo...

2016-07-07 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/13984#discussion_r69997940
  
--- Diff: R/pkg/R/SQLContext.R ---
@@ -744,6 +747,9 @@ read.df.default <- function(path = NULL, source = NULL, 
schema = NULL, ...) {
   if (is.null(source)) {
 source <- getDefaultSqlSource()
   }
+  if (source == "csv" && is.null(options[["nullValue"]])) {
--- End diff --

Yeah, this is the more conservative option - I guess that's fine for now and 
we can revisit this if required.
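For context, a hedged Scala sketch of what the proposed R default corresponds to on the JVM side (the file path here is hypothetical):

```scala
// Only the csv source gets an explicit nullValue; other sources are left untouched,
// mirroring R's na.strings behavior for read.csv.
val df = spark.read
  .option("nullValue", "NA")   // treat the literal string "NA" as null
  .csv("path/to/people.csv")   // hypothetical path
```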





[GitHub] spark issue #14096: [SPARK-16425][R] `describe()` should consider numeric/st...

2016-07-07 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/14096
  
This failure happens only in SparkR because SparkR blindly tries every column.





[GitHub] spark issue #14096: [SPARK-16425][R] `describe()` should consider numeric/st...

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14096
  
**[Test build #61933 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61933/consoleFull)**
 for PR 14096 at commit 
[`f0bd1d6`](https://github.com/apache/spark/commit/f0bd1d63f5aa4b1ad812a083563409308fab3d42).





[GitHub] spark issue #14066: [MINOR] [BUILD] Download Maven 3.3.9 instead of 3.3.3 be...

2016-07-07 Thread lresende
Github user lresende commented on the issue:

https://github.com/apache/spark/pull/14066
  
The issue here is that releases keep getting archived when new releases 
come up. For old releases (or by default) we could use 
https://archive.apache.org/dist/maven/maven-3/, which is always available, but 
that puts a little more load on the "Apache Infrastructure". If you think we 
should move to using the archive, I could provide a patch... 





[GitHub] spark issue #14096: [SPARK-16425][R] `describe()` should consider numeric/st...

2016-07-07 Thread shivaram
Github user shivaram commented on the issue:

https://github.com/apache/spark/pull/14096
  
I'm not sure this is something we should be fixing just on the R frontend.  
What happens when we run the query from Scala / Python? If we get the same 
error, shouldn't we be fixing it in Scala? 





[GitHub] spark pull request #14071: [SPARK-16397][SQL] make CatalogTable more general...

2016-07-07 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14071#discussion_r69997365
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala
 ---
@@ -403,17 +400,18 @@ object CreateDataSourceTableUtils extends Logging {
   assert(partitionColumns.isEmpty)
   assert(relation.partitionSchema.isEmpty)
 
+  var storage = CatalogStorageFormat(
+locationUri = None,
--- End diff --

Any reason why this `locationUri` is set to `None`? It sounds like the 
original value is `Some(relation.location.paths.map(_.toUri.toString).head)`.





[GitHub] spark issue #14096: [SPARK-16425][R] `describe()` should consider numeric/st...

2016-07-07 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/14096
  
Hi, @shivaram.
Could you review this PR?





[GitHub] spark pull request #14096: [SPARK-16425][R] `describe()` should consider num...

2016-07-07 Thread dongjoon-hyun
GitHub user dongjoon-hyun opened a pull request:

https://github.com/apache/spark/pull/14096

[SPARK-16425][R] `describe()` should consider numeric/string-type columns

## What changes were proposed in this pull request?

This PR prevents errors when `summary(df)` is called for a `SparkDataFrame` 
with non-numeric, non-string columns. This failure happens only in `SparkR`.

**Before**
```r
> df <- createDataFrame(faithful)
> df <- withColumn(df, "boolean", df$waiting==79)
> summary(df)
16/07/07 14:15:16 ERROR RBackendHandler: describe on 34 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : 
  org.apache.spark.sql.AnalysisException: cannot resolve 'avg(`boolean`)' 
due to data type mismatch: function average requires numeric types, not 
BooleanType;
```

**After**
```r
> df <- createDataFrame(faithful)
> df <- withColumn(df, "boolean", df$waiting==79)
> summary(df)
SparkDataFrame[summary:string, eruptions:string, waiting:string]
```

## How was this patch tested?

Pass the Jenkins tests with an updated test case.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dongjoon-hyun/spark SPARK-16425

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14096.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14096


commit f0bd1d63f5aa4b1ad812a083563409308fab3d42
Author: Dongjoon Hyun 
Date:   2016-07-07T21:57:59Z

[SPARK-16425][R] `describe()` should consider numeric/string-type columns







[GitHub] spark issue #14094: [SPARK-16430][SQL][STREAMING] Add option maxFilesPerTrig...

2016-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14094
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61927/
Test FAILed.





[GitHub] spark issue #14094: [SPARK-16430][SQL][STREAMING] Add option maxFilesPerTrig...

2016-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14094
  
Merged build finished. Test FAILed.





[GitHub] spark issue #14094: [SPARK-16430][SQL][STREAMING] Add option maxFilesPerTrig...

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14094
  
**[Test build #61927 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61927/consoleFull)**
 for PR 14094 at commit 
[`ddd9426`](https://github.com/apache/spark/commit/ddd9426281e743af205f2a3f56be3535cd584b2d).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14022: [SPARK-16272][core] Allow config values to reference con...

2016-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14022
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61924/
Test PASSed.





[GitHub] spark issue #14022: [SPARK-16272][core] Allow config values to reference con...

2016-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14022
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14022: [SPARK-16272][core] Allow config values to reference con...

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14022
  
**[Test build #61924 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61924/consoleFull)**
 for PR 14022 at commit 
[`392bddc`](https://github.com/apache/spark/commit/392bddc57eaefb09c73902ea041f05705d9498aa).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14079: [SPARK-8425][CORE] New Blacklist Mechanism

2016-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14079
  
Merged build finished. Test FAILed.





[GitHub] spark issue #14079: [SPARK-8425][CORE] New Blacklist Mechanism

2016-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14079
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61931/
Test FAILed.





[GitHub] spark issue #14079: [SPARK-8425][CORE] New Blacklist Mechanism

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14079
  
**[Test build #61931 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61931/consoleFull)**
 for PR 14079 at commit 
[`cf58374`](https://github.com/apache/spark/commit/cf5837410818dae093ef15617cb42336a14408db).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14049: [SPARK-16369][MLlib] tallSkinnyQR of RowMatrix should aw...

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14049
  
**[Test build #61932 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61932/consoleFull)**
 for PR 14049 at commit 
[`6705a38`](https://github.com/apache/spark/commit/6705a3861483ded60a1659b9045c111f06e1e0e5).





[GitHub] spark pull request #14065: [SPARK-14743][YARN][WIP] Add a configurable token...

2016-07-07 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14065#discussion_r69992687
  
--- Diff: 
yarn/src/main/scala/org/apache/spark/deploy/yarn/token/ServiceTokenProvider.scala
 ---
@@ -0,0 +1,75 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.deploy.yarn.token
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.security.{Credentials, UserGroupInformation}
+import org.apache.hadoop.security.token.Token
+
+import org.apache.spark.SparkConf
+
+/**
+ * An interface to provide tokens for service, any service wants to 
communicate with Spark
+ * through token way needs to implement this interface and register into
+ * [[ConfigurableTokenManager]] through configurations.
+ */
+trait ServiceTokenProvider {
+
+  /**
+   * Name of the ServiceTokenProvider, should be unique. Using this to 
distinguish different
+   * service.
+   */
+  def serviceName: String
+
+  /**
+   * Used to indicate whether a token is required.
+   */
+  def isTokenRequired(conf: Configuration): Boolean = {
+UserGroupInformation.isSecurityEnabled
+  }
+
+  /**
+   *  Obtain tokens from this service, tokens will be added into 
Credentials and return as array.
+   */
+  def obtainTokensFromService(
--- End diff --

If you follow Tom's suggestion and turn this into a generic 
"obtainCredentials" method, then you could potentially merge it with 
`getTimeFromNowToRenewal` too.

e.g. the provider is responsible for adding the tokens to the `Credentials` 
object, and it returns when it should be called again to renew those tokens (or 
obtain new credentials). One less method in the interface!
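A sketch of what that merged interface could look like; the names `ServiceCredentialProvider` and `obtainCredentials` are placeholders for illustration, not the PR's actual API:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.Credentials
import org.apache.spark.SparkConf

trait ServiceCredentialProvider {
  def serviceName: String

  def credentialsRequired(conf: Configuration): Boolean

  /**
   * Add this service's tokens to `creds` and return the time (in ms) at which
   * the provider should be called again, or None if no renewal is needed.
   */
  def obtainCredentials(
      sparkConf: SparkConf,
      serviceConf: Configuration,
      creds: Credentials): Option[Long]
}
```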





[GitHub] spark issue #14092: [SPARK-16419][SQL] EnsureRequirements adds extra Sort to...

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14092
  
**[Test build #3169 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3169/consoleFull)**
 for PR 14092 at commit 
[`b4b02bf`](https://github.com/apache/spark/commit/b4b02bf3879daf9a4532b61a019ea33b0f3ff835).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14079: [SPARK-8425][CORE] New Blacklist Mechanism

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14079
  
**[Test build #61931 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61931/consoleFull)**
 for PR 14079 at commit 
[`cf58374`](https://github.com/apache/spark/commit/cf5837410818dae093ef15617cb42336a14408db).





[GitHub] spark issue #14095: [SPARK-16429][SQL] Include `StringType` columns in Scala...

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14095
  
**[Test build #61930 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61930/consoleFull)**
 for PR 14095 at commit 
[`b6673cb`](https://github.com/apache/spark/commit/b6673cb9ba1e9b5095ceaee8343aac08cc9aea5c).





[GitHub] spark issue #14095: [SPARK-16429][SQL] Include `StringType` columns in Scala...

2016-07-07 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/14095
  
Thank you for the fast review, @rxin. I updated it.





[GitHub] spark issue #14079: [SPARK-8425][CORE] New Blacklist Mechanism

2016-07-07 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/14079
  
I took another look at having BlacklistTracker just be an option, rather 
than having a NoopBlacklist.  After some other cleanup, I decided it made more 
sense to go back to the option, but it's in one commit so it's easy to go either way: 
https://github.com/apache/spark/pull/14079/commits/a34e9aeb695958c749d306595d1adebe0207fdf9





[GitHub] spark issue #14092: [SPARK-16419][SQL] EnsureRequirements adds extra Sort to...

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14092
  
**[Test build #3168 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3168/consoleFull)**
 for PR 14092 at commit 
[`b4b02bf`](https://github.com/apache/spark/commit/b4b02bf3879daf9a4532b61a019ea33b0f3ff835).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #14095: [SPARK-16429][SQL] Include `StringType` columns i...

2016-07-07 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/14095#discussion_r69991546
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -228,6 +228,15 @@ class Dataset[T] private[sql](
 }
   }
 
+  private[sql] def aggregatableColumns: Seq[Expression] = {
--- End diff --

That would be better.





[GitHub] spark pull request #14065: [SPARK-14743][YARN][WIP] Add a configurable token...

2016-07-07 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14065#discussion_r69991485
  
--- Diff: 
yarn/src/main/scala/org/apache/spark/deploy/yarn/token/ServiceTokenProvider.scala
 ---
@@ -0,0 +1,75 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.deploy.yarn.token
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.security.{Credentials, UserGroupInformation}
+import org.apache.hadoop.security.token.Token
+
+import org.apache.spark.SparkConf
+
+/**
+ * An interface to provide tokens for service, any service wants to 
communicate with Spark
+ * through token way needs to implement this interface and register into
+ * [[ConfigurableTokenManager]] through configurations.
+ */
+trait ServiceTokenProvider {
+
+  /**
+   * Name of the ServiceTokenProvider, should be unique. Using this to 
distinguish different
+   * service.
+   */
+  def serviceName: String
+
+  /**
+   * Used to indicate whether a token is required.
+   */
+  def isTokenRequired(conf: Configuration): Boolean = {
+UserGroupInformation.isSecurityEnabled
+  }
+
+  /**
+   *  Obtain tokens from this service, tokens will be added into 
Credentials and return as array.
+   */
+  def obtainTokensFromService(
+  sparkConf: SparkConf,
+  serviceConf: Configuration,
+  creds: Credentials): Array[Token[_]]
+}
+
+/**
+ * An interface for service in which token can be renewable, any 
[[ServiceTokenProvider]] in which
+ * token can be renewable should also implement this interface, Spark's 
internal time-based
+ * token renewal mechanism will invoke the methods to update the tokens 
periodically.
+ */
+trait ServiceTokenRenewable {
+
+  /**
+   * Get the token renewal interval from this service. This renewal 
interval will be used in
+   * periodical token renewal mechanism.
+   */
+  def getTokenRenewalInterval(sparkConf: SparkConf, serviceConf: 
Configuration): Long
+
+  /**
+   * Get the time length from now to next renewal.
+   */
+  def getTimeFromNowToRenewal(
--- End diff --

You only really need this method in the interface, right? The token 
provider should know what info it needs to calculate this value. It might not 
even need `getTokenRenewalInterval` for that (HDFS does, but then that logic 
should live inside the HDFS provider).

At that point, you could just merge both interfaces and have this method 
return an `Option` (None = no renewal necessary), or some magic value (e.g. 
`-1`) to indicate no renewal.





[GitHub] spark pull request #14065: [SPARK-14743][YARN][WIP] Add a configurable token...

2016-07-07 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14065#discussion_r69990872
  
--- Diff: 
yarn/src/main/scala/org/apache/spark/deploy/yarn/token/ServiceTokenProvider.scala
 ---
@@ -0,0 +1,75 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.deploy.yarn.token
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.security.{Credentials, UserGroupInformation}
+import org.apache.hadoop.security.token.Token
+
+import org.apache.spark.SparkConf
+
+/**
+ * An interface to provide tokens for service, any service wants to 
communicate with Spark
+ * through token way needs to implement this interface and register into
+ * [[ConfigurableTokenManager]] through configurations.
+ */
+trait ServiceTokenProvider {
+
+  /**
+   * Name of the ServiceTokenProvider, should be unique. Using this to 
distinguish different
+   * service.
+   */
+  def serviceName: String
+
+  /**
+   * Used to indicate whether a token is required.
+   */
+  def isTokenRequired(conf: Configuration): Boolean = {
+UserGroupInformation.isSecurityEnabled
+  }
+
+  /**
+   *  Obtain tokens from this service, tokens will be added into 
Credentials and return as array.
+   */
+  def obtainTokensFromService(
+  sparkConf: SparkConf,
+  serviceConf: Configuration,
--- End diff --

Note the name here is misleading. It won't be the service's conf, but 
really a `YarnConfiguration`. Note how both Hive and HBase providers have to 
load their own configuration to be able to see service-specific settings.
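As a simplified illustration of that point (the real providers go through reflection because HBase is not a compile-time dependency of the yarn module), a provider can build its own service configuration on top of the passed-in YARN conf:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.HBaseConfiguration

// Overlay hbase-default.xml / hbase-site.xml on the YARN configuration so that
// service-specific settings become visible to the provider.
def hbaseServiceConf(yarnConf: Configuration): Configuration =
  HBaseConfiguration.create(yarnConf)
```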





[GitHub] spark pull request #14065: [SPARK-14743][YARN][WIP] Add a configurable token...

2016-07-07 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14065#discussion_r69990265
  
--- Diff: 
yarn/src/main/scala/org/apache/spark/deploy/yarn/token/ConfigurableTokenManager.scala
 ---
@@ -0,0 +1,214 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.deploy.yarn.token
+
+import scala.collection.mutable
+import scala.util.control.NonFatal
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.security.Credentials
+import org.apache.hadoop.security.token.Token
+
+import org.apache.spark.{SparkConf, SparkException}
+import org.apache.spark.internal.Logging
+import org.apache.spark.util.Utils
+
+/**
+ * A [[ConfigurableTokenManager]] to manage all the token providers 
register in this class. Also
+ * it provides other modules the functionality to obtain tokens, get token 
renewal interval and
+ * calculate the time length till next renewal.
+ *
+ * By default ConfigurableTokenManager has 3 built-in token providers, 
HDFSTokenProvider,
+ * HiveTokenProvider and HBaseTokenProvider, and this 3 token providers 
can also be controlled
+ * by configuration spark.yarn.security.tokens.{service}.enabled, if it is 
set to false, this
+ * provider will not be loaded.
+ *
+ * For other token providers which need to be loaded in should:
+ * 1. Implement [[ServiceTokenProvider]] or [[ServiceTokenRenewable]] if 
token renewal is
+ * required for this service.
+ * 2. set spark.yarn.security.tokens.{service}.enabled to true
+ * 3. Specify the class name through 
spark.yarn.security.tokens.{service}.class
+ *
+ */
+final class ConfigurableTokenManager private[yarn] (sparkConf: SparkConf) 
extends Logging {
+  private val tokenProviderEnabledConfig = 
"spark\\.yarn\\.security\\.tokens\\.(.+)\\.enabled".r
+  private val tokenProviderClsConfig = 
"spark.yarn.security.tokens.%s.class"
+
+  // Maintain all the registered token providers
+  private val tokenProviders = mutable.HashMap[String, 
ServiceTokenProvider]()
+
+  private val defaultTokenProviders = Map(
+"hdfs" -> "org.apache.spark.deploy.yarn.token.HDFSTokenProvider",
+"hive" -> "org.apache.spark.deploy.yarn.token.HiveTokenProvider",
+"hbase" -> "org.apache.spark.deploy.yarn.token.HBaseTokenProvider"
+  )
+
+  // AMDelegationTokenRenewer, this will only be create and started in the 
AM
+  private var _delegationTokenRenewer: AMDelegationTokenRenewer = null
+
+  // ExecutorDelegationTokenUpdater, this will only be created and started 
in the driver and
+  // executor side.
+  private var _delegationTokenUpdater: ExecutorDelegationTokenUpdater = 
null
+
+  def initialize(): Unit = {
+// Copy SparkConf and add default enabled token provider 
configurations to SparkConf.
+val clonedConf = sparkConf.clone
+defaultTokenProviders.keys.foreach { key =>
+  clonedConf.setIfMissing(s"spark.yarn.security.tokens.$key.enabled", 
"true")
+}
+
+// Instantialize all the service token providers according to the 
configurations.
+clonedConf.getAll.filter { case (key, value) =>
+  if (tokenProviderEnabledConfig.findPrefixOf(key).isDefined) {
+value.toBoolean
+  } else {
+false
+  }
+}.map { case (key, _) =>
+  val tokenProviderEnabledConfig(service) = key
+  val cls = sparkConf.getOption(tokenProviderClsConfig.format(service))
+.orElse(defaultTokenProviders.get(service))
+  (service, cls)
+}.foreach { case (service, cls) =>
+  if (cls.isDefined) {
+try {
+  val tokenProvider =
+
Utils.classForName(cls.get).newInstance().asInstanceOf[ServiceTokenProvider]
+  tokenProviders += (service -> tokenProvider)
+} catch {
+  case NonFatal(e) =>
+logWarning(s"Fail to instantiate class ${cls.get}", e)
+}

[GitHub] spark pull request #14080: [SPARK-16405] Add metrics and source for external...

2016-07-07 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/14080#discussion_r69989978
  
--- Diff: 
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockHandler.java
 ---
@@ -93,18 +113,34 @@ protected void handleMessage(
   client.getClientId(),
   NettyUtils.getRemoteAddress(client.getChannel()));
   callback.onSuccess(new StreamHandle(streamId, 
msg.blockIds.length).toByteBuffer());
+  transferBlockRate.mark(totalBlockSize / 1024 / 1024);
+  responseDelayContext.stop();
 
 } else if (msgObj instanceof RegisterExecutor) {
+  final Timer.Context responseDelayContext = 
timeDelayForRegisterExecutorRequest.time();
   RegisterExecutor msg = (RegisterExecutor) msgObj;
   checkAuth(client, msg.appId);
   blockManager.registerExecutor(msg.appId, msg.execId, 
msg.executorInfo);
   callback.onSuccess(ByteBuffer.wrap(new byte[0]));
+  responseDelayContext.stop();
 
 } else {
   throw new UnsupportedOperationException("Unexpected message: " + 
msgObj);
 }
   }
 
+  public MetricSet getAllMetrics() {
+return metrics;
+  }
+
+  public long getRegisteredExecutorsSize() {
+return blockManager.getRegisteredExecutorsSize();
+  }
+
+  public long getTotalShuffleRequests() {
+return timeDelayForOpenBlockRequest.getCount() + 
timeDelayForOpenBlockRequest.getCount();
--- End diff --

Btw I don't think you need this metric, the client can easily derive it 
from the other metrics.





[GitHub] spark pull request #14065: [SPARK-14743][YARN][WIP] Add a configurable token...

2016-07-07 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14065#discussion_r69989897
  
--- Diff: 
yarn/src/main/scala/org/apache/spark/deploy/yarn/token/ConfigurableTokenManager.scala
 ---
@@ -0,0 +1,214 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.deploy.yarn.token
+
+import scala.collection.mutable
+import scala.util.control.NonFatal
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.security.Credentials
+import org.apache.hadoop.security.token.Token
+
+import org.apache.spark.{SparkConf, SparkException}
+import org.apache.spark.internal.Logging
+import org.apache.spark.util.Utils
+
+/**
+ * A [[ConfigurableTokenManager]] to manage all the token providers 
register in this class. Also
+ * it provides other modules the functionality to obtain tokens, get token 
renewal interval and
+ * calculate the time length till next renewal.
+ *
+ * By default ConfigurableTokenManager has 3 built-in token providers, 
HDFSTokenProvider,
+ * HiveTokenProvider and HBaseTokenProvider, and this 3 token providers 
can also be controlled
+ * by configuration spark.yarn.security.tokens.{service}.enabled, if it is 
set to false, this
+ * provider will not be loaded.
+ *
+ * For other token providers which need to be loaded in should:
+ * 1. Implement [[ServiceTokenProvider]] or [[ServiceTokenRenewable]] if 
token renewal is
+ * required for this service.
+ * 2. set spark.yarn.security.tokens.{service}.enabled to true
+ * 3. Specify the class name through 
spark.yarn.security.tokens.{service}.class
+ *
+ */
+final class ConfigurableTokenManager private[yarn] (sparkConf: SparkConf) 
extends Logging {
+  private val tokenProviderEnabledConfig = 
"spark\\.yarn\\.security\\.tokens\\.(.+)\\.enabled".r
+  private val tokenProviderClsConfig = 
"spark.yarn.security.tokens.%s.class"
+
+  // Maintain all the registered token providers
+  private val tokenProviders = mutable.HashMap[String, 
ServiceTokenProvider]()
+
+  private val defaultTokenProviders = Map(
+"hdfs" -> "org.apache.spark.deploy.yarn.token.HDFSTokenProvider",
+"hive" -> "org.apache.spark.deploy.yarn.token.HiveTokenProvider",
+"hbase" -> "org.apache.spark.deploy.yarn.token.HBaseTokenProvider"
+  )
+
+  // AMDelegationTokenRenewer, this will only be create and started in the 
AM
+  private var _delegationTokenRenewer: AMDelegationTokenRenewer = null
+
+  // ExecutorDelegationTokenUpdater, this will only be created and started 
in the driver and
+  // executor side.
+  private var _delegationTokenUpdater: ExecutorDelegationTokenUpdater = 
null
+
+  def initialize(): Unit = {
+// Copy SparkConf and add default enabled token provider 
configurations to SparkConf.
+val clonedConf = sparkConf.clone
+defaultTokenProviders.keys.foreach { key =>
+  clonedConf.setIfMissing(s"spark.yarn.security.tokens.$key.enabled", 
"true")
+}
+
+// Instantialize all the service token providers according to the 
configurations.
+clonedConf.getAll.filter { case (key, value) =>
+  if (tokenProviderEnabledConfig.findPrefixOf(key).isDefined) {
+value.toBoolean
+  } else {
+false
+  }
+}.map { case (key, _) =>
+  val tokenProviderEnabledConfig(service) = key
+  val cls = sparkConf.getOption(tokenProviderClsConfig.format(service))
+.orElse(defaultTokenProviders.get(service))
+  (service, cls)
+}.foreach { case (service, cls) =>
+  if (cls.isDefined) {
+try {
+  val tokenProvider =
+
Utils.classForName(cls.get).newInstance().asInstanceOf[ServiceTokenProvider]
+  tokenProviders += (service -> tokenProvider)
+} catch {
+  case NonFatal(e) =>
+logWarning(s"Fail to instantiate class ${cls.get}", e)
+}

[GitHub] spark issue #14080: [SPARK-16405] Add metrics and source for external shuffl...

2016-07-07 Thread ericl
Github user ericl commented on the issue:

https://github.com/apache/spark/pull/14080
  
Thanks for adding these metrics. Could you also add some unit tests to 
sanity-check that these metrics are recorded as expected, e.g. as in 
https://github.com/apache/spark/pull/13934/files
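
For illustration, a rough Scala sketch of the kind of check meant here. The metric name used is only an assumption (one of the names suggested elsewhere in this thread), and `handler.getAllMetrics` is the accessor the PR exposes; this is a sketch, not the PR's test code.

```scala
import com.codahale.metrics.{MetricSet, Timer}

object MetricsAssertions {
  // After driving a single request through the handler, the matching timer
  // should have recorded exactly one event. The metric name is illustrative.
  def assertTimedOnce(metrics: MetricSet, name: String): Unit = {
    val timer = metrics.getMetrics.get(name).asInstanceOf[Timer]
    assert(timer.getCount == 1, s"expected one timed '$name' request, got ${timer.getCount}")
  }
  // e.g. assertTimedOnce(handler.getAllMetrics, "openBlockRequestLatencyMillis")
}
```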





[GitHub] spark pull request #13984: [SPARK-16310][SPARKR] R na.string-like default fo...

2016-07-07 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/13984#discussion_r69989660
  
--- Diff: R/pkg/R/SQLContext.R ---
@@ -744,6 +747,9 @@ read.df.default <- function(path = NULL, source = NULL, 
schema = NULL, ...) {
   if (is.null(source)) {
 source <- getDefaultSqlSource()
   }
+  if (source == "csv" && is.null(options[["nullValue"]])) {
--- End diff --

Possibly. I wonder if we should be conservative here - since the data source API is 
extensible, passing a default `nullValue` to a new source could cause an unexpected 
behavior change?






[GitHub] spark pull request #14065: [SPARK-14743][YARN][WIP] Add a configurable token...

2016-07-07 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14065#discussion_r69989547
  
--- Diff: 
yarn/src/main/scala/org/apache/spark/deploy/yarn/token/ConfigurableTokenManager.scala
 ---
@@ -0,0 +1,214 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.deploy.yarn.token
+
+import scala.collection.mutable
+import scala.util.control.NonFatal
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.security.Credentials
+import org.apache.hadoop.security.token.Token
+
+import org.apache.spark.{SparkConf, SparkException}
+import org.apache.spark.internal.Logging
+import org.apache.spark.util.Utils
+
+/**
+ * A [[ConfigurableTokenManager]] to manage all the token providers 
register in this class. Also
+ * it provides other modules the functionality to obtain tokens, get token 
renewal interval and
+ * calculate the time length till next renewal.
+ *
+ * By default ConfigurableTokenManager has 3 built-in token providers, 
HDFSTokenProvider,
+ * HiveTokenProvider and HBaseTokenProvider, and this 3 token providers 
can also be controlled
+ * by configuration spark.yarn.security.tokens.{service}.enabled, if it is 
set to false, this
+ * provider will not be loaded.
+ *
+ * For other token providers which need to be loaded in should:
+ * 1. Implement [[ServiceTokenProvider]] or [[ServiceTokenRenewable]] if 
token renewal is
+ * required for this service.
+ * 2. set spark.yarn.security.tokens.{service}.enabled to true
+ * 3. Specify the class name through 
spark.yarn.security.tokens.{service}.class
+ *
+ */
+final class ConfigurableTokenManager private[yarn] (sparkConf: SparkConf) 
extends Logging {
+  private val tokenProviderEnabledConfig = 
"spark\\.yarn\\.security\\.tokens\\.(.+)\\.enabled".r
+  private val tokenProviderClsConfig = 
"spark.yarn.security.tokens.%s.class"
+
+  // Maintain all the registered token providers
+  private val tokenProviders = mutable.HashMap[String, 
ServiceTokenProvider]()
+
+  private val defaultTokenProviders = Map(
+"hdfs" -> "org.apache.spark.deploy.yarn.token.HDFSTokenProvider",
+"hive" -> "org.apache.spark.deploy.yarn.token.HiveTokenProvider",
+"hbase" -> "org.apache.spark.deploy.yarn.token.HBaseTokenProvider"
+  )
+
+  // AMDelegationTokenRenewer, this will only be create and started in the 
AM
+  private var _delegationTokenRenewer: AMDelegationTokenRenewer = null
+
+  // ExecutorDelegationTokenUpdater, this will only be created and started 
in the driver and
+  // executor side.
+  private var _delegationTokenUpdater: ExecutorDelegationTokenUpdater = 
null
+
+  def initialize(): Unit = {
--- End diff --

A lot of this method would go away by using `java.util.ServiceLoader`.
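
As a hedged sketch of the idea: `ServiceTokenProvider` below is a stand-in trait and its `serviceName` member is an assumption (the diff does not show the trait's members), so this only illustrates how `ServiceLoader` discovery could replace the hand-rolled map and reflection.

```scala
import java.util.ServiceLoader

import scala.collection.JavaConverters._

object ServiceLoaderSketch {
  // Stand-in trait; `serviceName` is assumed for this sketch only.
  trait ServiceTokenProvider {
    def serviceName: String
  }

  // Providers come from META-INF/services entries instead of a hard-coded
  // name -> class-name map plus reflective instantiation.
  def loadProviders(isEnabled: String => Boolean): Map[String, ServiceTokenProvider] =
    ServiceLoader.load(classOf[ServiceTokenProvider]).asScala
      .filter(p => isEnabled(p.serviceName))
      .map(p => p.serviceName -> p)
      .toMap
}
```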





[GitHub] spark pull request #14080: [SPARK-16405] Add metrics and source for external...

2016-07-07 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/14080#discussion_r69989502
  
--- Diff: 
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockHandler.java
 ---
@@ -143,4 +179,26 @@ private void checkAuth(TransportClient client, String 
appId) {
 }
   }
 
+  /**
+   * A simple class to wrap all shuffle service wrapper metrics
+   */
+  private class ShuffleMetrics implements MetricSet {
+private final Map<String, Metric> allMetrics;
+private final Timer timeDelayForOpenBlockRequest = new Timer();
+private final Timer timeDelayForRegisterExecutorRequest = new Timer();
+private final Meter transferBlockRate = new Meter();
--- End diff --

Can you add comments describing the metrics and their units, e.g. bytes, 
milliseconds?

Also consider renaming them for clarity; I think 
`openBlockRequestLatencyMillis`, `registerExecutorRequestLatencyMillis`, and 
`blockTransferRateBytes` would be clearer to the reader.
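
A sketch of what that could look like. The real class is Java; this Scala version only illustrates the suggested names, units, and comments, and is not the PR's code.

```scala
import java.util.{HashMap => JHashMap, Map => JMap}

import com.codahale.metrics.{Meter, Metric, MetricSet, Timer}

class ShuffleMetrics extends MetricSet {
  // Time to process an OpenBlocks request, in milliseconds.
  val openBlockRequestLatencyMillis = new Timer()
  // Time to process a RegisterExecutor request, in milliseconds.
  val registerExecutorRequestLatencyMillis = new Timer()
  // Rate of block data served to clients, in bytes.
  val blockTransferRateBytes = new Meter()

  override def getMetrics: JMap[String, Metric] = {
    val all = new JHashMap[String, Metric]()
    all.put("openBlockRequestLatencyMillis", openBlockRequestLatencyMillis)
    all.put("registerExecutorRequestLatencyMillis", registerExecutorRequestLatencyMillis)
    all.put("blockTransferRateBytes", blockTransferRateBytes)
    all
  }
}
```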





[GitHub] spark pull request #14065: [SPARK-14743][YARN][WIP] Add a configurable token...

2016-07-07 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14065#discussion_r69989212
  
--- Diff: 
yarn/src/main/scala/org/apache/spark/deploy/yarn/token/ConfigurableTokenManager.scala
 ---
@@ -0,0 +1,214 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.deploy.yarn.token
+
+import scala.collection.mutable
+import scala.util.control.NonFatal
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.security.Credentials
+import org.apache.hadoop.security.token.Token
+
+import org.apache.spark.{SparkConf, SparkException}
+import org.apache.spark.internal.Logging
+import org.apache.spark.util.Utils
+
+/**
+ * A [[ConfigurableTokenManager]] to manage all the token providers 
register in this class. Also
+ * it provides other modules the functionality to obtain tokens, get token 
renewal interval and
+ * calculate the time length till next renewal.
+ *
+ * By default ConfigurableTokenManager has 3 built-in token providers, 
HDFSTokenProvider,
+ * HiveTokenProvider and HBaseTokenProvider, and this 3 token providers 
can also be controlled
+ * by configuration spark.yarn.security.tokens.{service}.enabled, if it is 
set to false, this
+ * provider will not be loaded.
+ *
+ * For other token providers which need to be loaded in should:
+ * 1. Implement [[ServiceTokenProvider]] or [[ServiceTokenRenewable]] if 
token renewal is
+ * required for this service.
+ * 2. set spark.yarn.security.tokens.{service}.enabled to true
+ * 3. Specify the class name through 
spark.yarn.security.tokens.{service}.class
+ *
+ */
+final class ConfigurableTokenManager private[yarn] (sparkConf: SparkConf) 
extends Logging {
+  private val tokenProviderEnabledConfig = 
"spark\\.yarn\\.security\\.tokens\\.(.+)\\.enabled".r
+  private val tokenProviderClsConfig = 
"spark.yarn.security.tokens.%s.class"
+
+  // Maintain all the registered token providers
+  private val tokenProviders = mutable.HashMap[String, 
ServiceTokenProvider]()
+
+  private val defaultTokenProviders = Map(
--- End diff --

I'd rather use `java.util.ServiceLoader` for this. You'll need something 
like that at some point anyway, to support other token providers. Doing that 
now has the extra benefit of using the same code for built-in and third party 
providers.
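
With `ServiceLoader`, the built-in providers named in this diff would be registered through a provider-configuration resource instead of the `defaultTokenProviders` map. A sketch, assuming the trait keeps the fully-qualified name shown in the diff:

```
# src/main/resources/META-INF/services/org.apache.spark.deploy.yarn.token.ServiceTokenProvider
org.apache.spark.deploy.yarn.token.HDFSTokenProvider
org.apache.spark.deploy.yarn.token.HiveTokenProvider
org.apache.spark.deploy.yarn.token.HBaseTokenProvider
```

Third-party providers would then ship the same kind of file in their own jar and be picked up by exactly the same loading code.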





[GitHub] spark pull request #13993: [SPARK-16144][SPARKR] update R API doc for mllib

2016-07-07 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/13993#discussion_r69989200
  
--- Diff: R/pkg/R/mllib.R ---
@@ -53,26 +53,27 @@ setClass("AFTSurvivalRegressionModel", 
representation(jobj = "jobj"))
 #' @note KMeansModel since 2.0.0
 setClass("KMeansModel", representation(jobj = "jobj"))
 
-#' Saves the machine learning model to the input path
+#' Saves the MLlib model to the input path
 #'
-#' Saves the machine learning model to the input path. For more 
information, see the specific
-#' machine learning model below.
+#' Saves the MLlib model to the input path. For more information, see the 
specific
+#' MLlib model below.
 #' @rdname write.ml
 #' @name write.ml
 #' @export
-#' @seealso \link{spark.glm}, \link{spark.kmeans}, 
\link{spark.naiveBayes}, \link{spark.survreg}
+#' @seealso \link{spark.glm}, \link{glm}
+#' @seealso \link{spark.kmeans}, \link{spark.naiveBayes}, 
\link{spark.survreg}
 #' @seealso \link{read.ml}
 NULL
 
-#' Predicted values based on a machine learning model
+#' Makes predictions from a MLlib model
 #'
-#' Predicted values based on a machine learning model. For more 
information, see the specific
-#' machine learning model below.
+#' Makes predictions from a MLlib model. For more information, see the 
specific
--- End diff --

Similarly, the plural form is the convention here. Please see e.g. 
https://github.com/apache/spark/pull/13993/files#diff-7ede1519b4a56647801b51af33c2dd18R81





[GitHub] spark pull request #13993: [SPARK-16144][SPARKR] update R API doc for mllib

2016-07-07 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/13993#discussion_r69989003
  
--- Diff: R/pkg/R/mllib.R ---
@@ -53,26 +53,27 @@ setClass("AFTSurvivalRegressionModel", 
representation(jobj = "jobj"))
 #' @note KMeansModel since 2.0.0
 setClass("KMeansModel", representation(jobj = "jobj"))
 
-#' Saves the machine learning model to the input path
+#' Saves the MLlib model to the input path
--- End diff --

I think the suggested convention is that the page title should be the same as the 
first sentence of the description?





[GitHub] spark pull request #14065: [SPARK-14743][YARN][WIP] Add a configurable token...

2016-07-07 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14065#discussion_r69988867
  
--- Diff: 
yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnSparkHadoopUtil.scala ---
@@ -96,237 +87,19 @@ class YarnSparkHadoopUtil extends SparkHadoopUtil {
 if (credentials != null) credentials.getSecretKey(new Text(key)) else 
null
   }
 
-  /**
-   * Get the list of namenodes the user may access.
-   */
-  def getNameNodesToAccess(sparkConf: SparkConf): Set[Path] = {
-sparkConf.get(NAMENODES_TO_ACCESS)
-  .map(new Path(_))
-  .toSet
-  }
-
-  def getTokenRenewer(conf: Configuration): String = {
-val delegTokenRenewer = Master.getMasterPrincipal(conf)
-logDebug("delegation token renewer is: " + delegTokenRenewer)
-if (delegTokenRenewer == null || delegTokenRenewer.length() == 0) {
-  val errorMessage = "Can't get Master Kerberos principal for use as 
renewer"
-  logError(errorMessage)
-  throw new SparkException(errorMessage)
-}
-delegTokenRenewer
-  }
-
-  /**
-   * Obtains tokens for the namenodes passed in and adds them to the 
credentials.
-   */
-  def obtainTokensForNamenodes(
-paths: Set[Path],
-conf: Configuration,
-creds: Credentials,
-renewer: Option[String] = None
-  ): Unit = {
-if (UserGroupInformation.isSecurityEnabled()) {
-  val delegTokenRenewer = renewer.getOrElse(getTokenRenewer(conf))
-  paths.foreach { dst =>
-val dstFs = dst.getFileSystem(conf)
-logInfo("getting token for namenode: " + dst)
-dstFs.addDelegationTokens(delegTokenRenewer, creds)
-  }
-}
-  }
-
-  /**
-   * Obtains token for the Hive metastore and adds them to the credentials.
-   */
-  def obtainTokenForHiveMetastore(
-  sparkConf: SparkConf,
-  conf: Configuration,
-  credentials: Credentials) {
-if (shouldGetTokens(sparkConf, "hive") && 
UserGroupInformation.isSecurityEnabled) {
-  YarnSparkHadoopUtil.get.obtainTokenForHiveMetastore(conf).foreach {
-credentials.addToken(new Text("hive.server2.delegation.token"), _)
-  }
-}
-  }
-
-  /**
-   * Obtain a security token for HBase.
-   */
-  def obtainTokenForHBase(
-  sparkConf: SparkConf,
-  conf: Configuration,
-  credentials: Credentials): Unit = {
-if (shouldGetTokens(sparkConf, "hbase") && 
UserGroupInformation.isSecurityEnabled) {
-  YarnSparkHadoopUtil.get.obtainTokenForHBase(conf).foreach { token =>
-credentials.addToken(token.getService, token)
-logInfo("Added HBase security token to credentials.")
-  }
-}
-  }
-
-  /**
-   * Return whether delegation tokens should be retrieved for the given 
service when security is
-   * enabled. By default, tokens are retrieved, but that behavior can be 
changed by setting
-   * a service-specific configuration.
-   */
-  private def shouldGetTokens(conf: SparkConf, service: String): Boolean = 
{
-conf.getBoolean(s"spark.yarn.security.tokens.${service}.enabled", true)
-  }
-
   private[spark] override def 
startExecutorDelegationTokenRenewer(sparkConf: SparkConf): Unit = {
-tokenRenewer = Some(new ExecutorDelegationTokenUpdater(sparkConf, 
conf))
-tokenRenewer.get.updateCredentialsIfRequired()
+configurableTokenManager(sparkConf).delegationTokenUpdater(conf)
--- End diff --

I find this syntax a little confusing. You're calling 
`configurableTokenManager(sparkConf)` in a bunch of different places. To me 
that looks like either:

- each call is creating a new token manager
- there's some cache of token managers somewhere keyed by the spark 
configuration passed here

Neither sounds good to me. And the actual implementation is actually 
neither: there's a single token manager singleton that is instantiated in the 
first call to `configurableTokenManager`.

Why doesn't `Client` instantiate a token manager in its constructor 
instead? Another option is to have an explicit method in 
`ConfigurableTokenManager` to initialize the singleton, although I'm not a fan 
of singletons in general.
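
A sketch of the first alternative, with stand-in names (this is not the PR's `ConfigurableTokenManager`, only the ownership pattern being suggested):

```scala
import org.apache.spark.SparkConf

// Stand-in for the token manager; the point is only that the instance is owned
// by its caller instead of being reached through a global configurableTokenManager(conf).
class TokenManager(conf: SparkConf) {
  def initialize(): Unit = { /* register providers, etc. */ }
}

class Client(sparkConf: SparkConf) {
  private val tokenManager = new TokenManager(sparkConf) // created once, in the constructor
  tokenManager.initialize()
}
```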





[GitHub] spark pull request #14095: [SPARK-16429][SQL] Include `StringType` columns i...

2016-07-07 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14095#discussion_r69988758
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -228,6 +228,15 @@ class Dataset[T] private[sql](
 }
   }
 
+  private[sql] def aggregatableColumns: Seq[Expression] = {
--- End diff --

`private` rather than `private[sql]`?





[GitHub] spark pull request #14080: [SPARK-16405] Add metrics and source for external...

2016-07-07 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/14080#discussion_r69988593
  
--- Diff: 
core/src/main/scala/org/apache/spark/deploy/ExternalShuffleServiceSource.scala 
---
@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.deploy
+
+import javax.annotation.concurrent.ThreadSafe
+
+import com.codahale.metrics.{Gauge, MetricRegistry}
+
+import org.apache.spark.metrics.source.Source
+import org.apache.spark.network.shuffle.ExternalShuffleBlockHandler
+
+/**
+ * Provides metrics source for external shuffle service
+ */
+@ThreadSafe
+private class ExternalShuffleServiceSource
+(blockHandler: ExternalShuffleBlockHandler) extends Source {
+  override val metricRegistry = new MetricRegistry()
+  override val sourceName = "shuffleService"
+
+  metricRegistry.registerAll(blockHandler.getAllMetrics)
+
+  metricRegistry.register(MetricRegistry.name("registeredExecutorsSize"),
--- End diff --

Rather than creating these metrics externally here, consider putting them 
inside the `metricSet`.
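
A minimal Scala sketch of the idea (the handler is Java in the PR, and `currentExecutorCount` is a hypothetical accessor standing in for however the resolver exposes the size of its executor map):

```scala
import java.util.Collections

import com.codahale.metrics.{Gauge, Metric, MetricSet}

// If the gauge lives in the handler's own MetricSet, the Source only needs a
// single metricRegistry.registerAll(blockHandler.getAllMetrics) call.
class ExecutorCountMetrics(currentExecutorCount: () => Int) extends MetricSet {
  override def getMetrics: java.util.Map[String, Metric] =
    Collections.singletonMap[String, Metric](
      "registeredExecutorsSize",
      new Gauge[Int] { override def getValue: Int = currentExecutorCount() })
}
```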





[GitHub] spark pull request #14080: [SPARK-16405] Add metrics and source for external...

2016-07-07 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/14080#discussion_r69988464
  
--- Diff: 
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockHandler.java
 ---
@@ -93,18 +113,34 @@ protected void handleMessage(
   client.getClientId(),
   NettyUtils.getRemoteAddress(client.getChannel()));
   callback.onSuccess(new StreamHandle(streamId, 
msg.blockIds.length).toByteBuffer());
+  transferBlockRate.mark(totalBlockSize / 1024 / 1024);
+  responseDelayContext.stop();
 
 } else if (msgObj instanceof RegisterExecutor) {
+  final Timer.Context responseDelayContext = 
timeDelayForRegisterExecutorRequest.time();
   RegisterExecutor msg = (RegisterExecutor) msgObj;
   checkAuth(client, msg.appId);
   blockManager.registerExecutor(msg.appId, msg.execId, 
msg.executorInfo);
   callback.onSuccess(ByteBuffer.wrap(new byte[0]));
+  responseDelayContext.stop();
--- End diff --

Consider putting all `stop` calls in a finally block.
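
For instance, a small Scala sketch of the pattern (the handler itself is Java, where this would be a plain try/finally around the message handling; the names in the usage comment are illustrative):

```scala
import com.codahale.metrics.Timer

object TimerHelpers {
  // Times a block of work and always stops the Timer.Context, even if the body throws.
  def timed[T](timer: Timer)(body: => T): T = {
    val ctx = timer.time()
    try body finally ctx.stop()
  }
  // Illustrative use: timed(openBlockRequestLatencyMillis) { handleOpenBlocks(msg, callback) }
}
```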





[GitHub] spark pull request #14080: [SPARK-16405] Add metrics and source for external...

2016-07-07 Thread ericl
Github user ericl commented on a diff in the pull request:

https://github.com/apache/spark/pull/14080#discussion_r69988337
  
--- Diff: 
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockHandler.java
 ---
@@ -64,6 +75,10 @@ public ExternalShuffleBlockHandler(TransportConf conf, 
File registeredExecutorFi
   public ExternalShuffleBlockHandler(
   OneForOneStreamManager streamManager,
   ExternalShuffleBlockResolver blockManager) {
+this.metrics = new ShuffleMetrics();
+this.timeDelayForOpenBlockRequest = 
metrics.timeDelayForOpenBlockRequest;
+this.timeDelayForRegisterExecutorRequest = 
metrics.timeDelayForRegisterExecutorRequest;
--- End diff --

It's a little confusing how this metric is duplicated as a class member. 
Would it work to just reference it through `metrics`?





[GitHub] spark pull request #14065: [SPARK-14743][YARN][WIP] Add a configurable token...

2016-07-07 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/14065#discussion_r69988196
  
--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala ---
@@ -390,8 +390,9 @@ private[spark] class Client(
 // Upload Spark and the application JAR to the remote file system if 
necessary,
 // and add them as local resources to the application master.
 val fs = destDir.getFileSystem(hadoopConf)
-val nns = YarnSparkHadoopUtil.get.getNameNodesToAccess(sparkConf) + 
destDir
-YarnSparkHadoopUtil.get.obtainTokensForNamenodes(nns, hadoopConf, 
credentials)
+hdfsTokenProvider(sparkConf).setNameNodesToAccess(sparkConf, 
Set(destDir))
--- End diff --

+1; it would be better if all interactions with token providers were done 
through the common interface; it seems like these HDFS-specific calls could 
easily be moved to the HDFS token provider.
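
As a hedged sketch of what "only through the common interface" could mean. The trait members below are guesses for illustration, not the PR's actual `ServiceTokenProvider` signature:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.Credentials
import org.apache.spark.SparkConf

// The caller never invokes HDFS-specific setters; each provider derives what it
// needs (e.g. which namenodes to contact) from the configurations it is given.
trait CredentialProviderSketch {
  def serviceName: String
  def obtainTokens(sparkConf: SparkConf, hadoopConf: Configuration, creds: Credentials): Unit
}
```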





[GitHub] spark pull request #13526: [SPARK-15780][SQL] Support mapValues on KeyValueG...

2016-07-07 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/13526#discussion_r69986596
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala 
---
@@ -312,6 +312,17 @@ class DatasetSuite extends QueryTest with 
SharedSQLContext {
   "a", "30", "b", "3", "c", "1")
   }
 
+  test("groupBy function, mapValues, flatMap") {
+val ds = Seq(("a", 10), ("a", 20), ("b", 1), ("b", 2), ("c", 1)).toDS()
--- End diff --

Just `.toDS`? (no brackets)
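
For context, a runnable sketch of the API this test exercises: `mapValues` is the method added by this PR, combined with the existing `reduceGroups`; the expected output is noted in a comment.

```scala
import org.apache.spark.sql.SparkSession

object MapValuesExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("mapValues-demo").getOrCreate()
    import spark.implicits._

    val ds = Seq(("a", 10), ("a", 20), ("b", 1), ("b", 2), ("c", 1)).toDS()
    // Keep the grouping key, map the values, then reduce per key.
    val sums = ds.groupByKey(_._1).mapValues(_._2).reduceGroups(_ + _)
    sums.show() // expected: (a,30), (b,3), (c,1)

    spark.stop()
  }
}
```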





[GitHub] spark pull request #13526: [SPARK-15780][SQL] Support mapValues on KeyValueG...

2016-07-07 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/13526#discussion_r69986532
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala ---
@@ -65,6 +65,46 @@ class KeyValueGroupedDataset[K, V] private[sql](
   groupingAttributes)
 
   /**
+   * Returns a new [[KeyValueGroupedDataset]] where the given function has 
been applied to the
+   * data. The grouping key is unchanged by this.
+   *
+   * {{{
+   *   // Create values grouped by key from a Dataset[(K, V)]
+   *   ds.groupByKey(_._1).mapValues(_._2) // Scala
+   * }}}
+   * @since 2.0.0
+   */
+  def mapValues[W: Encoder](func: V => W): KeyValueGroupedDataset[K, W] = {
+val withNewData = AppendColumns(func, dataAttributes, logicalPlan)
+val projected = Project(withNewData.newColumns ++ groupingAttributes, 
withNewData)
+val executed = sparkSession.sessionState.executePlan(projected)
+
+new KeyValueGroupedDataset(
+  encoderFor[K],
+  encoderFor[W],
+  executed,
+  withNewData.newColumns,
+  groupingAttributes)
+  }
+
+  /**
+   * Returns a new [[KeyValueGroupedDataset]] where the given function has 
been applied to the
+   * data. The grouping key is unchanged by this.
+   *
+   * {{{
+   *   // Create Integer values grouped by String key from a Dataset<Tuple2<String, Integer>>
+   *   Dataset<Tuple2<String, Integer>> ds = ...;
+   *   KeyValueGroupedDataset<String, Integer> grouped =
+   *     ds.groupByKey(t -> t._1, Encoders.STRING()).mapValues(t -> t._2, Encoders.INT()); // Java 8
+   * }}}
+   * @since 2.0.0
+   */
+  def mapValues[W](func: MapFunction[V, W], encoder: Encoder[W]): 
KeyValueGroupedDataset[K, W] = {
+implicit val uEnc = encoder
+mapValues{ (v: V) => func.call(v) }
--- End diff --

A space before `{`?





[GitHub] spark pull request #13526: [SPARK-15780][SQL] Support mapValues on KeyValueG...

2016-07-07 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/13526#discussion_r69986479
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala ---
@@ -65,6 +65,46 @@ class KeyValueGroupedDataset[K, V] private[sql](
   groupingAttributes)
 
   /**
+   * Returns a new [[KeyValueGroupedDataset]] where the given function has 
been applied to the
+   * data. The grouping key is unchanged by this.
+   *
+   * {{{
+   *   // Create values grouped by key from a Dataset[(K, V)]
+   *   ds.groupByKey(_._1).mapValues(_._2) // Scala
+   * }}}
+   * @since 2.0.0
+   */
+  def mapValues[W: Encoder](func: V => W): KeyValueGroupedDataset[K, W] = {
+val withNewData = AppendColumns(func, dataAttributes, logicalPlan)
+val projected = Project(withNewData.newColumns ++ groupingAttributes, 
withNewData)
+val executed = sparkSession.sessionState.executePlan(projected)
+
+new KeyValueGroupedDataset(
+  encoderFor[K],
+  encoderFor[W],
+  executed,
+  withNewData.newColumns,
+  groupingAttributes)
+  }
+
+  /**
+   * Returns a new [[KeyValueGroupedDataset]] where the given function has 
been applied to the
+   * data. The grouping key is unchanged by this.
+   *
+   * {{{
+   *   // Create Integer values grouped by String key from a Dataset<Tuple2<String, Integer>>
+   *   Dataset<Tuple2<String, Integer>> ds = ...;
+   *   KeyValueGroupedDataset<String, Integer> grouped =
+   *     ds.groupByKey(t -> t._1, Encoders.STRING()).mapValues(t -> t._2, Encoders.INT()); // Java 8
+   * }}}
+   * @since 2.0.0
--- End diff --

A new line before `@since`?





[GitHub] spark pull request #13526: [SPARK-15780][SQL] Support mapValues on KeyValueG...

2016-07-07 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/13526#discussion_r69986420
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala ---
@@ -65,6 +65,46 @@ class KeyValueGroupedDataset[K, V] private[sql](
   groupingAttributes)
 
   /**
+   * Returns a new [[KeyValueGroupedDataset]] where the given function has 
been applied to the
+   * data. The grouping key is unchanged by this.
+   *
+   * {{{
+   *   // Create values grouped by key from a Dataset[(K, V)]
+   *   ds.groupByKey(_._1).mapValues(_._2) // Scala
+   * }}}
+   * @since 2.0.0
+   */
+  def mapValues[W: Encoder](func: V => W): KeyValueGroupedDataset[K, W] = {
+val withNewData = AppendColumns(func, dataAttributes, logicalPlan)
+val projected = Project(withNewData.newColumns ++ groupingAttributes, 
withNewData)
+val executed = sparkSession.sessionState.executePlan(projected)
+
+new KeyValueGroupedDataset(
+  encoderFor[K],
+  encoderFor[W],
+  executed,
+  withNewData.newColumns,
+  groupingAttributes)
+  }
+
+  /**
+   * Returns a new [[KeyValueGroupedDataset]] where the given function has 
been applied to the
--- End diff --

...with the given function `func` applied to the data?





[GitHub] spark pull request #13526: [SPARK-15780][SQL] Support mapValues on KeyValueG...

2016-07-07 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/13526#discussion_r69986245
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala ---
@@ -65,6 +65,46 @@ class KeyValueGroupedDataset[K, V] private[sql](
   groupingAttributes)
 
   /**
+   * Returns a new [[KeyValueGroupedDataset]] where the given function has 
been applied to the
+   * data. The grouping key is unchanged by this.
+   *
+   * {{{
+   *   // Create values grouped by key from a Dataset[(K, V)]
+   *   ds.groupByKey(_._1).mapValues(_._2) // Scala
+   * }}}
+   * @since 2.0.0
+   */
+  def mapValues[W: Encoder](func: V => W): KeyValueGroupedDataset[K, W] = {
--- End diff --

...while here, in `W: Encoder`, the space appears only after the `:`. Why the inconsistency?





[GitHub] spark issue #14094: [SPARK-16430][SQL][STREAMING] Add option maxFilesPerTrig...

2016-07-07 Thread marmbrus
Github user marmbrus commented on the issue:

https://github.com/apache/spark/pull/14094
  
LGTM





[GitHub] spark pull request #14094: [SPARK-16430][SQL][STREAMING] Add option maxFiles...

2016-07-07 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/14094#discussion_r69986165
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
 ---
@@ -45,6 +47,7 @@ class FileStreamSource(
   private val qualifiedBasePath = fs.makeQualified(new Path(path)) // can 
contains glob patterns
   private val metadataLog = new HDFSMetadataLog[Seq[String]](sparkSession, 
metadataPath)
   private var maxBatchId = metadataLog.getLatest().map(_._1).getOrElse(-1L)
+  private val maxFilesPerBatch = getMaxFilesPerBatch()
--- End diff --

Maybe some scaladoc here about what this parameter does / its purpose.
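
For example, a small usage sketch of the option this field backs (the option name comes from the PR title; the value and path are arbitrary):

```scala
import org.apache.spark.sql.SparkSession

object MaxFilesPerTriggerExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("maxFiles-demo").getOrCreate()

    // Cap how many new files the file source picks up per micro-batch.
    val lines = spark.readStream
      .format("text")
      .option("maxFilesPerTrigger", "10")
      .load("/tmp/stream-input")

    val query = lines.writeStream.format("console").start()
    query.awaitTermination()
  }
}
```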





[GitHub] spark pull request #13526: [SPARK-15780][SQL] Support mapValues on KeyValueG...

2016-07-07 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/13526#discussion_r69986179
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala
 ---
@@ -175,6 +175,17 @@ object AppendColumns {
   encoderFor[U].namedExpressions,
   child)
   }
+
+  def apply[T : Encoder, U : Encoder](
--- End diff --

Here you use `T : Encoder`, i.e. with spaces before and after `:` while...





[GitHub] spark pull request #14094: [SPARK-16430][SQL][STREAMING] Add option maxFiles...

2016-07-07 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/14094#discussion_r69985831
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
 ---
@@ -26,6 +27,7 @@ import org.apache.spark.internal.Logging
 import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
 import org.apache.spark.sql.execution.datasources.{CaseInsensitiveMap, 
DataSource, ListingFileCatalog, LogicalRelation}
 import org.apache.spark.sql.types.StructType
+import org.apache.spark.util.Utils
--- End diff --

Where is this used?





[GitHub] spark pull request #13980: [SPARK-16198] [MLlib] [ML] Change access level of...

2016-07-07 Thread husseinhazimeh
Github user husseinhazimeh closed the pull request at:

https://github.com/apache/spark/pull/13980





[GitHub] spark issue #14095: [SPARK-16429][SQL] Include `StringType` columns in Scala...

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14095
  
**[Test build #61929 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61929/consoleFull)**
 for PR 14095 at commit 
[`df2edd7`](https://github.com/apache/spark/commit/df2edd730216e659dbcebdcbda61dd67fbcf8d55).





[GitHub] spark pull request #14087: [SPARK-16411][SQL][STREAMING] Add textFile to Str...

2016-07-07 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/14087#discussion_r69985379
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala
 ---
@@ -331,6 +331,24 @@ class FileStreamSourceSuite extends 
FileStreamSourceTest {
 }
   }
 
+  test("read from textfile") {
+withTempDirs { case (src, tmp) =>
+  val textStream = spark.readStream.textFile(src.getCanonicalPath)
+  val filtered = textStream.filter($"value" contains "keep")
+
+  testStream(filtered)(
+AddTextFileData("drop1\nkeep2\nkeep3", src, tmp),
+CheckAnswer("keep2", "keep3"),
+StopStream,
+AddTextFileData("drop4\nkeep5\nkeep6", src, tmp),
+StartStream(),
--- End diff --

Just wondering why the `()` is needed here but not for `StopStream`?





[GitHub] spark pull request #14087: [SPARK-16411][SQL][STREAMING] Add textFile to Str...

2016-07-07 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/14087#discussion_r69985100
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala 
---
@@ -281,6 +281,31 @@ final class DataStreamReader 
private[sql](sparkSession: SparkSession) extends Lo
   @Experimental
   def text(path: String): DataFrame = format("text").load(path)
 
+  /**
+   * Loads text files and returns a [[Dataset]] of String. The underlying 
schema of the Dataset
+   * contains a single string column named "value".
+   *
+   * If the directory structure of the text files contains partitioning 
information, those are
+   * ignored in the resulting Dataset. To include partitioning information 
as columns, use `text`.
+   *
+   * Each line in the text files is a new element in the resulting 
Dataset. For example:
+   * {{{
+   *   // Scala:
+   *   spark.read.textFile("/path/to/spark/README.md")
+   *
+   *   // Java:
+   *   spark.read().textFile("/path/to/spark/README.md")
+   * }}}
+   *
+   * @param path input path
+   * @since 2.0.0
+   */
+  def textFile(path: String): Dataset[String] = {
+if (userSpecifiedSchema.nonEmpty) {
+  throw new AnalysisException("User specified schema not supported 
with `textFile`")
+}
+
text(path).select("value").as[String](sparkSession.implicits.newStringEncoder)
--- End diff --

I'm surprised that `sparkSession.implicits.newStringEncoder` is required 
here. Why isn't `sparkSession.implicits._` imported instead?
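
For comparison, a small batch-side sketch of what the implicit import buys: with `import spark.implicits._` in scope, `.as[String]` resolves its encoder implicitly, so no explicit `newStringEncoder` is needed. The file path is a placeholder.

```scala
import org.apache.spark.sql.SparkSession

object AsStringExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("as-string-demo").getOrCreate()
    import spark.implicits._ // supplies the String encoder for .as[String]

    val lines = spark.read.text("README.md").select("value").as[String]
    lines.show(5, truncate = false)

    spark.stop()
  }
}
```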





[GitHub] spark pull request #14087: [SPARK-16411][SQL][STREAMING] Add textFile to Str...

2016-07-07 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/14087#discussion_r69985212
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala 
---
@@ -281,6 +281,31 @@ final class DataStreamReader 
private[sql](sparkSession: SparkSession) extends Lo
   @Experimental
   def text(path: String): DataFrame = format("text").load(path)
 
+  /**
+   * Loads text files and returns a [[Dataset]] of String. The underlying 
schema of the Dataset
+   * contains a single string column named "value".
+   *
+   * If the directory structure of the text files contains partitioning 
information, those are
+   * ignored in the resulting Dataset. To include partitioning information 
as columns, use `text`.
+   *
+   * Each line in the text files is a new element in the resulting 
Dataset. For example:
+   * {{{
+   *   // Scala:
+   *   spark.read.textFile("/path/to/spark/README.md")
+   *
+   *   // Java:
+   *   spark.read().textFile("/path/to/spark/README.md")
--- End diff --

s/read/readStream?
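
A corrected usage sketch with `readStream` (assuming the `textFile` method this PR adds to `DataStreamReader`; the path and filter are placeholders):

```scala
import org.apache.spark.sql.SparkSession

object StreamingTextFileExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("textFile-stream-demo").getOrCreate()

    // readStream, not read: this scaladoc lives on DataStreamReader.
    val lines = spark.readStream.textFile("/tmp/stream-input") // Dataset[String]
    val kept = lines.filter(_.contains("keep"))

    val query = kept.writeStream.format("console").start()
    query.awaitTermination()
  }
}
```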





[GitHub] spark pull request #14087: [SPARK-16411][SQL][STREAMING] Add textFile to Str...

2016-07-07 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/14087#discussion_r69985195
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala 
---
@@ -281,6 +281,31 @@ final class DataStreamReader 
private[sql](sparkSession: SparkSession) extends Lo
   @Experimental
   def text(path: String): DataFrame = format("text").load(path)
 
+  /**
+   * Loads text files and returns a [[Dataset]] of String. The underlying 
schema of the Dataset
+   * contains a single string column named "value".
+   *
+   * If the directory structure of the text files contains partitioning 
information, those are
+   * ignored in the resulting Dataset. To include partitioning information 
as columns, use `text`.
+   *
+   * Each line in the text files is a new element in the resulting 
Dataset. For example:
+   * {{{
+   *   // Scala:
+   *   spark.read.textFile("/path/to/spark/README.md")
--- End diff --

s/read/readStream?





[GitHub] spark pull request #14087: [SPARK-16411][SQL][STREAMING] Add textFile to Str...

2016-07-07 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/14087#discussion_r69984805
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala 
---
@@ -281,6 +281,31 @@ final class DataStreamReader 
private[sql](sparkSession: SparkSession) extends Lo
   @Experimental
   def text(path: String): DataFrame = format("text").load(path)
 
+  /**
+   * Loads text files and returns a [[Dataset]] of String. The underlying 
schema of the Dataset
+   * contains a single string column named "value".
+   *
+   * If the directory structure of the text files contains partitioning 
information, those are
+   * ignored in the resulting Dataset. To include partitioning information 
as columns, use `text`.
+   *
+   * Each line in the text files is a new element in the resulting 
Dataset. For example:
+   * {{{
+   *   // Scala:
+   *   spark.read.textFile("/path/to/spark/README.md")
+   *
+   *   // Java:
+   *   spark.read().textFile("/path/to/spark/README.md")
+   * }}}
+   *
+   * @param path input path
+   * @since 2.0.0
+   */
+  def textFile(path: String): Dataset[String] = {
+if (userSpecifiedSchema.nonEmpty) {
+  throw new AnalysisException("User specified schema not supported 
with `textFile`")
--- End diff --

user-specified





[GitHub] spark pull request #14087: [SPARK-16411][SQL][STREAMING] Add textFile to Str...

2016-07-07 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/14087#discussion_r69984678
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala 
---
@@ -281,6 +281,31 @@ final class DataStreamReader 
private[sql](sparkSession: SparkSession) extends Lo
   @Experimental
   def text(path: String): DataFrame = format("text").load(path)
 
+  /**
+   * Loads text files and returns a [[Dataset]] of String. The underlying 
schema of the Dataset
+   * contains a single string column named "value".
+   *
+   * If the directory structure of the text files contains partitioning 
information, those are
+   * ignored in the resulting Dataset. To include partitioning information 
as columns, use `text`.
+   *
+   * Each line in the text files is a new element in the resulting 
Dataset. For example:
--- End diff --

s/element/record?





[GitHub] spark pull request #14095: [SPARK-16429][SQL] Include `StringType` columns i...

2016-07-07 Thread dongjoon-hyun
GitHub user dongjoon-hyun opened a pull request:

https://github.com/apache/spark/pull/14095

[SPARK-16429][SQL] Include `StringType` columns in Scala/Python `describe()`

## What changes were proposed in this pull request?

Currently, Spark's `describe` supports `StringType`. However, the Scala/Python 
`describe()` returns a dataset covering only the numeric columns, while SparkR returns all 
columns. This PR includes `StringType` columns in the Scala/Python `describe()` when it is 
called without arguments.

**Before**
* Scala
 ```scala
scala> spark.read.json("examples/src/main/resources/people.json").describe("age", "name").show()
+-------+------------------+-------+
|summary|               age|   name|
+-------+------------------+-------+
|  count|                 2|      3|
|   mean|              24.5|   null|
| stddev|7.7781745930520225|   null|
|    min|                19|   Andy|
|    max|                30|Michael|
+-------+------------------+-------+

scala> spark.read.json("examples/src/main/resources/people.json").describe().show()
+-------+------------------+
|summary|               age|
+-------+------------------+
|  count|                 2|
|   mean|              24.5|
| stddev|7.7781745930520225|
|    min|                19|
|    max|                30|
+-------+------------------+
```

* Python
 ```
>>> spark.read.json("examples/src/main/resources/people.json").describe().show()
+-------+------------------+
|summary|               age|
+-------+------------------+
|  count|                 2|
|   mean|              24.5|
| stddev|7.7781745930520225|
|    min|                19|
|    max|                30|
+-------+------------------+
```

* R
 ```r
> collect(describe(read.json("examples/src/main/resources/people.json")))
  summary                age    name
1   count                  2       3
2    mean               24.5
3  stddev 7.7781745930520225
4     min                 19    Andy
5     max                 30 Michael
```

**After**
* Scala
 ```scala
scala> spark.read.json("examples/src/main/resources/people.json").describe().show()
+-------+------------------+-------+
|summary|               age|   name|
+-------+------------------+-------+
|  count|                 2|      3|
|   mean|              24.5|   null|
| stddev|7.7781745930520225|   null|
|    min|                19|   Andy|
|    max|                30|Michael|
+-------+------------------+-------+
```

* Python
 ```
>>> spark.read.json("examples/src/main/resources/people.json").describe().show()
+-------+------------------+-------+
|summary|               age|   name|
+-------+------------------+-------+
|  count|                 2|      3|
|   mean|              24.5|   null|
| stddev|7.7781745930520225|   null|
|    min|                19|   Andy|
|    max|                30|Michael|
+-------+------------------+-------+
```

* R
SparkR is the same.

## How was this patch tested?

Pass the Jenkins tests with an updated test case.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dongjoon-hyun/spark SPARK-16429

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14095.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14095


commit df2edd730216e659dbcebdcbda61dd67fbcf8d55
Author: Dongjoon Hyun 
Date:   2016-07-07T20:45:26Z

[SPARK-16429][SQL] Include `StringType` columns in Scala/Python `describe()`







[GitHub] spark pull request #14083: [SPARK-16406][SQL] Improve performance of Logical...

2016-07-07 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/14083#discussion_r69984539
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
 ---
@@ -165,111 +169,99 @@ abstract class LogicalPlan extends 
QueryPlan[LogicalPlan] with Logging {
   def resolveQuoted(
   name: String,
   resolver: Resolver): Option[NamedExpression] = {
-resolve(UnresolvedAttribute.parseAttributeName(name), output, resolver)
+
outputAttributeResolver.resolve(UnresolvedAttribute.parseAttributeName(name), 
resolver)
   }
 
   /**
-   * Resolve the given `name` string against the given attribute, 
returning either 0 or 1 match.
-   *
-   * This assumes `name` has multiple parts, where the 1st part is a 
qualifier
-   * (i.e. table name, alias, or subquery alias).
-   * See the comment above `candidates` variable in resolve() for 
semantics the returned data.
+   * Refreshes (or invalidates) any metadata/data cached in the plan 
recursively.
*/
-  private def resolveAsTableColumn(
-  nameParts: Seq[String],
-  resolver: Resolver,
-  attribute: Attribute): Option[(Attribute, List[String])] = {
-assert(nameParts.length > 1)
-if (attribute.qualifier.exists(resolver(_, nameParts.head))) {
-  // At least one qualifier matches. See if remaining parts match.
-  val remainingParts = nameParts.tail
-  resolveAsColumn(remainingParts, resolver, attribute)
-} else {
-  None
-}
+  def refresh(): Unit = children.foreach(_.refresh())
+}
+
+/**
+ * Helper class for (LogicalPlan) attribute resolution. This class indexes 
attributes by their
+ * case-in-sensitive name, and checks potential candidates using the given 
Resolver. Both qualified
--- End diff --

The `resolve` method takes a `Resolver` as its parameter. This allows us 
to use either case-sensitive or case-insensitive attribute resolution, 
depending on the `Resolver` passed in. The names of both classes are 
confusing, and I might rename the `AttributeResolver` class to 
`AttributeIndex` or something like that...

The `AttributeResolver` builds two indexes keyed by the lower-cased 
(qualified) attribute name; we do an initial lookup on the lower-cased name, 
and then use the `Resolver` for the actual attribute selection. This gives 
us fast(er) and still correct lookups.
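
A minimal sketch of the lookup idea described above, assuming hypothetical 
names (`AttributeIndex`, `byLowerCaseName`) and a single index rather than 
the two indexes the PR actually builds:

```scala
import org.apache.spark.sql.catalyst.analysis.Resolver
import org.apache.spark.sql.catalyst.expressions.Attribute

// Illustrative sketch only: index attributes by their lower-cased name once,
// then let the caller-supplied Resolver decide which candidates really match.
class AttributeIndex(attributes: Seq[Attribute]) {
  private val byLowerCaseName: Map[String, Seq[Attribute]] =
    attributes.groupBy(_.name.toLowerCase)

  def resolve(name: String, resolver: Resolver): Option[Attribute] = {
    // Cheap pre-filter via the lower-case index...
    val candidates = byLowerCaseName.getOrElse(name.toLowerCase, Nil)
    // ...then the exact (case-sensitive or case-insensitive) check.
    candidates.find(a => resolver(a.name, name))
  }
}
```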




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14094: [SPARK-16430][SQL][STREAMING] Add option maxFilesPerTrig...

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14094
  
**[Test build #61928 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61928/consoleFull)**
 for PR 14094 at commit 
[`c591007`](https://github.com/apache/spark/commit/c591007452f2fe3b08f99db64a94d88384a9b101).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14087: [SPARK-16411][SQL][STREAMING] Add textFile to Str...

2016-07-07 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/14087#discussion_r69984584
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala 
---
@@ -281,6 +281,31 @@ final class DataStreamReader 
private[sql](sparkSession: SparkSession) extends Lo
   @Experimental
   def text(path: String): DataFrame = format("text").load(path)
 
+  /**
+   * Loads text files and returns a [[Dataset]] of String. The underlying 
schema of the Dataset
--- End diff --

a text file?
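
For context, a minimal usage sketch of the `textFile` variant being proposed 
here (assuming it mirrors the batch `DataFrameReader.textFile` and that a 
SparkSession named `spark` already exists):

```scala
import spark.implicits._  // encoders for Dataset[String] operations

// Stream a directory of text files as a Dataset[String],
// instead of the single-column DataFrame returned by text().
val lines = spark.readStream.textFile("/path/to/streaming/dir")  // hypothetical path
val words = lines.flatMap(_.split(" "))
```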


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14083: [SPARK-16406][SQL] Improve performance of LogicalPlan.re...

2016-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14083
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61923/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14083: [SPARK-16406][SQL] Improve performance of LogicalPlan.re...

2016-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14083
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14083: [SPARK-16406][SQL] Improve performance of LogicalPlan.re...

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14083
  
**[Test build #61923 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61923/consoleFull)**
 for PR 14083 at commit 
[`c75ae8d`](https://github.com/apache/spark/commit/c75ae8d892ec46a18342235c39c7002402740b7d).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14094: [SPARK-16430][SQL][STREAMING] Add option maxFilesPerTrig...

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14094
  
**[Test build #61927 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61927/consoleFull)**
 for PR 14094 at commit 
[`ddd9426`](https://github.com/apache/spark/commit/ddd9426281e743af205f2a3f56be3535cd584b2d).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14094: [SPARK-16430][SQL][STREAMING] Add option maxFilesPerTrig...

2016-07-07 Thread tdas
Github user tdas commented on the issue:

https://github.com/apache/spark/pull/14094
  
@marmbrus @zsxwing 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14094: [SPARK-16430][SQL][STREAMING] Add option maxFiles...

2016-07-07 Thread tdas
GitHub user tdas opened a pull request:

https://github.com/apache/spark/pull/14094

[SPARK-16430][SQL][STREAMING] Add option maxFilesPerTrigger

## What changes were proposed in this pull request?

An option that limits the file stream source to reading one file at a time 
enables rate limiting. It has the additional convenience that a static set of 
files can be used like a stream for testing, as this allows those files to 
be considered one at a time.

This PR adds option `maxFilesPerTrigger`.
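
A minimal usage sketch of the new option, assuming a SparkSession named 
`spark` and an illustrative schema/path (not taken from the PR itself):

```scala
import org.apache.spark.sql.types.{StringType, StructType}

// File sources require a user-specified schema when streaming.
val schema = new StructType().add("value", StringType)

val stream = spark.readStream
  .format("json")
  .schema(schema)
  .option("maxFilesPerTrigger", "1")  // pick up at most one new file per trigger
  .load("/path/to/input/dir")         // hypothetical input directory
```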

## How was this patch tested?

New unit test




You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tdas/spark SPARK-16430

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14094.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14094


commit ddd9426281e743af205f2a3f56be3535cd584b2d
Author: Tathagata Das 
Date:   2016-07-07T20:45:38Z

Add option maxFilesPerTrigger




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14083: [SPARK-16406][SQL] Improve performance of Logical...

2016-07-07 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/14083#discussion_r69982767
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
 ---
@@ -165,111 +169,99 @@ abstract class LogicalPlan extends 
QueryPlan[LogicalPlan] with Logging {
   def resolveQuoted(
   name: String,
   resolver: Resolver): Option[NamedExpression] = {
-resolve(UnresolvedAttribute.parseAttributeName(name), output, resolver)
+outputAttributeResolver.resolve(UnresolvedAttribute.parseAttributeName(name), resolver)
   }
 
   /**
-   * Resolve the given `name` string against the given attribute, 
returning either 0 or 1 match.
-   *
-   * This assumes `name` has multiple parts, where the 1st part is a 
qualifier
-   * (i.e. table name, alias, or subquery alias).
-   * See the comment above `candidates` variable in resolve() for 
semantics the returned data.
+   * Refreshes (or invalidates) any metadata/data cached in the plan 
recursively.
*/
-  private def resolveAsTableColumn(
-  nameParts: Seq[String],
-  resolver: Resolver,
-  attribute: Attribute): Option[(Attribute, List[String])] = {
-assert(nameParts.length > 1)
-if (attribute.qualifier.exists(resolver(_, nameParts.head))) {
-  // At least one qualifier matches. See if remaining parts match.
-  val remainingParts = nameParts.tail
-  resolveAsColumn(remainingParts, resolver, attribute)
-} else {
-  None
-}
+  def refresh(): Unit = children.foreach(_.refresh())
+}
+
+/**
+ * Helper class for (LogicalPlan) attribute resolution. This class indexes 
attributes by their
+ * case-in-sensitive name, and checks potential candidates using the given 
Resolver. Both qualified
--- End diff --

case-insensitive? When you say "the given Resolver", what do you mean by 
"Resolver"? Can we link to the type?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14080: [SPARK-16405] Add metrics and source for external...

2016-07-07 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/14080#discussion_r69981791
  
--- Diff: 
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockHandler.java
 ---
@@ -143,4 +179,26 @@ private void checkAuth(TransportClient client, String 
appId) {
 }
   }
 
+  /**
+   * A simple class to wrap all shuffle service wrapper metrics
+   */
+  private class ShuffleMetrics implements MetricSet {
+private final Map allMetrics;
+private final Timer timeDelayForOpenBlockRequest = new Timer();
+private final Timer timeDelayForRegisterExecutorRequest = new Timer();
+private final Meter transferBlockRate = new Meter();
+
+private ShuffleMetrics() {
+  allMetrics = new HashMap<>();
--- End diff --

Will it work with Java 7? I think Spark 2.0 will keep supporting that 
version.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14093: SPARK-16420: Ensure compression streams are closed.

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14093
  
**[Test build #61926 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61926/consoleFull)**
 for PR 14093 at commit 
[`601f934`](https://github.com/apache/spark/commit/601f934372922b3b68424d3ef5a3cc81fd0f4500).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14088: [SPARK-16414] [YARN] Fix bugs for "Can not get user conf...

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14088
  
**[Test build #61925 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61925/consoleFull)**
 for PR 14088 at commit 
[`55e66b2`](https://github.com/apache/spark/commit/55e66b21cdcd68861db0f1045186048c54b13153).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14088: [SPARK-16414] [YARN] Fix bugs for "Can not get user conf...

2016-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14088
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14088: [SPARK-16414] [YARN] Fix bugs for "Can not get user conf...

2016-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14088
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61925/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14088: [SPARK-16414] [YARN] Fix bugs for "Can not get user conf...

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14088
  
**[Test build #61925 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61925/consoleFull)**
 for PR 14088 at commit 
[`55e66b2`](https://github.com/apache/spark/commit/55e66b21cdcd68861db0f1045186048c54b13153).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14088: [SPARK-16414] [YARN] Fix bugs for "Can not get user conf...

2016-07-07 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/14088
  
ok to test. shouldn't be hard to add a unit test.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13876: [SPARK-16174][SQL] Improve `OptimizeIn` optimizer to rem...

2016-07-07 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/13876
  
Thank you for review and merging, @cloud-fan and @rxin .


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14092: [SPARK-16419][SQL] EnsureRequirements adds extra Sort to...

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14092
  
**[Test build #3169 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3169/consoleFull)**
 for PR 14092 at commit 
[`b4b02bf`](https://github.com/apache/spark/commit/b4b02bf3879daf9a4532b61a019ea33b0f3ff835).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14092: [SPARK-16419][SQL] EnsureRequirements adds extra Sort to...

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14092
  
**[Test build #3168 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3168/consoleFull)**
 for PR 14092 at commit 
[`b4b02bf`](https://github.com/apache/spark/commit/b4b02bf3879daf9a4532b61a019ea33b0f3ff835).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14022: [SPARK-16272][core] Allow config values to reference con...

2016-07-07 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14022
  
**[Test build #61924 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61924/consoleFull)**
 for PR 14022 at commit 
[`392bddc`](https://github.com/apache/spark/commit/392bddc57eaefb09c73902ea041f05705d9498aa).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14004: [SPARK-16285][SQL] Implement sentences SQL functions

2016-07-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14004
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61920/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


