[GitHub] spark issue #17912: [SPARK-20670] [ML] Simplify FPGrowth transform

2017-05-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17912
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76635/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17912: [SPARK-20670] [ML] Simplify FPGrowth transform

2017-05-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17912
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17912: [SPARK-20670] [ML] Simplify FPGrowth transform

2017-05-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17912
  
**[Test build #76635 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76635/testReport)**
 for PR 17912 at commit 
[`b9e3e47`](https://github.com/apache/spark/commit/b9e3e47706af2b9b09fa73101487d31a00779dc3).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17885: [SPARK-20627][PYSPARK] Drop the hadoop distirbution name...

2017-05-08 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17885
  
Could you post the changes you made in the PR description and explain why 
it resolves PEP-0440? It might help more people understand the impacts of this 
PR by reading the PR description. Thanks!





[GitHub] spark issue #17910: [SPARK-20669][ML] LogisticRegression family should be ca...

2017-05-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17910
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17910: [SPARK-20669][ML] LogisticRegression family should be ca...

2017-05-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17910
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76633/
Test PASSed.





[GitHub] spark issue #17910: [SPARK-20669][ML] LogisticRegression family should be ca...

2017-05-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17910
  
**[Test build #76633 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76633/testReport)**
 for PR 17910 at commit 
[`33c0f9e`](https://github.com/apache/spark/commit/33c0f9e52c239a6067a535be9c0ce19772d32aef).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #17910: [SPARK-20669][ML] LogisticRegression family shoul...

2017-05-08 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request:

https://github.com/apache/spark/pull/17910#discussion_r115418289
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -526,7 +526,7 @@ class LogisticRegression @Since("1.2.0") (
   case None => histogram.length
 }
 
-val isMultinomial = $(family) match {
+val isMultinomial = $(family).toLowerCase(Locale.ROOT) match {
--- End diff --

I followed the style in `GeneralizedLinearRegression`.
Lowercasing the param in the setter would simplify the code, but it would also change the 
output of the corresponding getter. What is your opinion, @yanboliang?





[GitHub] spark pull request #17910: [SPARK-20669][ML] LogisticRegression family shoul...

2017-05-08 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request:

https://github.com/apache/spark/pull/17910#discussion_r115418315
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -890,7 +890,7 @@ object LogisticRegression extends DefaultParamsReadable[LogisticRegression] {
   override def load(path: String): LogisticRegression = super.load(path)
 
   private[classification] val supportedFamilyNames =
-Array("auto", "binomial", "multinomial").map(_.toLowerCase(Locale.ROOT))
+Array("auto", "binomial", "multinomial")
--- End diff --

I am not sure about this. If we keep `toLowerCase` here, we may also need to 
do the same in `GeneralizedLinearRegression` and others.





[GitHub] spark pull request #17869: [SPARK-20609][CORE]Run the SortShuffleSuite unit ...

2017-05-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17869#discussion_r115417588
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/recommendation/ALSSuite.scala ---
@@ -774,6 +774,7 @@ class ALSCleanerSuite extends SparkFunSuite {
 } finally {
   Utils.deleteRecursively(localDir)
   Utils.deleteRecursively(checkpointDir)
+  Utils.clearLocalRootDirs()
--- End diff --

Could we likewise add this to the before/after-each hooks?





[GitHub] spark issue #17885: [SPARK-20627][PYSPARK] Drop the hadoop distirbution name...

2017-05-08 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17885
  
Are you referring to https://www.python.org/dev/peps/pep-0440/ ?





[GitHub] spark pull request #17896: [SPARK-20373][SQL][SS] Batch queries with 'Datase...

2017-05-08 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/17896#discussion_r115418094
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -2457,6 +2457,19 @@ object CleanupAliases extends Rule[LogicalPlan] {
 }
 
 /**
+ * Ignore event time watermark in batch query, which is only supported in Structured Streaming.
+ * TODO: add this rule into analyzer rule list.
+ */
+object CheckEventTimeWatermark extends Rule[LogicalPlan] {
--- End diff --

I see. The current approach is good to me then. Could you rename it, please?
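
The rule quoted in the diff can be sketched outside Spark with toy plan nodes. This is only a sketch: `Plan`, `Relation`, `Filter`, and the function name `eliminateBatchWatermark` are hypothetical and are not Spark's Catalyst classes, and it assumes the rule replaces a batch watermark node with its child, which the quoted diff does not show.

```scala
// Toy plan nodes for illustration only; these are NOT Spark's Catalyst classes.
object WatermarkRuleSketch {
  sealed trait Plan { def isStreaming: Boolean }

  final case class Relation(name: String, isStreaming: Boolean) extends Plan

  final case class Filter(condition: String, child: Plan) extends Plan {
    def isStreaming: Boolean = child.isStreaming
  }

  final case class EventTimeWatermark(column: String, delayMs: Long, child: Plan)
      extends Plan {
    def isStreaming: Boolean = child.isStreaming
  }

  // Assumption: a watermark over a batch child is dropped and replaced by the
  // child; a watermark over a streaming child is kept as-is.
  def eliminateBatchWatermark(plan: Plan): Plan = plan match {
    case EventTimeWatermark(_, _, child) if !child.isStreaming =>
      eliminateBatchWatermark(child)
    case w: EventTimeWatermark =>
      w.copy(child = eliminateBatchWatermark(w.child))
    case f: Filter =>
      f.copy(child = eliminateBatchWatermark(f.child))
    case r: Relation =>
      r
  }

  def main(args: Array[String]): Unit = {
    val batch = Relation("events", isStreaming = false)
    val stream = Relation("events", isStreaming = true)
    println(eliminateBatchWatermark(Filter("x > 1", EventTimeWatermark("ts", 1000L, batch))))
    println(eliminateBatchWatermark(EventTimeWatermark("ts", 1000L, stream)))
  }
}
```

In this toy version the batch plan loses its watermark node while the streaming plan is returned unchanged.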





[GitHub] spark issue #15259: [SPARK-17685][SQL] Make SortMergeJoinExec's currentVars ...

2017-05-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15259
  
**[Test build #76645 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76645/testReport)**
 for PR 15259 at commit 
[`2bb54b5`](https://github.com/apache/spark/commit/2bb54b569fcaf3c431bf792f594c485064d3cd37).





[GitHub] spark issue #15259: [SPARK-17685][SQL] Make SortMergeJoinExec's currentVars ...

2017-05-08 Thread wangyum
Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/15259
  
Jenkins, retest this please





[GitHub] spark issue #17885: [SPARK-20627][PYSPARK] Drop the hadoop distirbution name...

2017-05-08 Thread holdenk
Github user holdenk commented on the issue:

https://github.com/apache/spark/pull/17885
  
If there are no other comments I'm going to merge this tomorrow.





[GitHub] spark issue #17303: [SPARK-19112][CORE] add codec for ZStandard

2017-05-08 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/17303
  
@Cyan4973 I quickly checked again;
```
scaleFactor: 4
AWS instance: c4.4xlarge

// In this bench, I used `local-cluster` (`local` used in the benchmark above)
./bin/spark-shell --master local-cluster[4,4,7500] \
  --conf spark.driver.memory=1g \
  --conf spark.executor.memory=7g \
  --conf spark.io.compression.codec=xxx

--- zstd (level=3)
Running execution q4-v1.4 iteration: 1, StandardRun=true
Execution time: 36.517211838s
Running execution q4-v1.4 iteration: 2, StandardRun=true
Execution time: 25.026869575s   

Running execution q4-v1.4 iteration: 3, StandardRun=true
Execution time: 24.370711575s   


--- zstd (level=1)
Running execution q4-v1.4 iteration: 1, StandardRun=true
Execution time: 29.654705815s
Running execution q4-v1.4 iteration: 2, StandardRun=true
Execution time: 20.638918335s
Running execution q4-v1.4 iteration: 3, StandardRun=true
Execution time: 19.92873075897s

--- lz4
Running execution q4-v1.4 iteration: 1, StandardRun=true
Execution time: 27.422360631s
Running execution q4-v1.4 iteration: 2, StandardRun=true
Execution time: 17.38519278s
Running execution q4-v1.4 iteration: 3, StandardRun=true
Execution time: 15.779084563s

--- snappy
Running execution q4-v1.4 iteration: 1, StandardRun=true
Execution time: 27.47656952102s
Running execution q4-v1.4 iteration: 2, StandardRun=true
Execution time: 16.438640631s   

Running execution q4-v1.4 iteration: 3, StandardRun=true
Execution time: 14.949329456s

--- lzf
Running execution q4-v1.4 iteration: 1, StandardRun=true
Execution time: 27.853010073s
Running execution q4-v1.4 iteration: 2, StandardRun=true
Execution time: 17.43123253203s
Running execution q4-v1.4 iteration: 3, StandardRun=true
Execution time: 15.91656989699s
```
`zstd` was still slower than the others.
I'm not sure, though; there might be cases where `zstd` beats the 
others on a larger data set.
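
To put a number on "slower", the third-iteration times from the log above can be compared against the fastest codec. This is a rough sketch: it uses only the single third-iteration figure per codec copied from the log, and the object and function names are made up for illustration.

```scala
object CodecTimingSketch {
  // Third-iteration execution times in seconds, copied from the log above.
  val thirdIteration: Seq[(String, Double)] = Seq(
    "zstd (level=3)" -> 24.370711575,
    "zstd (level=1)" -> 19.92873075897,
    "lz4" -> 15.779084563,
    "snappy" -> 14.949329456,
    "lzf" -> 15.91656989699
  )

  // Ratio of each codec's time to the fastest codec's time.
  def relativeToFastest(times: Seq[(String, Double)]): Seq[(String, Double)] = {
    val fastest = times.map(_._2).min
    times.map { case (name, seconds) => name -> seconds / fastest }
  }

  def main(args: Array[String]): Unit =
    relativeToFastest(thirdIteration).foreach { case (name, ratio) =>
      println(f"$name%-15s $ratio%.2fx")
    }
}
```

A single run per configuration is not enough to draw firm conclusions, so the ratios are best read as a rough indication only.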





[GitHub] spark issue #17858: [SPARK-20594][SQL]The staging directory should be a chil...

2017-05-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17858
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76644/
Test FAILed.





[GitHub] spark issue #17858: [SPARK-20594][SQL]The staging directory should be a chil...

2017-05-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17858
  
**[Test build #76644 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76644/testReport)**
 for PR 17858 at commit 
[`6b1b153`](https://github.com/apache/spark/commit/6b1b153e1ee9ec3e7830158d8f8eb274970929ae).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17858: [SPARK-20594][SQL]The staging directory should be a chil...

2017-05-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17858
  
Merged build finished. Test FAILed.





[GitHub] spark issue #17879: [SPARK-20619][ML] StringIndexer supports multiple ways t...

2017-05-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17879
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17879: [SPARK-20619][ML] StringIndexer supports multiple ways t...

2017-05-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17879
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76628/
Test PASSed.





[GitHub] spark issue #17858: [SPARK-20594][SQL]The staging directory should be a chil...

2017-05-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17858
  
**[Test build #76644 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76644/testReport)**
 for PR 17858 at commit 
[`6b1b153`](https://github.com/apache/spark/commit/6b1b153e1ee9ec3e7830158d8f8eb274970929ae).





[GitHub] spark issue #17879: [SPARK-20619][ML] StringIndexer supports multiple ways t...

2017-05-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17879
  
**[Test build #76628 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76628/testReport)**
 for PR 17879 at commit 
[`53381ea`](https://github.com/apache/spark/commit/53381ea6ba41cc26ed89a6fc42252f7126198d9f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #17910: [SPARK-20669][ML] LogisticRegression family shoul...

2017-05-08 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/17910#discussion_r115416085
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -526,7 +526,7 @@ class LogisticRegression @Since("1.2.0") (
   case None => histogram.length
 }
 
-val isMultinomial = $(family) match {
+val isMultinomial = $(family).toLowerCase(Locale.ROOT) match {
--- End diff --

As a general practice, I would recommend moving the 
`.toLowerCase(Locale.ROOT)` call into the setter. Then we don't need to invoke 
`.toLowerCase(Locale.ROOT)` multiple times in the code (here it happens to be 
only once), and we can always assume that `$(family)` has predictable values in the 
code.
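
The two options under discussion can be sketched side by side. The classes below are hypothetical stand-ins, not Spark's `Params` API, and `isMultinomial` is simplified to a plain string comparison; the sketch only shows where the lowercasing happens and what the getter returns in each case.

```scala
import java.util.Locale

// Hypothetical stand-ins for an estimator with a string `family` param;
// this is NOT Spark's Params API.
object FamilySetterSketch {
  val supportedFamilyNames: Set[String] = Set("auto", "binomial", "multinomial")

  // Option A: normalize once in the setter. Every later read sees a
  // lowercased value, including the public getter.
  final class NormalizeInSetter {
    private var family: String = "auto"

    def setFamily(value: String): this.type = {
      val lowered = value.toLowerCase(Locale.ROOT)
      require(supportedFamilyNames.contains(lowered), s"Unsupported family: $value")
      family = lowered
      this
    }

    def getFamily: String = family
    def isMultinomial: Boolean = family == "multinomial"
  }

  // Option B: store the value as given and normalize at each use site.
  // The getter returns exactly what the caller passed in.
  final class NormalizeAtUse {
    private var family: String = "auto"

    def setFamily(value: String): this.type = {
      require(
        supportedFamilyNames.contains(value.toLowerCase(Locale.ROOT)),
        s"Unsupported family: $value")
      family = value
      this
    }

    def getFamily: String = family
    def isMultinomial: Boolean = family.toLowerCase(Locale.ROOT) == "multinomial"
  }

  def main(args: Array[String]): Unit = {
    val a = new NormalizeInSetter().setFamily("Multinomial")
    val b = new NormalizeAtUse().setFamily("Multinomial")
    println(s"setter-normalized:   getFamily=${a.getFamily}, isMultinomial=${a.isMultinomial}")
    println(s"use-site-normalized: getFamily=${b.getFamily}, isMultinomial=${b.isMultinomial}")
  }
}
```

Both variants accept mixed-case input; they differ only in whether the getter echoes the caller's spelling or the normalized one.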





[GitHub] spark pull request #17910: [SPARK-20669][ML] LogisticRegression family shoul...

2017-05-08 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/17910#discussion_r115416204
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -890,7 +890,7 @@ object LogisticRegression extends DefaultParamsReadable[LogisticRegression] {
   override def load(path: String): LogisticRegression = super.load(path)
 
   private[classification] val supportedFamilyNames =
-Array("auto", "binomial", "multinomial").map(_.toLowerCase(Locale.ROOT))
+Array("auto", "binomial", "multinomial")
--- End diff --

We may need to be careful about removing the map, since `Locale.ROOT` can be a 
special case.
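
As background on why the locale argument matters, here is a small sketch using only the JDK's `String.toLowerCase(Locale)`. The Turkish locale is chosen as a well-known example of locale-sensitive case mapping; the object name is made up for illustration.

```scala
import java.util.Locale

object LocaleLowerCaseSketch {
  // Turkish is a well-known locale with special casing rules for 'I'.
  val turkish: Locale = Locale.forLanguageTag("tr")

  def main(args: Array[String]): Unit = {
    val family = "MULTINOMIAL"

    // Locale.ROOT applies locale-independent case mapping: 'I' becomes 'i'.
    println(family.toLowerCase(Locale.ROOT))

    // Under Turkish rules 'I' lowercases to dotless 'ı' (U+0131), so the
    // result no longer equals the ASCII string "multinomial".
    println(family.toLowerCase(turkish))
  }
}
```

For literals that are already lowercase ASCII, such as the three names in the diff, lowercasing with `Locale.ROOT` leaves the strings unchanged. The locale matters for user-supplied values such as `"MULTINOMIAL"`.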





[GitHub] spark issue #17876: [SPARK-20569][SQL] RuntimeReplaceable functions should n...

2017-05-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17876
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76616/
Test FAILed.





[GitHub] spark issue #17876: [SPARK-20569][SQL] RuntimeReplaceable functions should n...

2017-05-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17876
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17876: [SPARK-20569][SQL] RuntimeReplaceable functions should n...

2017-05-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17876
  
**[Test build #76616 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76616/testReport)**
 for PR 17876 at commit 
[`601e988`](https://github.com/apache/spark/commit/601e98813f59b98e6a0f10aeea5bfc0e1e6571a1).
 * This patch **fails SparkR unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17913: [SPARK-20672][SS] Keep the `isStreaming` property in tri...

2017-05-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17913
  
**[Test build #76643 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76643/testReport)**
 for PR 17913 at commit 
[`8cee88e`](https://github.com/apache/spark/commit/8cee88e36092ee568c61a68c5a9ce97cda58839c).





[GitHub] spark issue #17915: [SPARK-20674][SQL] Support registering UserDefinedFuncti...

2017-05-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17915
  
**[Test build #76642 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76642/testReport)**
 for PR 17915 at commit 
[`55421ea`](https://github.com/apache/spark/commit/55421ea99a97c6820169a22b1a5bfc00318ac66b).





[GitHub] spark issue #17894: [SPARK-17134][ML] Use level 2 BLAS operations in Logisti...

2017-05-08 Thread VinceShieh
Github user VinceShieh commented on the issue:

https://github.com/apache/spark/pull/17894
  
@hhbyyh performance testing is ongoing, thanks!





[GitHub] spark pull request #17894: [SPARK-17134][ML] Use level 2 BLAS operations in ...

2017-05-08 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request:

https://github.com/apache/spark/pull/17894#discussion_r115415823
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -1722,25 +1723,22 @@ private class LogisticAggregator(
 var maxMargin = Double.NegativeInfinity
 
 val margins = new Array[Double](numClasses)
+val featureStdArray = new Array[Double](features.size)
--- End diff --

Agreed. Still, we will benchmark on a sparse dataset; if this change 
hurts performance for sparse data, we will bypass it for sparse data.





[GitHub] spark pull request #17896: [SPARK-20373][SQL][SS] Batch queries with 'Datase...

2017-05-08 Thread uncleGen
Github user uncleGen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17896#discussion_r115415803
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -2457,6 +2457,19 @@ object CleanupAliases extends Rule[LogicalPlan] {
 }
 
 /**
+ * Ignore event time watermark in batch query, which is only supported in Structured Streaming.
+ * TODO: add this rule into analyzer rule list.
+ */
+object CheckEventTimeWatermark extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+case EventTimeWatermark(_, _, child) if !child.isStreaming =>
+  logWarning("EventTime watermark is only supported in Structured Streaming but found " +
--- End diff --

got 





[GitHub] spark pull request #17865: [SPARK-20456][Docs] Add examples for functions co...

2017-05-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17865#discussion_r115415748
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -1120,12 +1159,12 @@ def from_utc_timestamp(timestamp, tz):
 @since(1.5)
 def to_utc_timestamp(timestamp, tz):
 """
-Given a timestamp, which corresponds to a certain time of day in the 
given timezone, returns
-another timestamp that corresponds to the same time of day in UTC.
+Given a `timestamp`, which corresponds to a time of day in the 
timezone `tz`,
--- End diff --

No, I don't think we have a rule about this, as far as I know. Thank you for 
the pointers and for looking into this. Let's follow the majority for now.





[GitHub] spark pull request #17865: [SPARK-20456][Docs] Add examples for functions co...

2017-05-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17865#discussion_r115415714
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -1120,12 +1159,12 @@ def from_utc_timestamp(timestamp, tz):
 @since(1.5)
 def to_utc_timestamp(timestamp, tz):
 """
-Given a timestamp, which corresponds to a certain time of day in the 
given timezone, returns
-another timestamp that corresponds to the same time of day in UTC.
+Given a `timestamp`, which corresponds to a time of day in the 
timezone `tz`,
--- End diff --

No, I don't think we have a rule about this, as far as I know. Thank you for 
the pointers and for looking into this. Let's follow the majority for now.





[GitHub] spark pull request #17896: [SPARK-20373][SQL][SS] Batch queries with 'Datase...

2017-05-08 Thread uncleGen
Github user uncleGen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17896#discussion_r115415668
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -2457,6 +2457,19 @@ object CleanupAliases extends Rule[LogicalPlan] {
 }
 
 /**
+ * Ignore event time watermark in batch query, which is only supported in 
Structured Streaming.
+ * TODO: add this rule into analyzer rule list.
+ */
+object CheckEventTimeWatermark extends Rule[LogicalPlan] {
--- End diff --

@zsxwing This PR does some preparatory work before we add 
`EliminateEventTimeWatermark` into `Analyzer.batches`. Could you please take a 
look?





[GitHub] spark pull request #17915: [SPARK-20674][SQL] Support registering UserDefine...

2017-05-08 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/17915

[SPARK-20674][SQL] Support registering UserDefinedFunction as named UDF

## What changes were proposed in this pull request?
For some reason we don't have an API to register a UserDefinedFunction as a 
named UDF. It is a no-brainer to add one, in addition to the existing register 
functions we have.

## How was this patch tested?
Added a test case in UDFSuite for the new API.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark SPARK-20674

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17915.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17915









[GitHub] spark pull request #17894: [SPARK-17134][ML] Use level 2 BLAS operations in ...

2017-05-08 Thread VinceShieh
Github user VinceShieh commented on a diff in the pull request:

https://github.com/apache/spark/pull/17894#discussion_r115415580
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -23,6 +23,7 @@ import scala.collection.mutable
 
 import breeze.linalg.{DenseVector => BDV}
 import breeze.optimize.{CachedDiffFunction, DiffFunction, LBFGS => 
BreezeLBFGS, LBFGSB => BreezeLBFGSB, OWLQN => BreezeOWLQN}
+import com.github.fommil.netlib.BLAS.{getInstance => blas}
--- End diff --

MLlib BLAS doesn't have `ger` support. We could, of course, add `ger` support 
to MLlib BLAS for this issue.
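
For readers unfamiliar with `ger`: it is the level 2 BLAS rank-1 update `A := alpha * x * y^T + A`. The following is only a minimal pure-Python sketch of that operation, not the netlib or MLlib implementation. The function name and the row-major list-of-lists layout are assumptions made for illustration (real BLAS uses column-major arrays).

```python
def ger(alpha, x, y, a):
    """Rank-1 update A := alpha * x * y^T + A, in place.

    `a` is a row-major list of lists with len(x) rows and len(y) columns.
    This layout is an illustrative assumption; BLAS itself is column-major.
    """
    for i, xi in enumerate(x):
        row = a[i]
        for j, yj in enumerate(y):
            row[j] += alpha * xi * yj
    return a


a = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
ger(2.0, [1.0, 2.0], [1.0, 0.0, 3.0], a)
print(a)  # [[2.0, 0.0, 6.0], [4.0, 0.0, 12.0]]
```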





[GitHub] spark pull request #17858: [SPARK-20594][SQL]The staging directory should be...

2017-05-08 Thread zuotingbing
Github user zuotingbing commented on a diff in the pull request:

https://github.com/apache/spark/pull/17858#discussion_r115415586
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala
 ---
@@ -97,12 +97,23 @@ case class InsertIntoHiveTable(
 val inputPathUri: URI = inputPath.toUri
 val inputPathName: String = inputPathUri.getPath
 val fs: FileSystem = inputPath.getFileSystem(hadoopConf)
-val stagingPathName: String =
+var stagingPathName: String =
   if (inputPathName.indexOf(stagingDir) == -1) {
 new Path(inputPathName, stagingDir).toString
   } else {
 inputPathName.substring(0, inputPathName.indexOf(stagingDir) + 
stagingDir.length)
   }
+
+// SPARK-20594: The staging directory should be a child directory 
starts with "." to avoid
+// being deleted if we set hive.exec.stagingdir under the table 
directory.
+if (FileUtils.isSubDir(new Path(stagingPathName), inputPath, fs)
+  && !stagingPathName.stripPrefix(inputPathName).startsWith(".")) {
--- End diff --

Sorry, I do not follow your logic. Correct me if I'm wrong, but wasn't the 
logic of dropping the created staging directory already there before, with 
`fs.deleteOnExit(dir)`?
As @cloud-fan said, this patch seems to be a valid workaround in Spark SQL for 
this case.
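
To make the condition in the diff above concrete, here is a rough Python sketch of the check only: the staging path lies under the input path, but its first component below the input path does not start with ".". The function name and the use of `posixpath` are assumptions for illustration. The Hadoop `FileUtils.isSubDir` semantics may differ in edge cases, and what the patch does once the condition holds is not shown in the quoted hunk.

```python
import posixpath


def staging_dir_needs_dot_prefix(input_path, staging_path):
    """Return True if staging_path is below input_path but is not hidden.

    Hypothetical helper mirroring the condition in the diff, not Spark code.
    """
    rel = posixpath.relpath(staging_path, input_path)
    # "." means the same directory; ".." or "../..." means outside input_path.
    is_sub_dir = rel != "." and rel != ".." and not rel.startswith("../")
    first_component = rel.split("/", 1)[0]
    return is_sub_dir and not first_component.startswith(".")


print(staging_dir_needs_dot_prefix("/warehouse/t", "/warehouse/t/staging"))        # True
print(staging_dir_needs_dot_prefix("/warehouse/t", "/warehouse/t/.hive-staging"))  # False
print(staging_dir_needs_dot_prefix("/warehouse/t", "/tmp/hive/staging"))           # False
```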





[GitHub] spark issue #17914: [SPARK-20673][ML] LDA `optimizer` do not really support ...

2017-05-08 Thread zhengruifeng
Github user zhengruifeng commented on the issue:

https://github.com/apache/spark/pull/17914
  
@yanboliang 





[GitHub] spark pull request #17913: [SPARK-20672][SS] Keep the `isStreaming` property...

2017-05-08 Thread uncleGen
Github user uncleGen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17913#discussion_r115415483
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingRelation.scala
 ---
@@ -64,8 +64,20 @@ case class StreamingRelationExec(sourceName: String, 
output: Seq[Attribute]) ext
   }
 }
 
-object StreamingExecutionRelation {
-  def apply(source: Source): StreamingExecutionRelation = {
-StreamingExecutionRelation(source, source.schema.toAttributes)
+case class StreamingRelationWrapper(child: LogicalPlan) extends UnaryNode {
+  override def isStreaming: Boolean = true
+  override def output: Seq[Attribute] = child.output
+}
+
--- End diff --

Add a new `StreamingRelationWrapper` relation to wrap the internal relation 
in each trigger. It keeps the `isStreaming` property.
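
As a language-neutral illustration of the idea, here is a toy Python model of a plan tree. It assumes that a node is streaming when any of its children is, so a per-trigger plan built on a batch relation loses the flag unless a wrapper forces it. All class names here are hypothetical stand-ins, not Catalyst classes.

```python
class Plan:
    """Toy logical plan node: streaming if any child is streaming (assumed)."""

    def __init__(self, *children):
        self.children = children

    @property
    def is_streaming(self):
        return any(child.is_streaming for child in self.children)


class BatchRelation(Plan):
    """Leaf relation holding the data of a single trigger."""


class Project(Plan):
    """Any ordinary operator that inherits the flag from its children."""


class StreamingRelationWrapper(Plan):
    """Wraps the per-trigger relation and always reports streaming."""

    @property
    def is_streaming(self):
        return True


trigger_plan = Project(BatchRelation())
wrapped_plan = Project(StreamingRelationWrapper(BatchRelation()))
print(trigger_plan.is_streaming)  # False
print(wrapped_plan.is_streaming)  # True
```

With the wrapper in place, a rule guarded by a non-streaming check would skip the wrapped plan, which is the behaviour the comment describes.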





[GitHub] spark issue #17896: [SPARK-20373][SQL][SS] Batch queries with 'Dataset/DataF...

2017-05-08 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/17896
  
Depends upon: 
[SPARK-20672](https://issues.apache.org/jira/browse/SPARK-20672)





[GitHub] spark pull request #17865: [SPARK-20456][Docs] Add examples for functions co...

2017-05-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17865#discussion_r115415128
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -153,7 +173,7 @@ def _():
 # math functions that take two arguments as input
 _binary_mathfunctions = {
 'atan2': 'Returns the angle theta from the conversion of rectangular 
coordinates (x, y) to' +
- 'polar coordinates (r, theta).',
+ 'polar coordinates (r, theta). Units in radians.',
--- End diff --

I see. What do you think about adding this in `:param`?
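
For reference, the "units in radians" point can be checked with Python's standard `math.atan2`, which takes `(y, x)` and returns the angle in radians. This is a plain-Python illustration, not the PySpark column function.

```python
import math

# Point (x, y) = (1, 1): the polar angle is pi / 4 radians.
theta = math.atan2(1.0, 1.0)  # arguments are (y, x)
print(theta)                  # about 0.785, i.e. pi / 4
print(math.degrees(theta))    # about 45 degrees
```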





[GitHub] spark issue #17913: [SPARK-20672][SS] Keep the `isStreaming` property in tri...

2017-05-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17913
  
**[Test build #76640 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76640/testReport)**
 for PR 17913 at commit 
[`20648d9`](https://github.com/apache/spark/commit/20648d99b1b95ea074be56708f13901bba2ee10d).





[GitHub] spark issue #17644: [SPARK-17729] [SQL] Enable creating hive bucketed tables

2017-05-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17644
  
Merged build finished. Test FAILed.





[GitHub] spark issue #17644: [SPARK-17729] [SQL] Enable creating hive bucketed tables

2017-05-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17644
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76613/
Test FAILed.





[GitHub] spark issue #17901: [SPARK-20639][SQL] Add single argument support for to_ti...

2017-05-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17901
  
**[Test build #76641 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76641/testReport)**
 for PR 17901 at commit 
[`fc02460`](https://github.com/apache/spark/commit/fc02460c5d014c573631f3b62cd6b62f5a46c261).





[GitHub] spark issue #17914: [SPARK-20673][ML] LDA `optimizer` do not really support ...

2017-05-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17914
  
**[Test build #76639 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76639/testReport)**
 for PR 17914 at commit 
[`b48f760`](https://github.com/apache/spark/commit/b48f7601408a005e773216bc67935c73f7f59324).





[GitHub] spark issue #17644: [SPARK-17729] [SQL] Enable creating hive bucketed tables

2017-05-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17644
  
**[Test build #76613 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76613/testReport)**
 for PR 17644 at commit 
[`49040e8`](https://github.com/apache/spark/commit/49040e83217a787f7a995f9da941617885e10821).
 * This patch **fails SparkR unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #17914: [SPARK-20673][ML] LDA `optimizer` do not really s...

2017-05-08 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request:

https://github.com/apache/spark/pull/17914

[SPARK-20673][ML] LDA `optimizer` do not really support case insensitive 

## What changes were proposed in this pull request?
cast to lower case in `getOptimizer`

## How was this patch tested?
updated tests
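
A minimal sketch of the intended behaviour, in plain Python rather than the Scala of the patch: the optimizer name is lower-cased before it is matched. The helper name is hypothetical, and the supported set `("online", "em")` is an assumption of this sketch.

```python
# Assumed set of supported optimizer names, for illustration only.
SUPPORTED_OPTIMIZERS = ("online", "em")


def normalize_optimizer(name):
    """Lower-case the optimizer name so that matching is case-insensitive."""
    lowered = name.lower()
    if lowered not in SUPPORTED_OPTIMIZERS:
        raise ValueError(f"Unsupported optimizer: {name!r}")
    return lowered


print(normalize_optimizer("Online"))  # online
print(normalize_optimizer("EM"))      # em
```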

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhengruifeng/spark lda_optimizer_case

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17914.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17914


commit b48f7601408a005e773216bc67935c73f7f59324
Author: Zheng RuiFeng 
Date:   2017-05-09T06:17:51Z

create pr







[GitHub] spark issue #17666: [SPARK-20311][SQL] Support aliases for table value funct...

2017-05-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17666
  
Merged build finished. Test FAILed.





[GitHub] spark issue #17666: [SPARK-20311][SQL] Support aliases for table value funct...

2017-05-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17666
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76615/
Test FAILed.





[GitHub] spark issue #17666: [SPARK-20311][SQL] Support aliases for table value funct...

2017-05-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17666
  
**[Test build #76615 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76615/testReport)**
 for PR 17666 at commit 
[`81bef3b`](https://github.com/apache/spark/commit/81bef3ba21cb0c3e4b36f3fc492d9ab3a3124829).
 * This patch **fails SparkR unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #17913: [SPARK-20672][SS] Keep the `isStreaming` property...

2017-05-08 Thread uncleGen
GitHub user uncleGen opened a pull request:

https://github.com/apache/spark/pull/17913

[SPARK-20672][SS] Keep the `isStreaming` property in triggerLogicalPlan in 
Structured Streaming

## What changes were proposed in this pull request?

In Structured Streaming, the `isStreaming` property is eliminated in each 
triggerLogicalPlan. As a result, some rules are mistakenly applied to this 
triggerLogicalPlan. So we should refactor the existing code to better execute 
batch queries and Structured Streaming queries.

## How was this patch tested?

Existing unit tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/uncleGen/spark SPARK-20672

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17913.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17913


commit d1c4cbf0fa369db993855ef3f63b05561cf6662a
Author: uncleGen 
Date:   2017-05-09T06:01:51Z

Keep the `streaming` property in triggerLogicalPlan in Structured Streaming

commit 20648d99b1b95ea074be56708f13901bba2ee10d
Author: uncleGen 
Date:   2017-05-09T06:18:50Z

update







[GitHub] spark pull request #17901: [SPARK-20639][SQL] Add single argument support fo...

2017-05-08 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17901#discussion_r115414505
  
--- Diff: R/pkg/R/functions.R ---
@@ -1757,7 +1757,8 @@ setMethod("toRadians",
 #' 
\url{http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html}.
 #' If the string cannot be parsed according to the specified format (or 
default),
 #' the value of the column will be null.
-#' The default format is 'yyyy-MM-dd'.
+#' By default, it follows casting rules to a DateType if the format is 
omitted
+#' (equivalent with \code{cast(df$x, "date")}).
--- End diff --

@felixcheung, I added an example here. Would this be enough?





[GitHub] spark pull request #17879: [SPARK-20619][ML] StringIndexer supports multiple...

2017-05-08 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/17879#discussion_r115414436
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
@@ -59,6 +59,29 @@ private[feature] trait StringIndexerBase extends Params 
with HasInputCol with Ha
   @Since("1.6.0")
   def getHandleInvalid: String = $(handleInvalid)
 
+  /**
+   * Param for how to order labels of string column. The first label after 
ordering is assigned
+   * an index of 0.
+   * Options are:
+   *   - 'frequencyDesc': descending order by label frequency (most 
frequent label assigned 0)
+   *   - 'frequencyAsc': ascending order by label frequency (least 
frequent label assigned 0)
+   *   - 'alphabetDesc': descending alphabetical order
+   *   - 'alphabetAsc': ascending alphabetical order
+   * Default is 'frequencyDesc'.
+   *
+   * @group param
+   */
+  @Since("2.3.0")
+  final val stringOrderType: Param[String] = new Param(this, 
"stringOrderType",
+"how to order labels of string column. " +
+"The first label after ordering is assigned an index of 0. " +
+s"Supported options: 
${StringIndexer.supportedStringOrderType.mkString(", ")}.",
+ParamValidators.inArray(StringIndexer.supportedStringOrderType))
--- End diff --

@felixcheung Right. It does not quite make sense to be case-insensitive now, 
given that we use camel case.
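
To illustrate the four options documented in the diff above, here is a small pure-Python sketch of how labels could be ordered and indexed. It is not the StringIndexer implementation: the function name is hypothetical, tie-breaking is whatever Python's stable sort yields, and option names are matched case-sensitively, as discussed.

```python
from collections import Counter


def order_labels(values, string_order_type="frequencyDesc"):
    """Return the distinct labels in the order that assigns index 0 first."""
    counts = Counter(values)
    labels = list(counts)
    if string_order_type == "frequencyDesc":
        return sorted(labels, key=counts.get, reverse=True)
    if string_order_type == "frequencyAsc":
        return sorted(labels, key=counts.get)
    if string_order_type == "alphabetDesc":
        return sorted(labels, reverse=True)
    if string_order_type == "alphabetAsc":
        return sorted(labels)
    # Case-sensitive match: "frequencydesc" is rejected in this sketch.
    raise ValueError(f"Unsupported stringOrderType: {string_order_type!r}")


data = ["b", "a", "b", "c", "b", "a"]
index = {label: i for i, label in enumerate(order_labels(data))}
print(index)                               # {'b': 0, 'a': 1, 'c': 2}
print(order_labels(data, "alphabetDesc"))  # ['c', 'b', 'a']
```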





[GitHub] spark issue #17901: [SPARK-20639][SQL] Add single argument support for to_ti...

2017-05-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17901
  
**[Test build #76638 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76638/testReport)**
 for PR 17901 at commit 
[`b6f867c`](https://github.com/apache/spark/commit/b6f867cd87e46ca2daf74eabce14b735a962c9a4).





[GitHub] spark pull request #17879: [SPARK-20619][ML] StringIndexer supports multiple...

2017-05-08 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/17879#discussion_r115414165
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
@@ -59,6 +59,29 @@ private[feature] trait StringIndexerBase extends Params 
with HasInputCol with Ha
   @Since("1.6.0")
   def getHandleInvalid: String = $(handleInvalid)
 
+  /**
+   * Param for how to order labels of string column. The first label after 
ordering is assigned
+   * an index of 0.
+   * Options are:
+   *   - 'frequencyDesc': descending order by label frequency (most 
frequent label assigned 0)
+   *   - 'frequencyAsc': ascending order by label frequency (least 
frequent label assigned 0)
+   *   - 'alphabetDesc': descending alphabetical order
+   *   - 'alphabetAsc': ascending alphabetical order
+   * Default is 'frequencyDesc'.
+   *
+   * @group param
+   */
+  @Since("2.3.0")
+  final val stringOrderType: Param[String] = new Param(this, 
"stringOrderType",
+"how to order labels of string column. " +
+"The first label after ordering is assigned an index of 0. " +
+s"Supported options: 
${StringIndexer.supportedStringOrderType.mkString(", ")}.",
+ParamValidators.inArray(StringIndexer.supportedStringOrderType))
--- End diff --

So we are going to be case-sensitive then?





[GitHub] spark pull request #17879: [SPARK-20619][ML] StringIndexer supports multiple...

2017-05-08 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/17879#discussion_r115413770
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
@@ -131,6 +167,12 @@ object StringIndexer extends 
DefaultParamsReadable[StringIndexer] {
   private[feature] val KEEP_INVALID: String = "keep"
   private[feature] val supportedHandleInvalids: Array[String] =
 Array(SKIP_INVALID, ERROR_INVALID, KEEP_INVALID)
+  private[feature] val FREQ_DESC: String = "frequency_desc"
+  private[feature] val FREQ_ASC: String = "frequency_asc"
+  private[feature] val ALPHABET_DESC: String = "alphabet_desc"
+  private[feature] val ALPHABET_ASC: String = "alphabet_asc"
--- End diff --

@gatorsmile Thanks much for the suggestion. Changed them to lowerCamelCase. 
@felixcheung Any additional suggestions? 






[GitHub] spark issue #17901: [SPARK-20639][SQL] Add single argument support for to_ti...

2017-05-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17901
  
**[Test build #76636 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76636/testReport)**
 for PR 17901 at commit 
[`497a229`](https://github.com/apache/spark/commit/497a22965af3a74e89c73b60667ab19fecb0af39).





[GitHub] spark issue #17862: [SPARK-20602] [ML]Adding LBFGS as optimizer for LinearSV...

2017-05-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17862
  
**[Test build #76637 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76637/testReport)**
 for PR 17862 at commit 
[`8a7c10f`](https://github.com/apache/spark/commit/8a7c10f5bc0d7234ed6e156c98f04bddb7a37204).





[GitHub] spark issue #17879: [SPARK-20619][ML] StringIndexer supports multiple ways t...

2017-05-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17879
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17879: [SPARK-20619][ML] StringIndexer supports multiple ways t...

2017-05-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17879
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76624/
Test PASSed.





[GitHub] spark issue #17879: [SPARK-20619][ML] StringIndexer supports multiple ways t...

2017-05-08 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/17879
  
LGTM





[GitHub] spark issue #17879: [SPARK-20619][ML] StringIndexer supports multiple ways t...

2017-05-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17879
  
**[Test build #76624 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76624/testReport)**
 for PR 17879 at commit 
[`07198d9`](https://github.com/apache/spark/commit/07198d9bb45a54d3c257ad37e772cc31154ffcb6).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #16989: [SPARK-19659] Fetch big blocks to disk when shuff...

2017-05-08 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/16989#discussion_r115409386
  
--- Diff: core/src/main/scala/org/apache/spark/memory/MemoryManager.scala ---
@@ -54,7 +54,8 @@ private[spark] abstract class MemoryManager(
   onHeapStorageMemoryPool.incrementPoolSize(onHeapStorageMemory)
   onHeapExecutionMemoryPool.incrementPoolSize(onHeapExecutionMemory)
 
-  protected[this] val maxOffHeapMemory = conf.getSizeAsBytes("spark.memory.offHeap.size", 0)
+  protected[this] val maxOffHeapMemory =
+conf.getSizeAsBytes("spark.memory.offHeap.size", 384 * 1024 * 1024)
--- End diff --

Maybe I missed the discussion, but why was this changed?





[GitHub] spark pull request #16989: [SPARK-19659] Fetch big blocks to disk when shuff...

2017-05-08 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/16989#discussion_r115411681
  
--- Diff: core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala ---
@@ -154,15 +164,24 @@ final class ShuffleBlockFetcherIterator(
 while (iter.hasNext) {
   val result = iter.next()
   result match {
-case SuccessFetchResult(_, address, _, buf, _) =>
+case SuccessFetchResult(_, address, size, buf, _) =>
   if (address != blockManager.blockManagerId) {
 shuffleMetrics.incRemoteBytesRead(buf.size)
 shuffleMetrics.incRemoteBlocksFetched(1)
   }
   buf.release()
+  freeMemory(size)
 case _ =>
   }
 }
+shuffleFiles.foreach { shuffleFile =>
+  try {
+shuffleFile.delete()
+  } catch {
+case ioe: IOException =>
+  logError(s"Failed to cleanup ${shuffleFile.getAbsolutePath}.", ioe)
--- End diff --

`File.delete()` does not throw `IOException`; it returns `false` to 
indicate that the delete failed.
The log message (INFO would do, btw) should be emitted when `delete()` 
returns `false`.
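
A minimal sketch of the return-value check described above, written in Java for illustration (the code under review is Scala). The class name `ShuffleFileCleanup`, the method `deleteAll`, and the use of `java.util.logging` are assumptions made for this example, not part of the patch:

```java
import java.io.File;
import java.util.List;
import java.util.logging.Logger;

// Hypothetical helper: the names here are illustrative, not Spark's.
final class ShuffleFileCleanup {
  private static final Logger LOG = Logger.getLogger("ShuffleFileCleanup");

  /** Deletes each file and returns how many deletions failed. */
  static int deleteAll(List<File> files) {
    int failures = 0;
    for (File file : files) {
      // File.delete() reports failure through its boolean result,
      // so there is no IOException to catch here.
      if (!file.delete()) {
        failures++;
        LOG.info("Failed to clean up " + file.getAbsolutePath());
      }
    }
    return failures;
  }
}
```

Deleting a file that does not exist also returns `false`, so a caller that treats "already gone" as success would need an extra `exists()` check.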





[GitHub] spark pull request #16989: [SPARK-19659] Fetch big blocks to disk when shuff...

2017-05-08 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/16989#discussion_r115410895
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -193,8 +206,18 @@ private[spark] object HighlyCompressedMapStatus {
 } else {
   0
 }
+val hugeBlockSizes = ArrayBuffer[Tuple2[Int, Byte]]()
+if (numNonEmptyBlocks > 0) {
+  uncompressedSizes.zipWithIndex.foreach {
+case (size, reduceId) =>
+  if (size > 2 * avgSize) {
--- End diff --

This should be configurable in two respects.
* A minimum size before we consider something a large block: if the average 
is 10kb and some blocks are > 20kb, spilling them to disk would be highly 
suboptimal (unless I missed that check somewhere else).
* The factor '2' should also be configurable: some deployments might be OK 
with high memory usage (machines provisioned accordingly), while others might 
need a lower, more aggressive threshold.
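
The two knobs suggested above could be combined roughly as follows. This is a sketch in Java for illustration, and `minHugeBlockBytes` and `avgMultiplier` are hypothetical parameter names, not existing Spark configuration keys:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: names and parameters are assumptions for this sketch.
final class HugeBlockDetector {
  /**
   * Returns reduceId -> size for the blocks treated as huge.
   * A block qualifies only if it is at least minHugeBlockBytes
   * and also larger than avgMultiplier times the average size.
   */
  static Map<Integer, Long> findHugeBlocks(
      long[] uncompressedSizes, long avgSize, long minHugeBlockBytes, double avgMultiplier) {
    Map<Integer, Long> huge = new LinkedHashMap<>();
    for (int reduceId = 0; reduceId < uncompressedSizes.length; reduceId++) {
      long size = uncompressedSizes[reduceId];
      // The absolute floor keeps small blocks (e.g. 25kb vs. a 10kb average)
      // from being tracked just because they exceed the relative threshold.
      if (size >= minHugeBlockBytes && size > avgMultiplier * avgSize) {
        huge.put(reduceId, size);
      }
    }
    return huge;
  }
}
```

With an average of 10kb, a 1MB floor, and a multiplier of 2, a 25kb block is left alone while a 5MB block is recorded.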







[GitHub] spark pull request #16989: [SPARK-19659] Fetch big blocks to disk when shuff...

2017-05-08 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/16989#discussion_r115409772
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -128,41 +130,52 @@ private[spark] class CompressedMapStatus(
  * @param numNonEmptyBlocks the number of non-empty blocks
  * @param emptyBlocks a bitmap tracking which blocks are empty
  * @param avgSize average size of the non-empty blocks
+ * @param hugeBlockSizesArray sizes of huge blocks by their reduceId.
  */
 private[spark] class HighlyCompressedMapStatus private (
 private[this] var loc: BlockManagerId,
 private[this] var numNonEmptyBlocks: Int,
 private[this] var emptyBlocks: RoaringBitmap,
-private[this] var avgSize: Long)
+private[this] var avgSize: Long,
+private[this] var hugeBlockSizesArray: Array[Tuple2[Int, Byte]])
   extends MapStatus with Externalizable {
 
+  @transient var hugeBlockSizes: Map[Int, Byte] =
--- End diff --

`private` ?





[GitHub] spark pull request #16989: [SPARK-19659] Fetch big blocks to disk when shuff...

2017-05-08 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/16989#discussion_r115410407
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -128,41 +130,52 @@ private[spark] class CompressedMapStatus(
  * @param numNonEmptyBlocks the number of non-empty blocks
  * @param emptyBlocks a bitmap tracking which blocks are empty
  * @param avgSize average size of the non-empty blocks
+ * @param hugeBlockSizesArray sizes of huge blocks by their reduceId.
  */
 private[spark] class HighlyCompressedMapStatus private (
 private[this] var loc: BlockManagerId,
 private[this] var numNonEmptyBlocks: Int,
 private[this] var emptyBlocks: RoaringBitmap,
-private[this] var avgSize: Long)
+private[this] var avgSize: Long,
+private[this] var hugeBlockSizesArray: Array[Tuple2[Int, Byte]])
   extends MapStatus with Externalizable {
 
+  @transient var hugeBlockSizes: Map[Int, Byte] =
+if (hugeBlockSizesArray == null) null else hugeBlockSizesArray.toMap
+
   // loc could be null when the default constructor is called during deserialization
   require(loc == null || avgSize > 0 || numNonEmptyBlocks == 0,
 "Average size can only be zero for map stages that produced no output")
 
-  protected def this() = this(null, -1, null, -1)  // For deserialization only
+  protected def this() = this(null, -1, null, -1, null)  // For deserialization only
 
   override def location: BlockManagerId = loc
 
   override def getSizeForBlock(reduceId: Int): Long = {
 if (emptyBlocks.contains(reduceId)) {
   0
 } else {
-  avgSize
+  hugeBlockSizes.get(reduceId) match {
--- End diff --

NPE





[GitHub] spark pull request #16989: [SPARK-19659] Fetch big blocks to disk when shuff...

2017-05-08 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/16989#discussion_r115410381
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -128,41 +130,52 @@ private[spark] class CompressedMapStatus(
  * @param numNonEmptyBlocks the number of non-empty blocks
  * @param emptyBlocks a bitmap tracking which blocks are empty
  * @param avgSize average size of the non-empty blocks
+ * @param hugeBlockSizesArray sizes of huge blocks by their reduceId.
  */
 private[spark] class HighlyCompressedMapStatus private (
 private[this] var loc: BlockManagerId,
 private[this] var numNonEmptyBlocks: Int,
 private[this] var emptyBlocks: RoaringBitmap,
-private[this] var avgSize: Long)
+private[this] var avgSize: Long,
+private[this] var hugeBlockSizesArray: Array[Tuple2[Int, Byte]])
   extends MapStatus with Externalizable {
 
+  @transient var hugeBlockSizes: Map[Int, Byte] =
+if (hugeBlockSizesArray == null) null else hugeBlockSizesArray.toMap
+
   // loc could be null when the default constructor is called during deserialization
   require(loc == null || avgSize > 0 || numNonEmptyBlocks == 0,
 "Average size can only be zero for map stages that produced no output")
 
-  protected def this() = this(null, -1, null, -1)  // For deserialization only
+  protected def this() = this(null, -1, null, -1, null)  // For deserialization only
 
   override def location: BlockManagerId = loc
 
   override def getSizeForBlock(reduceId: Int): Long = {
 if (emptyBlocks.contains(reduceId)) {
   0
 } else {
-  avgSize
+  hugeBlockSizes.get(reduceId) match {
+case Some(size) => MapStatus.decompressSize(size)
+case None => avgSize
+  }
 }
   }
 
   override def writeExternal(out: ObjectOutput): Unit = Utils.tryOrIOException {
 loc.writeExternal(out)
 emptyBlocks.writeExternal(out)
 out.writeLong(avgSize)
+out.writeObject(hugeBlockSizesArray)
   }
 
   override def readExternal(in: ObjectInput): Unit = Utils.tryOrIOException {
 loc = BlockManagerId(in)
 emptyBlocks = new RoaringBitmap()
 emptyBlocks.readExternal(in)
 avgSize = in.readLong()
+hugeBlockSizesArray = in.readObject().asInstanceOf[Array[Tuple2[Int, Byte]]]
--- End diff --

This can be null, so it needs to be handled appropriately below.
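
One way to handle the null case is to normalize it to an empty map at the point where the lookup structure is built, so later lookups never dereference null. The sketch below is in Java for illustration and uses parallel arrays rather than the `Array[Tuple2[Int, Byte]]` from the diff; the class and method names are hypothetical:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: not part of the patch under review.
final class HugeBlockSizes {
  /** Builds a reduceId -> compressed size map; null input yields an empty map. */
  static Map<Integer, Byte> toMap(int[] reduceIds, byte[] compressedSizes) {
    if (reduceIds == null || compressedSizes == null) {
      // Nothing was recorded (or nothing was deserialized): no huge blocks.
      return Collections.emptyMap();
    }
    if (reduceIds.length != compressedSizes.length) {
      throw new IllegalArgumentException("reduceIds and compressedSizes differ in length");
    }
    Map<Integer, Byte> sizes = new HashMap<>();
    for (int i = 0; i < reduceIds.length; i++) {
      sizes.put(reduceIds[i], compressedSizes[i]);
    }
    return sizes;
  }
}
```

A caller can then use `sizes.get(reduceId)` or `getOrDefault` without a null guard on the map itself.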





[GitHub] spark pull request #16989: [SPARK-19659] Fetch big blocks to disk when shuff...

2017-05-08 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/16989#discussion_r115412242
  
--- Diff: core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala ---
@@ -137,6 +146,7 @@ final class ShuffleBlockFetcherIterator(
 // Release the current buffer if necessary
 if (currentResult != null) {
   currentResult.buf.release()
+  freeMemory(currentResult.size)
--- End diff --

Only if it is in memory and not on disk?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16989: [SPARK-19659] Fetch big blocks to disk when shuff...

2017-05-08 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/16989#discussion_r115412049
  
--- Diff: core/src/main/scala/org/apache/spark/storage/ShuffleBlockFetcherIterator.scala ---
@@ -154,15 +164,24 @@ final class ShuffleBlockFetcherIterator(
 while (iter.hasNext) {
   val result = iter.next()
   result match {
-case SuccessFetchResult(_, address, _, buf, _) =>
+case SuccessFetchResult(_, address, size, buf, _) =>
   if (address != blockManager.blockManagerId) {
 shuffleMetrics.incRemoteBytesRead(buf.size)
 shuffleMetrics.incRemoteBlocksFetched(1)
   }
   buf.release()
+  freeMemory(size)
--- End diff --

Only if it was *not* fetched to disk - if it was spilled to disk, we did 
not acquire memory.
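
The accounting rule in this comment can be sketched as a tracker whose release path mirrors its acquire path. This is Java for illustration only; `FetchMemoryTracker` and the `fetchedToDisk` flag are assumptions for the example and do not reflect how the patch represents a disk-backed result:

```java
// Hypothetical tracker: illustrates symmetric acquire/free accounting.
final class FetchMemoryTracker {
  private long bytesInMemory;

  /** Records a fetched block; memory is reserved only for in-memory results. */
  void onFetched(long size, boolean fetchedToDisk) {
    if (!fetchedToDisk) {
      bytesInMemory += size;
    }
  }

  /** Releases a block; disk-backed blocks never reserved memory, so they free none. */
  void onReleased(long size, boolean fetchedToDisk) {
    if (!fetchedToDisk) {
      bytesInMemory -= size;
    }
  }

  long bytesInMemory() {
    return bytesInMemory;
  }
}
```

Freeing unconditionally would let the counter drift below the memory actually held whenever a block had been written to disk.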





[GitHub] spark pull request #16989: [SPARK-19659] Fetch big blocks to disk when shuff...

2017-05-08 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/16989#discussion_r115409268
  
--- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/OneForOneBlockFetcher.java ---
@@ -126,4 +151,39 @@ private void failRemainingBlocks(String[] failedBlockIds, Throwable e) {
   }
 }
   }
+
+  private class DownloadCallback implements StreamCallback {
+
+private WritableByteChannel channel = null;
+private File targetFile = null;
+private int chunkIndex;
+
+public DownloadCallback(File targetFile, int chunkIndex) throws IOException {
+  this.targetFile = targetFile;
+  this.channel = Channels.newChannel(new FileOutputStream(targetFile));
+  this.chunkIndex = chunkIndex;
+}
+
+@Override
+public void onData(String streamId, ByteBuffer buf) throws IOException {
+  channel.write(buf);
+}
+
+@Override
+public void onComplete(String streamId) throws IOException {
+  channel.close();
+  ManagedBuffer buffer = new FileSegmentManagedBuffer(
--- End diff --

After each corresponding ManagedBuffer is consumed, we should attempt to 
remove the corresponding file. That should be fairly straightforward, no? 
(Override `release`?)
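
The "override release" idea might look like the following. This is a sketch with a hypothetical stand-in base class, `FileBackedBuffer`; it deliberately does not model Spark's real `ManagedBuffer` API, whose method signatures are not shown in this thread:

```java
import java.io.File;

/** Hypothetical stand-in for a file-backed buffer; not Spark's ManagedBuffer. */
class FileBackedBuffer {
  protected final File file;

  FileBackedBuffer(File file) {
    this.file = file;
  }

  /** Called by the consumer once it has finished reading the buffer. */
  void release() {
    // The base class keeps the file in place.
  }
}

/** Variant that removes its backing file as soon as it is released. */
final class SelfCleaningFileBuffer extends FileBackedBuffer {
  private boolean deleted;

  SelfCleaningFileBuffer(File file) {
    super(file);
  }

  @Override
  void release() {
    // Best effort: delete() returns false on failure, in which case a
    // later bulk cleanup would still have to remove the file.
    if (!deleted) {
      deleted = file.delete();
    }
  }

  boolean isDeleted() {
    return deleted;
  }
}
```

Deleting on release bounds disk usage to the blocks currently being consumed, rather than everything fetched during the task.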






[GitHub] spark pull request #16989: [SPARK-19659] Fetch big blocks to disk when shuff...

2017-05-08 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/16989#discussion_r115407998
  
--- Diff: common/network-common/src/main/java/org/apache/spark/network/server/OneForOneStreamManager.java ---
@@ -95,6 +95,14 @@ public ManagedBuffer getChunk(long streamId, int chunkIndex) {
   }
 
   @Override
+  public ManagedBuffer openStream(String streamChunkId) {
+String[] array = streamChunkId.split("_");
--- End diff --

Instead of spreading the parsing logic around, it is better to externalize 
it into a pair of methods: one to create a streamChunkId from a streamId and 
chunkIndex, and another to parse it back.
If we have to change the delimiter or add other logic, the change will be 
easier to manage.
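
A possible shape for that pair of methods, keeping the `_` delimiter from the diff in one place. The class name `StreamChunkIds` and both method names are assumptions chosen for this sketch:

```java
// Hypothetical helper: names are illustrative, not taken from the patch.
final class StreamChunkIds {
  private static final String DELIMITER = "_";

  /** Builds the combined id from its two parts. */
  static String genStreamChunkId(long streamId, int chunkIndex) {
    return streamId + DELIMITER + chunkIndex;
  }

  /** Splits a combined id back into {streamId, chunkIndex}. */
  static long[] parseStreamChunkId(String streamChunkId) {
    String[] parts = streamChunkId.split(DELIMITER);
    if (parts.length != 2) {
      throw new IllegalArgumentException("Unexpected stream chunk id: " + streamChunkId);
    }
    // Integer.parseInt keeps the chunk index within int range.
    return new long[] {Long.parseLong(parts[0]), Integer.parseInt(parts[1])};
  }
}
```

Both the sender and `openStream` would go through these two methods, so a delimiter change touches a single constant.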





[GitHub] spark pull request #16989: [SPARK-19659] Fetch big blocks to disk when shuff...

2017-05-08 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/16989#discussion_r115409001
  
--- Diff: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/OneForOneBlockFetcher.java ---
@@ -126,4 +149,38 @@ private void failRemainingBlocks(String[] failedBlockIds, Throwable e) {
   }
 }
   }
+
+  private class DownloadCallback implements StreamCallback {
+
+private WritableByteChannel channel = null;
+private File targetFile = null;
+private int chunkIndex;
+
+public DownloadCallback(File targetFile, int chunkIndex) throws IOException {
+  this.targetFile = targetFile;
+  this.channel = Channels.newChannel(new FileOutputStream(targetFile));
+  this.chunkIndex = chunkIndex;
+}
+
+@Override
+public void onData(String streamId, ByteBuffer buf) throws IOException {
+  channel.write(buf);
--- End diff --

As an impl detail (since the channel wraps a FileOutputStream), this will 
work. In general, though, `channel.write()` need not write all 
`buf.remaining()` bytes, which actually breaks Spark code iirc, since it 
expects `onData` to completely consume the data.
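
A defensive version that does not rely on the channel's implementation is to loop until the buffer is drained. The helper name `ChannelWrites.writeFully` is an assumption for this sketch; `WritableByteChannel.write` and `ByteBuffer.hasRemaining` are standard `java.nio` calls:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;

// Hypothetical helper: illustrates draining a buffer into a channel.
final class ChannelWrites {
  /** Writes every remaining byte of buf, even if write() is partial. */
  static void writeFully(WritableByteChannel channel, ByteBuffer buf) throws IOException {
    // write() may consume fewer bytes than remain, so repeat until empty.
    while (buf.hasRemaining()) {
      channel.write(buf);
    }
  }
}
```

This assumes a blocking channel; a non-blocking channel that keeps returning 0 would make the loop spin.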





[GitHub] spark pull request #16989: [SPARK-19659] Fetch big blocks to disk when shuff...

2017-05-08 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/16989#discussion_r115409843
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -128,41 +130,52 @@ private[spark] class CompressedMapStatus(
  * @param numNonEmptyBlocks the number of non-empty blocks
  * @param emptyBlocks a bitmap tracking which blocks are empty
  * @param avgSize average size of the non-empty blocks
+ * @param hugeBlockSizesArray sizes of huge blocks by their reduceId.
  */
 private[spark] class HighlyCompressedMapStatus private (
 private[this] var loc: BlockManagerId,
 private[this] var numNonEmptyBlocks: Int,
 private[this] var emptyBlocks: RoaringBitmap,
-private[this] var avgSize: Long)
+private[this] var avgSize: Long,
+private[this] var hugeBlockSizesArray: Array[Tuple2[Int, Byte]])
--- End diff --

Why does hugeBlockSizesArray exist? Is it for efficient serialization?
If so, converting it into (Array[Int], Array[Byte]) would be better (with 
each array written directly, not as Tuple2 objects); more below, though.





[GitHub] spark pull request #16989: [SPARK-19659] Fetch big blocks to disk when shuff...

2017-05-08 Thread mridulm
Github user mridulm commented on a diff in the pull request:

https://github.com/apache/spark/pull/16989#discussion_r115410136
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala ---
@@ -128,41 +130,52 @@ private[spark] class CompressedMapStatus(
  * @param numNonEmptyBlocks the number of non-empty blocks
  * @param emptyBlocks a bitmap tracking which blocks are empty
  * @param avgSize average size of the non-empty blocks
+ * @param hugeBlockSizesArray sizes of huge blocks by their reduceId.
  */
 private[spark] class HighlyCompressedMapStatus private (
 private[this] var loc: BlockManagerId,
 private[this] var numNonEmptyBlocks: Int,
 private[this] var emptyBlocks: RoaringBitmap,
-private[this] var avgSize: Long)
+private[this] var avgSize: Long,
+private[this] var hugeBlockSizesArray: Array[Tuple2[Int, Byte]])
   extends MapStatus with Externalizable {
 
+  @transient var hugeBlockSizes: Map[Int, Byte] =
+if (hugeBlockSizesArray == null) null else hugeBlockSizesArray.toMap
+
   // loc could be null when the default constructor is called during deserialization
   require(loc == null || avgSize > 0 || numNonEmptyBlocks == 0,
 "Average size can only be zero for map stages that produced no output")
 
-  protected def this() = this(null, -1, null, -1)  // For deserialization only
+  def this() = this(null, -1, null, -1, null)  // For deserialization only
+  def this() = this(null, -1, null, -1, null)  // For deserialization only
 
   override def location: BlockManagerId = loc
 
   override def getSizeForBlock(reduceId: Int): Long = {
 if (emptyBlocks.contains(reduceId)) {
   0
 } else {
-  avgSize
+  hugeBlockSizes.get(reduceId) match {
+case Some(size) => MapStatus.decompressSize(size)
+case None => avgSize
+  }
 }
   }
 
   override def writeExternal(out: ObjectOutput): Unit = Utils.tryOrIOException {
 loc.writeExternal(out)
 emptyBlocks.writeExternal(out)
 out.writeLong(avgSize)
+out.writeObject(hugeBlockSizesArray)
   }
 
   override def readExternal(in: ObjectInput): Unit = Utils.tryOrIOException {
 loc = BlockManagerId(in)
 emptyBlocks = new RoaringBitmap()
 emptyBlocks.readExternal(in)
 avgSize = in.readLong()
+hugeBlockSizesArray = in.readObject().asInstanceOf[Array[Tuple2[Int, Byte]]]
+hugeBlockSizes = hugeBlockSizesArray.toMap
--- End diff --

Object creation (this()) has already happened; readExternal is restoring 
the state from the stream. So we need to keep this, @cloud-fan.

One issue I have here is that we are duplicating the information between 
hugeBlockSizesArray and hugeBlockSizes.
I would prefer that we drop hugeBlockSizesArray entirely (other than as a 
constructor param we initialize state from).
This will actually result in more efficient serde, at the cost of doing the 
serde for hugeBlockSizes manually, and it handles the corner cases (for 
example, it avoids the need for any null check).
For serialization: write the length, then loop, writing each key as an int 
and each value as a byte; for deserialization, do the reverse.
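
The manual serde outlined above could be sketched as follows, in Java for illustration. `HugeBlockSizesSerde` is a hypothetical name; the methods take `DataOutput` and `DataInput`, which `ObjectOutput` and `ObjectInput` extend, so the same calls would be available inside `writeExternal` and `readExternal`:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: a sketch of length-prefixed (int key, byte value) pairs.
final class HugeBlockSizesSerde {
  /** Writes the entry count, then each reduceId as an int and each size as a byte. */
  static void write(DataOutput out, Map<Integer, Byte> sizes) throws IOException {
    out.writeInt(sizes.size());
    for (Map.Entry<Integer, Byte> entry : sizes.entrySet()) {
      out.writeInt(entry.getKey());
      out.writeByte(entry.getValue());
    }
  }

  /** Reads back what write() produced; an empty map round-trips as a count of 0. */
  static Map<Integer, Byte> read(DataInput in) throws IOException {
    int count = in.readInt();
    Map<Integer, Byte> sizes = new HashMap<>();
    for (int i = 0; i < count; i++) {
      int reduceId = in.readInt();
      byte compressedSize = in.readByte();
      sizes.put(reduceId, compressedSize);
    }
    return sizes;
  }
}
```

Each entry costs 5 bytes on the wire plus a 4-byte count, and because `read` always returns a map, callers need no null check.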





[GitHub] spark issue #17912: [SPARK-20670] [ML] Simplify FPGrowth transform

2017-05-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17912
  
**[Test build #76635 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76635/testReport)**
 for PR 17912 at commit 
[`b9e3e47`](https://github.com/apache/spark/commit/b9e3e47706af2b9b09fa73101487d31a00779dc3).





[GitHub] spark issue #17879: [SPARK-20619][ML] StringIndexer supports multiple ways t...

2017-05-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17879
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76621/
Test PASSed.





[GitHub] spark issue #17879: [SPARK-20619][ML] StringIndexer supports multiple ways t...

2017-05-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17879
  
Merged build finished. Test PASSed.





[GitHub] spark issue #17879: [SPARK-20619][ML] StringIndexer supports multiple ways t...

2017-05-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17879
  
**[Test build #76621 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76621/testReport)**
 for PR 17879 at commit 
[`ff9b1d6`](https://github.com/apache/spark/commit/ff9b1d66873eb8cad1a4a13f323555da2706a849).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17911: [SPARK-20668][SQL] Modify ScalaUDF to handle nullability...

2017-05-08 Thread ueshin
Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/17911
  
cc @gatorsmile





[GitHub] spark issue #17912: [SPARK-20670] [ML] Simplify FPGrowth transform

2017-05-08 Thread hhbyyh
Github user hhbyyh commented on the issue:

https://github.com/apache/spark/pull/17912
  
cc @srowen @jkbradley @felixcheung 





[GitHub] spark issue #17858: [SPARK-20594][SQL]The staging directory should be a chil...

2017-05-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17858
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76617/
Test FAILed.





[GitHub] spark issue #17858: [SPARK-20594][SQL]The staging directory should be a chil...

2017-05-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17858
  
Merged build finished. Test FAILed.





[GitHub] spark issue #17858: [SPARK-20594][SQL]The staging directory should be a chil...

2017-05-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17858
  
**[Test build #76617 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76617/testReport)**
 for PR 17858 at commit 
[`6b22d3e`](https://github.com/apache/spark/commit/6b22d3ea694c4133965ddface73c52c3566cd156).
 * This patch **fails SparkR unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #17912: [SPARK-20670] [ML] Simplify FPGrowth transform

2017-05-08 Thread hhbyyh
GitHub user hhbyyh opened a pull request:

https://github.com/apache/spark/pull/17912

[SPARK-20670] [ML] Simplify FPGrowth transform

## What changes were proposed in this pull request?

As suggested by Sean Owen in https://github.com/apache/spark/pull/17130, 
the transform code in FPGrowthModel can be simplified.

In my tests on public datasets from http://fimi.ua.ac.be/data/, the 
new transform code performs as well as or better than the old 
implementation.
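
For context, the prediction step that an FPGrowth-style transform performs can be sketched in a few lines. This is a minimal illustration in plain Python, not the code from this pull request: `predict_items` and the sample rules are hypothetical names, and the sketch assumes each association rule is an (antecedent, consequent) pair. A rule applies when every antecedent item is in the transaction, and its consequents are added unless the transaction already contains them.

```python
from typing import Iterable, List, Sequence, Tuple

# Hypothetical representation: (antecedent items, consequent items).
Rule = Tuple[Sequence[str], Sequence[str]]


def predict_items(items: Iterable[str], rules: Sequence[Rule]) -> List[str]:
    """Collect consequents of every rule whose antecedent is contained in `items`."""
    item_set = set(items)
    predicted: List[str] = []
    seen = set()
    for antecedent, consequent in rules:
        # A rule is applicable only if all of its antecedent items are present.
        if item_set.issuperset(antecedent):
            for item in consequent:
                # Skip items already in the transaction or already predicted.
                if item not in item_set and item not in seen:
                    seen.add(item)
                    predicted.append(item)
    return predicted


rules = [(["a"], ["b"]), (["a", "c"], ["d"]), (["x"], ["y"])]
print(predict_items(["a", "c"], rules))
```

Scanning every rule for every transaction is the simplest formulation; whether it is fast enough depends on the number of rules, which is presumably why the author benchmarked the change.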

## How was this patch tested?

Existing unit tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/hhbyyh/spark fpgrowthTransform

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17912.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17912









[GitHub] spark issue #16985: [SPARK-19122][SQL] Unnecessary shuffle+sort added if joi...

2017-05-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16985
  
Merged build finished. Test FAILed.





[GitHub] spark issue #16985: [SPARK-19122][SQL] Unnecessary shuffle+sort added if joi...

2017-05-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16985
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/76614/
Test FAILed.





[GitHub] spark issue #17905: [SPARK-20661][SPARKR][TEST][FOLLOWUP] SparkR tableNames(...

2017-05-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17905
  
**[Test build #76634 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76634/testReport)**
 for PR 17905 at commit 
[`b37a760`](https://github.com/apache/spark/commit/b37a760417ea5f9b958a7329dbccd110478821ff).





[GitHub] spark pull request #17905: [SPARK-20661][SPARKR][TEST][FOLLOWUP] SparkR tabl...

2017-05-08 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/17905





[GitHub] spark issue #17910: [SPARK-20669][ML] LogisticRegression family should be ca...

2017-05-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17910
  
**[Test build #76633 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76633/testReport)**
 for PR 17910 at commit 
[`33c0f9e`](https://github.com/apache/spark/commit/33c0f9e52c239a6067a535be9c0ce19772d32aef).





[GitHub] spark issue #16985: [SPARK-19122][SQL] Unnecessary shuffle+sort added if joi...

2017-05-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16985
  
**[Test build #76614 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76614/testReport)**
 for PR 16985 at commit 
[`e202ac1`](https://github.com/apache/spark/commit/e202ac1eda5fd1be3e466eea8975a1b0af54129f).
 * This patch **fails SparkR unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #17911: [SPARK-20668][SQL] Modify ScalaUDF to handle nullability...

2017-05-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17911
  
**[Test build #76632 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76632/testReport)**
 for PR 17911 at commit 
[`120c862`](https://github.com/apache/spark/commit/120c862bada2e8a574f29ea4eb4434a528d59b3b).





[GitHub] spark pull request #17911: [SPARK-20668][SQL] Modify ScalaUDF to handle null...

2017-05-08 Thread ueshin
GitHub user ueshin opened a pull request:

https://github.com/apache/spark/pull/17911

[SPARK-20668][SQL] Modify ScalaUDF to handle nullability.

## What changes were proposed in this pull request?

When registering a Scala UDF, we can determine whether the UDF will return a 
nullable value. `ScalaUDF` and related classes should handle the nullability.
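
The idea of deriving nullability from a function's declared return type can be illustrated by analogy. The sketch below is plain Python and is not Spark's `ScalaUDF` code: `returns_nullable` is a hypothetical helper, and it assumes that a return annotation of `Optional[...]` (or a missing annotation) means the result may be null, while a plain type such as `int` means it cannot.

```python
import types
from typing import Callable, Optional, Union, get_args, get_origin, get_type_hints

# On Python 3.10+, `int | None` has origin types.UnionType; fall back to Union otherwise.
_PEP604_UNION = getattr(types, "UnionType", Union)


def returns_nullable(func: Callable) -> bool:
    """Guess from the return annotation whether `func` may return None."""
    hints = get_type_hints(func)
    if "return" not in hints:
        return True  # no annotation: conservatively assume nullable
    ret = hints["return"]
    if ret is type(None):
        return True
    origin = get_origin(ret)
    if origin is Union or origin is _PEP604_UNION:
        return type(None) in get_args(ret)
    return False


def plus_one(x: int) -> int:
    return x + 1


def maybe_len(s: Optional[str]) -> Optional[int]:
    return None if s is None else len(s)


print(returns_nullable(plus_one), returns_nullable(maybe_len))
```

Treating an unannotated function as nullable is the conservative choice: marking a result non-nullable when it can in fact be null would be the more harmful mistake.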

## How was this patch tested?

Existing tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ueshin/apache-spark issues/SPARK-20668

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17911.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17911


commit 120c862bada2e8a574f29ea4eb4434a528d59b3b
Author: Takuya UESHIN 
Date:   2017-05-05T04:17:18Z

Modify ScalaUDF to handle nullability.







[GitHub] spark issue #17905: [SPARK-20661][SPARKR][TEST][FOLLOWUP] SparkR tableNames(...

2017-05-08 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/17905
  
merged to master/2.2





[GitHub] spark pull request #17910: [SPARK-20669][ML] LogisticRegression family shoul...

2017-05-08 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request:

https://github.com/apache/spark/pull/17910

[SPARK-20669][ML] LogisticRegression family should be case insensitive 

## What changes were proposed in this pull request?
Make the `family` param case-insensitive.
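
A minimal sketch of case-insensitive parameter handling, in plain Python rather than the Scala code in this pull request. `normalize_family` and `SUPPORTED_FAMILIES` are hypothetical names; the sketch assumes the accepted values are `auto`, `binomial`, and `multinomial`, and normalizes the input to lowercase before validating it.

```python
# Assumed set of accepted values, stored in lowercase.
SUPPORTED_FAMILIES = ("auto", "binomial", "multinomial")


def normalize_family(value: str) -> str:
    """Lowercase `value` and check it against the supported families."""
    normalized = value.lower()
    if normalized not in SUPPORTED_FAMILIES:
        raise ValueError(
            f"Unsupported family {value!r}; expected one of {SUPPORTED_FAMILIES}"
        )
    return normalized


print(normalize_family("Binomial"))
```

Normalizing once at the point where the param is read keeps every later comparison a simple exact match.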

## How was this patch tested?
updated tests


@yanboliang 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhengruifeng/spark lr_family_lowercase

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17910.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17910


commit 33c0f9e52c239a6067a535be9c0ce19772d32aef
Author: Zheng RuiFeng 
Date:   2017-05-09T05:43:13Z

create pr







[GitHub] spark issue #17905: [SPARK-20661][SPARKR][TEST][FOLLOWUP] SparkR tableNames(...

2017-05-08 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/17905
  
OK, Jenkins passes. I'm going to merge this in since a bunch of PRs are 
failing because of this, even when they say they're up to date with master.
I'm going to investigate further, though.






[GitHub] spark issue #15435: [SPARK-17139][ML] Add model summary for MultinomialLogis...

2017-05-08 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15435
  
**[Test build #76631 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/76631/testReport)**
 for PR 15435 at commit 
[`449782a`](https://github.com/apache/spark/commit/449782a36ed139919bec6b114938590a383eaf43).




