[GitHub] spark issue #15052: [SPARK-17500][PySpark]Make DiskBytesSpilled metric in Py...

2016-09-11 Thread djvulee
Github user djvulee commented on the issue:

https://github.com/apache/spark/pull/15052
  
@srowen @davies Mind taking a look? This PR is very simple.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/11119
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65237/
Test PASSed.





[GitHub] spark issue #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/11119
  
Merged build finished. Test PASSed.





[GitHub] spark issue #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/9
  
**[Test build #65237 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65237/consoleFull)** for PR 11119 at commit [`78ed9a1`](https://github.com/apache/spark/commit/78ed9a183e123f38929bf2df100c8c1cae375093).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15045: [Spark Core][MINOR] fix "default partitioner cannot part...

2016-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15045
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15045: [Spark Core][MINOR] fix "default partitioner cannot part...

2016-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15045
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65236/
Test PASSed.





[GitHub] spark issue #15045: [Spark Core][MINOR] fix "default partitioner cannot part...

2016-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15045
  
**[Test build #65236 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65236/consoleFull)** for PR 15045 at commit [`f53ad51`](https://github.com/apache/spark/commit/f53ad51cde74429dcd45505d89459d4a9d3a64cb).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...

2016-09-11 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78316060
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -311,8 +350,28 @@ class LogisticRegression @Since("1.2.0") (
 
     val histogram = labelSummarizer.histogram
     val numInvalid = labelSummarizer.countInvalid
-    val numClasses = histogram.length
     val numFeatures = summarizer.mean.size
+    val numFeaturesPlusIntercept = if (getFitIntercept) numFeatures + 1 else numFeatures
+
+    val numClasses = MetadataUtils.getNumClasses(dataset.schema($(labelCol))) match {
+      case Some(n: Int) =>
+        require(n >= histogram.length, s"Specified number of classes $n was " +
+          s"less than the number of unique labels ${histogram.length}.")
+        n
+      case None => histogram.length
+    }
+
+    val isBinaryClassification = numClasses == 1 || numClasses == 2
+    val isMultinomial = $(family) match {
+      case "binomial" =>
+        require(isBinaryClassification, s"Binomial family only supports 1 or 2 " +
+          s"outcome classes but found $numClasses.")
+        false
+      case "multinomial" => true
+      case "auto" => !isBinaryClassification
+      case other => throw new IllegalArgumentException(s"Unsupported family: $other")
+    }
--- End diff --

Both `isBinaryClassification` and `isMultinomial` can be true when `numClasses == 2`. I think it's better to write:

```scala
val isMultinomial = $(family) match {
  case "binomial" =>
    require(numClasses == 1 || numClasses == 2, s"Binomial family only supports 1 or 2 " +
      s"outcome classes but found $numClasses.")
    false
  case "multinomial" => true
  case "auto" => numClasses > 2
  case other => throw new IllegalArgumentException(s"Unsupported family: $other")
}
val isBinomial = !isMultinomial
```







[GitHub] spark issue #13758: [SPARK-16043][SQL] Prepare GenericArrayData implementati...

2016-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13758
  
**[Test build #65239 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65239/consoleFull)** for PR 13758 at commit [`deb363a`](https://github.com/apache/spark/commit/deb363afba6b8b3d2bd82b230ec132eb637c43c6).





[GitHub] spark pull request #15051: [SPARK-17499][ML][MLLib] make the default params ...

2016-09-11 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15051#discussion_r78315909
  
--- Diff: R/pkg/R/mllib.R ---
@@ -694,8 +694,8 @@ setMethod("predict", signature(object = "KMeansModel"),
 #' }
 #' @note spark.mlp since 2.1.0
 setMethod("spark.mlp", signature(data = "SparkDataFrame"),
-          function(data, blockSize = 128, layers = c(3, 5, 2), solver = "l-bfgs", maxIter = 100,
-                   tol = 0.5, stepSize = 1, seed = 1) {
+          function(data, blockSize = 128, layers, solver = "l-bfgs", maxIter = 100,
+                   tol = 1E-6, stepSize = 0.03, seed = -763139545) {
--- End diff --

Yeah, it is a problem. Now I am considering a better way:

We give the `seed` parameter the default value `null`, and
`MultilayerPerceptronClassifierWrapper.fit` adds a `null` check for the
`seed` parameter: if it is null, it does not call
`MultilayerPerceptronClassifier.setSeed`, so the estimator automatically
uses its default seed.

What do you think about it?
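A minimal sketch of that idea, assuming the wrapper receives the R-side seed as a nullable `String` (the signature below is illustrative, not the wrapper's actual one):

```scala
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.sql.Dataset

// Hypothetical wrapper-side fit: only forward the seed when the R caller
// actually supplied one, so the estimator keeps its own default otherwise.
def fit(data: Dataset[_], layers: Array[Int], seed: String) = {
  val mlp = new MultilayerPerceptronClassifier().setLayers(layers)
  if (seed != null) mlp.setSeed(seed.toLong)
  mlp.fit(data)
}
```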






[GitHub] spark pull request #15051: [SPARK-17499][ML][MLLib] make the default params ...

2016-09-11 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15051#discussion_r78315763
  
--- Diff: R/pkg/R/mllib.R ---
@@ -694,8 +694,8 @@ setMethod("predict", signature(object = "KMeansModel"),
 #' }
 #' @note spark.mlp since 2.1.0
 setMethod("spark.mlp", signature(data = "SparkDataFrame"),
-          function(data, blockSize = 128, layers = c(3, 5, 2), solver = "l-bfgs", maxIter = 100,
-                   tol = 0.5, stepSize = 1, seed = 1) {
+          function(data, blockSize = 128, layers, solver = "l-bfgs", maxIter = 100,
--- End diff --

All right, I will add a `layers` parameter validation check and move this
parameter to the front. Thanks!

Also, the `layers` parameter must be set, because it determines the structure
of this classifier. If we do not set `layers`, or set it to an empty list
`c()`, training will throw an exception; you can check the input validation
test in `MultilayerPerceptronClassifierSuite`.
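For context, the exception comes from the Scala side's param validation; a minimal sketch of that validation style, using `ParamValidators.arrayLengthGt` from `org.apache.spark.ml.param` (the trait itself is illustrative, not the real `MultilayerPerceptronParams`):

```scala
import org.apache.spark.ml.param.{IntArrayParam, Params, ParamValidators}

// Illustrative stand-in for the estimator's layers param: requiring
// length > 1 makes an unset or empty `layers` from R fail fast instead
// of silently building a malformed network.
trait HasLayers extends Params {
  final val layers: IntArrayParam = new IntArrayParam(this, "layers",
    "sizes of layers from input to output", ParamValidators.arrayLengthGt(1))
}
```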





[GitHub] spark issue #13758: [SPARK-16043][SQL] Prepare GenericArrayData implementati...

2016-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13758
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65238/
Test FAILed.





[GitHub] spark issue #13758: [SPARK-16043][SQL] Prepare GenericArrayData implementati...

2016-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13758
  
**[Test build #65238 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65238/consoleFull)** for PR 13758 at commit [`56d6730`](https://github.com/apache/spark/commit/56d6730276e9270d3be10be77ab22da856cbac45).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class GenericArrayData(val array: Array[Any],`





[GitHub] spark issue #13758: [SPARK-16043][SQL] Prepare GenericArrayData implementati...

2016-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/13758
  
Merged build finished. Test FAILed.





[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...

2016-09-11 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78315600
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -261,6 +299,7 @@ class LogisticRegression @Since("1.2.0") (
* If the dimensions of features or the number of partitions are large,
* this param could be adjusted to a larger size.
* Default is 2.
+ *
--- End diff --

the indentation is off.





[GitHub] spark issue #13758: [SPARK-16043][SQL] Prepare GenericArrayData implementati...

2016-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/13758
  
**[Test build #65238 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65238/consoleFull)** for PR 13758 at commit [`56d6730`](https://github.com/apache/spark/commit/56d6730276e9270d3be10be77ab22da856cbac45).





[GitHub] spark pull request #15038: [SPARK-17486] Remove unused TaskMetricsUIData.upd...

2016-09-11 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/15038





[GitHub] spark issue #7266: [SPARK-8764][ML] string indexer should take option to han...

2016-09-11 Thread miro-balaz
Github user miro-balaz commented on the issue:

https://github.com/apache/spark/pull/7266
  
Thank you for the directions.

On Monday, 12 September 2016, Holden Karau wrote:

> @miro-balaz: This probably isn't the best place for a new feature request -
> but if you head over to the ASF JIRA you can create a new ticket and cc the
> people who worked on this.






[GitHub] spark issue #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/11119
  
**[Test build #65237 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65237/consoleFull)** for PR 11119 at commit [`78ed9a1`](https://github.com/apache/spark/commit/78ed9a183e123f38929bf2df100c8c1cae375093).





[GitHub] spark issue #15038: [SPARK-17486] Remove unused TaskMetricsUIData.updatedBlo...

2016-09-11 Thread zsxwing
Github user zsxwing commented on the issue:

https://github.com/apache/spark/pull/15038
  
LGTM. Merging to master and 2.0





[GitHub] spark issue #14452: [SPARK-16849][SQL] Improve subquery execution by dedupli...

2016-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14452
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65235/
Test PASSed.





[GitHub] spark issue #14452: [SPARK-16849][SQL] Improve subquery execution by dedupli...

2016-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14452
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14452: [SPARK-16849][SQL] Improve subquery execution by dedupli...

2016-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14452
  
**[Test build #65235 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65235/consoleFull)** for PR 14452 at commit [`64ff37b`](https://github.com/apache/spark/commit/64ff37bcf8cbef34f9f0dbb87e5c33d20b6e04da).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14912: [SPARK-17357][SQL] Fix current predicate pushdown

2016-09-11 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/14912
  
Maintaining the predicate sets may add a lot of complexity, as far as I can
tell. I don't know how big the set could be, but once you change one of the
predicates, you need to reconstruct all equivalent predicates in the set too.
I think we can maintain the CNF and the simplified predicates instead: CNF
should be enough to push down predicates, and the simplified predicate can be
used in Filter execution.
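For illustration, here is a minimal sketch of the CNF distribution step that makes per-side conjuncts pushable; the mini-AST and `toCnf` are hypothetical, not Catalyst code:

```scala
sealed trait Pred
case class Atom(name: String) extends Pred
case class And(l: Pred, r: Pred) extends Pred
case class Or(l: Pred, r: Pred) extends Pred

// Distribute OR over AND: (a AND b) OR c  =>  (a OR c) AND (b OR c).
// Each top-level conjunct of the result can then be pushed down on its own.
def toCnf(p: Pred): Pred = p match {
  case Or(And(a, b), c) => And(toCnf(Or(a, c)), toCnf(Or(b, c)))
  case Or(c, And(a, b)) => And(toCnf(Or(c, a)), toCnf(Or(c, b)))
  case And(l, r) => And(toCnf(l), toCnf(r))
  case Or(l, r) =>
    val (cl, cr) = (toCnf(l), toCnf(r))
    // Re-check after normalizing the children, in case an AND surfaced.
    if (cl.isInstanceOf[And] || cr.isInstanceOf[And]) toCnf(Or(cl, cr)) else Or(cl, cr)
  case atom => atom
}

// toCnf(Or(And(Atom("a"), Atom("b")), Atom("c")))
//   => And(Or(Atom("a"), Atom("c")), Or(Atom("b"), Atom("c")))
```

The usual caveat is that CNF conversion can blow up in size for deeply nested disjunctions, which is another reason to keep only the CNF plus a simplified form rather than every equivalent predicate.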





[GitHub] spark issue #15045: [Spark Core][MINOR] fix "default partitioner cannot part...

2016-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15045
  
**[Test build #65236 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65236/consoleFull)** for PR 15045 at commit [`f53ad51`](https://github.com/apache/spark/commit/f53ad51cde74429dcd45505d89459d4a9d3a64cb).





[GitHub] spark issue #15045: [Spark Core][MINOR] fix "default partitioner cannot part...

2016-09-11 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15045
  
Jenkins, test this please.





[GitHub] spark issue #14947: [SPARK-17388][SQL] Support for inferring type date/times...

2016-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14947
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65234/
Test PASSed.





[GitHub] spark issue #14947: [SPARK-17388][SQL] Support for inferring type date/times...

2016-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14947
  
**[Test build #65234 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65234/consoleFull)** for PR 14947 at commit [`e9dea77`](https://github.com/apache/spark/commit/e9dea77d0b2fdf08cff0ad30cb081ae48c90ba94).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #14947: [SPARK-17388][SQL] Support for inferring type date/times...

2016-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14947
  
Merged build finished. Test PASSed.





[GitHub] spark pull request #15051: [SPARK-17499][ML][MLLib] make the default params ...

2016-09-11 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/15051#discussion_r78310476
  
--- Diff: R/pkg/R/mllib.R ---
@@ -694,8 +694,8 @@ setMethod("predict", signature(object = "KMeansModel"),
 #' }
 #' @note spark.mlp since 2.1.0
 setMethod("spark.mlp", signature(data = "SparkDataFrame"),
-          function(data, blockSize = 128, layers = c(3, 5, 2), solver = "l-bfgs", maxIter = 100,
-                   tol = 0.5, stepSize = 1, seed = 1) {
+          function(data, blockSize = 128, layers, solver = "l-bfgs", maxIter = 100,
+                   tol = 1E-6, stepSize = 0.03, seed = -763139545) {
--- End diff --

Hmm, this seems rather fragile. Do you think there's another way to do this?





[GitHub] spark pull request #15051: [SPARK-17499][ML][MLLib] make the default params ...

2016-09-11 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/15051#discussion_r78310368
  
--- Diff: R/pkg/R/mllib.R ---
@@ -694,8 +694,8 @@ setMethod("predict", signature(object = "KMeansModel"),
 #' }
 #' @note spark.mlp since 2.1.0
 setMethod("spark.mlp", signature(data = "SparkDataFrame"),
-          function(data, blockSize = 128, layers = c(3, 5, 2), solver = "l-bfgs", maxIter = 100,
-                   tol = 0.5, stepSize = 1, seed = 1) {
+          function(data, blockSize = 128, layers, solver = "l-bfgs", maxIter = 100,
--- End diff --

If the goal is to require `layers` to have a value (I didn't realize this
from the PR description), then we should have `layers` as the 2nd parameter
(after `data`) without any default value.

We should also make sure that when `layers` is later coerced to an array,
its values are coerced into integers?

```
> a <- list(1, 2, "a")
> as.integer(a)
[1]  1  2 NA
Warning message:
NAs introduced by coercion
```
 





[GitHub] spark issue #14452: [SPARK-16849][SQL] Improve subquery execution by dedupli...

2016-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14452
  
**[Test build #65235 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65235/consoleFull)** for PR 14452 at commit [`64ff37b`](https://github.com/apache/spark/commit/64ff37bcf8cbef34f9f0dbb87e5c33d20b6e04da).





[GitHub] spark pull request #14988: [SPARK-17425][SQL] Override sameResult in HiveTab...

2016-09-11 Thread watermen
Github user watermen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14988#discussion_r78309372
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala ---
@@ -164,4 +164,11 @@ case class HiveTableScanExec(
   }
 
   override def output: Seq[Attribute] = attributes
+
+  override def sameResult(plan: SparkPlan): Boolean = plan match {
--- End diff --

`ReuseExchange` works for the parquet/orc formats because `FileSourceScanExec`
has overridden `sameResult`. If `left.cleanArgs == right.cleanArgs` returns
false, we never get to run the next check,
`(left.children, right.children).zipped.forall(_ sameResult _)`.
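For reference, a rough sketch of the shape such an override could take. Only the signature comes from the diff excerpt above; the comparison logic below is hypothetical, not the PR's actual implementation, and `relation` and `partitionPruningPred` are assumed fields of `HiveTableScanExec` (only `attributes` appears in the excerpt):

```scala
// Hypothetical comparison logic: compare the scan's own arguments directly
// instead of relying on the default cleanArgs equality, which evaluates to
// false for these plans and so prevents ReuseExchange from kicking in.
override def sameResult(plan: SparkPlan): Boolean = plan match {
  case other: HiveTableScanExec =>
    attributes.map(_.name) == other.attributes.map(_.name) &&
      relation.sameResult(other.relation) &&
      partitionPruningPred == other.partitionPruningPred
  case _ => false
}
```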






[GitHub] spark pull request #15051: [SPARK-17499][ML][MLLib] make the default params ...

2016-09-11 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15051#discussion_r78309230
  
--- Diff: R/pkg/R/mllib.R ---
@@ -694,8 +694,8 @@ setMethod("predict", signature(object = "KMeansModel"),
 #' }
 #' @note spark.mlp since 2.1.0
 setMethod("spark.mlp", signature(data = "SparkDataFrame"),
-  function(data, blockSize = 128, layers = c(3, 5, 2), solver = 
"l-bfgs", maxIter = 100,
-   tol = 0.5, stepSize = 1, seed = 1) {
+  function(data, blockSize = 128, layers, solver = "l-bfgs", 
maxIter = 100,
--- End diff --

@felixcheung But `c()` is an invalid value for the `layers` parameter, so I
think it is better not to specify a default value for `layers`; that way the
user must specify this parameter.





[GitHub] spark issue #15048: [SPARK-17409] [SQL] Do Not Optimize Query in CTAS More T...

2016-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15048
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15048: [SPARK-17409] [SQL] Do Not Optimize Query in CTAS More T...

2016-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15048
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65233/
Test PASSed.





[GitHub] spark issue #15048: [SPARK-17409] [SQL] Do Not Optimize Query in CTAS More T...

2016-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15048
  
**[Test build #65233 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65233/consoleFull)** for PR 15048 at commit [`ae335ae`](https://github.com/apache/spark/commit/ae335ae05dda586a86e39c82ce4f8cdcf5aaa6d0).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `trait Command extends LeafNode `
  * `trait RunnableCommand extends logical.Command `
  * `case class CreateTable(`





[GitHub] spark pull request #15051: [SPARK-17499][ML][MLLib] make the default params ...

2016-09-11 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15051#discussion_r78308116
  
--- Diff: R/pkg/R/mllib.R ---
@@ -694,8 +694,8 @@ setMethod("predict", signature(object = "KMeansModel"),
 #' }
 #' @note spark.mlp since 2.1.0
 setMethod("spark.mlp", signature(data = "SparkDataFrame"),
-          function(data, blockSize = 128, layers = c(3, 5, 2), solver = "l-bfgs", maxIter = 100,
-                   tol = 0.5, stepSize = 1, seed = 1) {
+          function(data, blockSize = 128, layers, solver = "l-bfgs", maxIter = 100,
--- End diff --

@shivaram Yeah, but I think changing the parameter order may break API
compatibility?





[GitHub] spark issue #14388: [SPARK-16362][SQL] Support ArrayType and StructType in v...

2016-09-11 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/14388
  
@mallman Not yet. I am currently working on another PR; I will come back to
this once that one is resolved.





[GitHub] spark pull request #15051: [SPARK-17499][ML][MLLib] make the default params ...

2016-09-11 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/15051#discussion_r78307552
  
--- Diff: R/pkg/R/mllib.R ---
@@ -694,8 +694,8 @@ setMethod("predict", signature(object = "KMeansModel"),
 #' }
 #' @note spark.mlp since 2.1.0
 setMethod("spark.mlp", signature(data = "SparkDataFrame"),
-          function(data, blockSize = 128, layers = c(3, 5, 2), solver = "l-bfgs", maxIter = 100,
-                   tol = 0.5, stepSize = 1, seed = 1) {
+          function(data, blockSize = 128, layers, solver = "l-bfgs", maxIter = 100,
+                   tol = 1E-6, stepSize = 0.03, seed = -763139545) {
--- End diff --

Oh, the default seed uses the class name's hashCode, so here it is
`"org.apache.spark.ml.classification.MultilayerPerceptronClassifier".hashCode()`,
which equals -763139545.
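A quick way to check the value claimed above from a Scala REPL:

```scala
// Per the comment above, the default seed is the estimator class name's
// hashCode (as a Long):
val defaultSeed =
  "org.apache.spark.ml.classification.MultilayerPerceptronClassifier".hashCode.toLong
// Expected, per this thread: -763139545
```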





[GitHub] spark issue #14947: [SPARK-17388][SQL] Support for inferring type date/times...

2016-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14947
  
**[Test build #65234 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65234/consoleFull)** for PR 14947 at commit [`e9dea77`](https://github.com/apache/spark/commit/e9dea77d0b2fdf08cff0ad30cb081ae48c90ba94).





[GitHub] spark pull request #14947: [SPARK-17388][SQL] Support for inferring type dat...

2016-09-11 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/14947#discussion_r78306768
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
@@ -307,20 +308,34 @@ object PartitioningUtils {
 
   /**
    * Converts a string to a [[Literal]] with automatic type inference. Currently only supports
-   * [[IntegerType]], [[LongType]], [[DoubleType]], [[DecimalType.SYSTEM_DEFAULT]], and
-   * [[StringType]].
+   * [[IntegerType]], [[LongType]], [[DoubleType]], [[DecimalType.SYSTEM_DEFAULT]], [[DateType]]
+   * [[TimestampType]], and [[StringType]].
    */
   private[datasources] def inferPartitionColumnValue(
       raw: String,
       defaultPartitionName: String,
       typeInference: Boolean): Literal = {
+    val decimalTry = Try {
+      // `BigDecimal` conversion can fail when the `field` is not a form of number.
+      val bigDecimal = new JBigDecimal(raw)
+      // It reduces the cases for decimals by disallowing values having scale (eg. `1.1`).
+      require(bigDecimal.scale <= 0)
+      // `DecimalType` conversion can fail when
+      //   1. The precision is bigger than 38.
+      //   2. scale is bigger than precision.
--- End diff --

Checked and I added some more end-to-end tests.
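For illustration, a standalone sketch of the guard quoted in the diff; the helper name `inferDecimal` is hypothetical, and the real method continues past the excerpt:

```scala
import java.math.{BigDecimal => JBigDecimal}
import scala.util.Try

// Accepts only integral-looking strings that can fit in a DecimalType,
// mirroring the checks quoted above.
def inferDecimal(raw: String): Option[JBigDecimal] = Try {
  val bigDecimal = new JBigDecimal(raw) // throws on non-numbers
  require(bigDecimal.scale <= 0)        // disallow values with scale, e.g. "1.1"
  require(bigDecimal.precision <= 38)   // DecimalType's maximum precision
  bigDecimal
}.toOption

// inferDecimal("123456789012345678901234567890") => Some(...)
// inferDecimal("1.1")                            => None
// inferDecimal("not-a-number")                   => None
```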





[GitHub] spark issue #7266: [SPARK-8764][ML] string indexer should take option to han...

2016-09-11 Thread holdenk
Github user holdenk commented on the issue:

https://github.com/apache/spark/pull/7266
  
@miro-balaz : This probably isn't the best place for a new feature request 
- but if you head over to the ASF JIRA you can create a new ticket and cc the 
people who worked on this.





[GitHub] spark pull request #14971: [SPARK-17410] [SPARK-17284] Move Hive-generated S...

2016-09-11 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14971#discussion_r78305536
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -259,6 +259,156 @@ class StatisticsSuite extends QueryTest with TestHiveSingleton with SQLTestUtils
     }
   }
 
+  private def createNonPartitionedTable(tabName: String): Option[Statistics] = {
+    val hiveClient = spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client
+    sql(
+      s"""
+         |CREATE TABLE $tabName (key STRING, value STRING)
+         |STORED AS TEXTFILE
+         |TBLPROPERTIES ('prop1' = 'val1', 'prop2' = 'val2')
+       """.stripMargin)
+    sql(s"INSERT INTO TABLE $tabName SELECT * FROM src")
+    sql(s"ANALYZE TABLE $tabName COMPUTE STATISTICS")
+    hiveClient.runSqlHive(s"ANALYZE TABLE $tabName COMPUTE STATISTICS")
+    val describeResult1 = hiveClient.runSqlHive(s"DESCRIBE FORMATTED $tabName")
+
+    val tableMetadata =
+      spark.sessionState.catalog.getTableMetadata(TableIdentifier(tabName)).properties
+    // statistics info is not contained in the metadata of the original table
+    assert(Seq(StatsSetupConst.COLUMN_STATS_ACCURATE,
+      StatsSetupConst.NUM_FILES,
+      StatsSetupConst.NUM_PARTITIONS,
+      StatsSetupConst.ROW_COUNT,
+      StatsSetupConst.RAW_DATA_SIZE,
+      StatsSetupConst.TOTAL_SIZE).forall(!tableMetadata.contains(_)))
+
+    assert(StringUtils.filterPattern(describeResult1, "*numRows\\s+500*").nonEmpty)
+    checkStats(
+      tabName, isDataSourceTable = false, hasSizeInBytes = true, expectedRowCounts = Some(500))
+  }
+
+  test("alter table rename after analyze table") {
--- End diff --

We can add more ALTER TABLE commands, if you think it is necessary.





[GitHub] spark pull request #14971: [SPARK-17410] [SPARK-17284] Move Hive-generated S...

2016-09-11 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/14971#discussion_r78305513
  
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala ---
@@ -259,6 +259,156 @@ class StatisticsSuite extends QueryTest with TestHiveSingleton with SQLTestUtils
     }
   }
 
+  private def createNonPartitionedTable(tabName: String): Option[Statistics] = {
+    val hiveClient = spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client
+    sql(
+      s"""
+         |CREATE TABLE $tabName (key STRING, value STRING)
+         |STORED AS TEXTFILE
+         |TBLPROPERTIES ('prop1' = 'val1', 'prop2' = 'val2')
+       """.stripMargin)
+    sql(s"INSERT INTO TABLE $tabName SELECT * FROM src")
+    sql(s"ANALYZE TABLE $tabName COMPUTE STATISTICS")
+    hiveClient.runSqlHive(s"ANALYZE TABLE $tabName COMPUTE STATISTICS")
+    val describeResult1 = hiveClient.runSqlHive(s"DESCRIBE FORMATTED $tabName")
+
+    val tableMetadata =
+      spark.sessionState.catalog.getTableMetadata(TableIdentifier(tabName)).properties
+    // statistics info is not contained in the metadata of the original table
+    assert(Seq(StatsSetupConst.COLUMN_STATS_ACCURATE,
+      StatsSetupConst.NUM_FILES,
+      StatsSetupConst.NUM_PARTITIONS,
+      StatsSetupConst.ROW_COUNT,
+      StatsSetupConst.RAW_DATA_SIZE,
+      StatsSetupConst.TOTAL_SIZE).forall(!tableMetadata.contains(_)))
+
+    assert(StringUtils.filterPattern(describeResult1, "*numRows\\s+500*").nonEmpty)
+    checkStats(
+      tabName, isDataSourceTable = false, hasSizeInBytes = true, expectedRowCounts = Some(500))
+  }
+
+  test("alter table rename after analyze table") {
+    val oldName = "tab1"
+    val newName = "tab2"
+    withTable(oldName, newName) {
+      val fetchedStats1 = createNonPartitionedTable(oldName)
+      sql(s"ALTER TABLE $oldName RENAME TO $newName")
+      val fetchedStats2 = checkStats(
+        newName, isDataSourceTable = false, hasSizeInBytes = true, expectedRowCounts = Some(500))
+      assert(fetchedStats1 == fetchedStats2)
+
+      // ALTER TABLE RENAME does not affect the contents of Hive specific statistics
+      val hiveClient = spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client
+      val describeResult = hiveClient.runSqlHive(s"DESCRIBE FORMATTED $newName")
+      assert(StringUtils.filterPattern(describeResult, "*numRows\\s+500*").nonEmpty)
+    }
+  }
+
+  test("alter table SET TBLPROPERTIES after analyze table") {
+    val tabName = "tab1"
+    withTable(tabName) {
+      val fetchedStats1 = createNonPartitionedTable(tabName)
+
+      sql(s"ALTER TABLE $tabName SET TBLPROPERTIES ('foo' = 'a')")
+      val fetchedStats2 = checkStats(
+        tabName, isDataSourceTable = false, hasSizeInBytes = true, expectedRowCounts = Some(500))
+      assert(fetchedStats1 == fetchedStats2)
+
+      // ALTER TABLE SET TBLPROPERTIES invalidates some contents of Hive specific statistics
+      // This is triggered by the Hive alterTable API
+      val hiveClient = spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client
+      val describeResult = hiveClient.runSqlHive(s"DESCRIBE FORMATTED $tabName")
+      assert(StringUtils.filterPattern(describeResult, "*numRows\\s+-1*").nonEmpty)
+    }
+  }
+
+  test("alter table UNSET TBLPROPERTIES after analyze table") {
+    val tabName = "tab1"
+    withTable(tabName) {
+      val fetchedStats1 = createNonPartitionedTable(tabName)
+
+      sql(s"ALTER TABLE $tabName UNSET TBLPROPERTIES ('prop1')")
+      val fetchedStats2 = checkStats(
+        tabName, isDataSourceTable = false, hasSizeInBytes = true, expectedRowCounts = Some(500))
+      assert(fetchedStats1 == fetchedStats2)
+
+      // ALTER TABLE UNSET TBLPROPERTIES invalidates some contents of Hive specific statistics
+      // This is triggered by the Hive alterTable API
+      val hiveClient = spark.sharedState.externalCatalog.asInstanceOf[HiveExternalCatalog].client
+      val describeResult = hiveClient.runSqlHive(s"DESCRIBE FORMATTED $tabName")
+      assert(StringUtils.filterPattern(describeResult, "*numRows\\s+-1*").nonEmpty)
+    }
+  }
+
+  test("add/drop partitions - managed table") {
--- End diff --

FYI, when we drop partitions of EXTERNAL tables, `ANALYZE TABLE` is unable to
exclude them from the statistics. This should be fixed by
https://issues.apache.org/jira/browse/SPARK-17129, if my understanding is
right.



[GitHub] spark issue #15048: [SPARK-17409] [SQL] Do Not Optimize Query in CTAS More T...

2016-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15048
  
**[Test build #65233 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65233/consoleFull)** for PR 15048 at commit [`ae335ae`](https://github.com/apache/spark/commit/ae335ae05dda586a86e39c82ce4f8cdcf5aaa6d0).





[GitHub] spark issue #14971: [SPARK-17410] [SPARK-17284] Move Hive-generated Stats In...

2016-09-11 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/14971
  
cc @hvanhovell @cloud-fan. Now the code is ready for review.





[GitHub] spark issue #14971: [SPARK-17410] [SPARK-17284] Move Hive-generated Stats In...

2016-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14971
  
Merged build finished. Test PASSed.





[GitHub] spark issue #14971: [SPARK-17410] [SPARK-17284] Move Hive-generated Stats In...

2016-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14971
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65232/
Test PASSed.





[GitHub] spark issue #14971: [SPARK-17410] [SPARK-17284] Move Hive-generated Stats In...

2016-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14971
  
**[Test build #65232 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65232/consoleFull)** for PR 14971 at commit [`9e18ba1`](https://github.com/apache/spark/commit/9e18ba104527d2bb14331f4b51194002dabb2556).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #15045: [Spark Core][MINOR] fix "default partitioner cannot part...

2016-09-11 Thread WeichenXu123
Github user WeichenXu123 commented on the issue:

https://github.com/apache/spark/pull/15045
  
jenkins test please





[GitHub] spark issue #15028: [SPARK-17336][PYSPARK] Fix appending multiple times to P...

2016-09-11 Thread holdenk
Github user holdenk commented on the issue:

https://github.com/apache/spark/pull/15028
  
Since the search order is defined, the old behavior probably worked across
versions (albeit in an ugly fashion). I'll follow up with some checks for
spark-perf and fix things there if necessary, since I think that's really the
main place which _might_ have been dependent on this behavior.





[GitHub] spark pull request #14947: [WIP][SPARK-17388][SQL] Support for inferring typ...

2016-09-11 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/14947#discussion_r78304894
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
@@ -307,20 +308,34 @@ object PartitioningUtils {
 
   /**
    * Converts a string to a [[Literal]] with automatic type inference. Currently only supports
-   * [[IntegerType]], [[LongType]], [[DoubleType]], [[DecimalType.SYSTEM_DEFAULT]], and
-   * [[StringType]].
+   * [[IntegerType]], [[LongType]], [[DoubleType]], [[DecimalType.SYSTEM_DEFAULT]], [[DateType]]
+   * [[TimestampType]], and [[StringType]].
    */
   private[datasources] def inferPartitionColumnValue(
       raw: String,
       defaultPartitionName: String,
       typeInference: Boolean): Literal = {
+    val decimalTry = Try {
+      // `BigDecimal` conversion can fail when the `field` is not a form of number.
+      val bigDecimal = new JBigDecimal(raw)
+      // It reduces the cases for decimals by disallowing values having scale (eg. `1.1`).
+      require(bigDecimal.scale <= 0)
+      // `DecimalType` conversion can fail when
+      //   1. The precision is bigger than 38.
+      //   2. scale is bigger than precision.
--- End diff --

Ah, I think I should check this requirement. I will update the description 
soon too.
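
For reference, a minimal standalone sketch of the guard under discussion (hedged: it assumes the usual `DecimalType` maximum precision of 38 and is not the PR's exact code):

```scala
import java.math.{BigDecimal => JBigDecimal}
import scala.util.Try

// Infer a decimal only for integral-looking strings whose precision fits
// within the assumed DecimalType maximum of 38 digits.
def inferDecimal(raw: String): Option[JBigDecimal] = Try {
  val bigDecimal = new JBigDecimal(raw)
  require(bigDecimal.scale <= 0)       // reject values with a scale, e.g. "1.1"
  require(bigDecimal.precision <= 38)  // reject over-long precision
  bigDecimal
}.toOption

// inferDecimal("12")      => Some(12)
// inferDecimal("1.1")     => None (non-zero scale)
// inferDecimal("1" * 39)  => None (precision > 38)
```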


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14828: [SPARK-17258][SQL] Parse scientific decimal literals as ...

2016-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14828
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65230/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14828: [SPARK-17258][SQL] Parse scientific decimal literals as ...

2016-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14828
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14828: [SPARK-17258][SQL] Parse scientific decimal literals as ...

2016-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14828
  
**[Test build #65230 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65230/consoleFull)**
 for PR 14828 at commit 
[`44c1b4b`](https://github.com/apache/spark/commit/44c1b4b76e823e0330cd167cf8a374a751772040).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `public class LevelDBProvider `
  * `  public static class StoreVersion `
  * `  public static class AppId `
  * `case class SlaveLost(_message: String = "Slave lost", workerLost: 
Boolean = false)`
  * `case class CheckCartesianProducts(conf: CatalystConf)`
  * `sealed abstract class InnerLike extends JoinType `
  * `case class Statistics(`
  * `case class AnalyzeTableCommand(tableName: String, noscan: Boolean = 
true) extends RunnableCommand `


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14083: [SPARK-16406][SQL] Improve performance of LogicalPlan.re...

2016-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14083
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14083: [SPARK-16406][SQL] Improve performance of LogicalPlan.re...

2016-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14083
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65231/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14083: [SPARK-16406][SQL] Improve performance of LogicalPlan.re...

2016-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14083
  
**[Test build #65231 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65231/consoleFull)**
 for PR 14083 at commit 
[`a1e5312`](https://github.com/apache/spark/commit/a1e5312c601a63f9feddc8980c795db236e6c735).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15048: [SPARK-17409] [SQL] Do Not Optimize Query in CTAS More T...

2016-09-11 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/15048
  
@hvanhovell Sure, will do it. Thanks! 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14971: [SPARK-17410] [SPARK-17284] Move Hive-generated Stats In...

2016-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14971
  
**[Test build #65232 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65232/consoleFull)**
 for PR 14971 at commit 
[`9e18ba1`](https://github.com/apache/spark/commit/9e18ba104527d2bb14331f4b51194002dabb2556).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15048: [SPARK-17409] [SQL] Do Not Optimize Query in CTAS More T...

2016-09-11 Thread hvanhovell
Github user hvanhovell commented on the issue:

https://github.com/apache/spark/pull/15048
  
@gatorsmile so should we check all commands? It might also be an idea to 
have `Command` extend `LeafNode` (and make `children` final). I think @davies 
did something similar for https://github.com/apache/spark/pull/14797. 
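
For illustration, a minimal sketch of the idea (simplified names, not the actual Catalyst hierarchy): making `Command` a leaf node keeps any embedded query as a plain field, so tree transformations cannot recurse into it.

```scala
// Hedged sketch, not Spark's real classes.
trait LogicalPlan { def children: Seq[LogicalPlan] }

trait LeafNode extends LogicalPlan {
  // Pinning children to Nil stops tree transformations from descending.
  final override def children: Seq[LogicalPlan] = Nil
}

// The CTAS query is held as a field rather than a child, so the analyzer
// and optimizer leave it alone until the command itself runs.
case class CreateTableAsSelectSketch(query: LogicalPlan) extends LeafNode
```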


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15053: [Doc] improve python API docstrings

2016-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15053
  
Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15053: [Doc] improve python API docstrings

2016-09-11 Thread mortada
GitHub user mortada opened a pull request:

https://github.com/apache/spark/pull/15053

[Doc] improve python API docstrings

## What changes were proposed in this pull request?

A lot of the Python API functions show example usage that is incomplete: the 
docstring shows output without the input DataFrame being defined, which can 
make the example quite confusing to follow. This PR fixes the docstrings.

## How was this patch tested?

docs changes only




You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mortada/spark python_docstring

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15053.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15053


commit 52240bcf8df42dd454e874ce7640d7040c5cdad9
Author: Mortada Mehyar 
Date:   2016-09-11T20:28:54Z

[Doc] improve python API docstrings




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14828: [SPARK-17258][SQL] Parse scientific decimal literals as ...

2016-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14828
  
**[Test build #65230 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65230/consoleFull)**
 for PR 14828 at commit 
[`44c1b4b`](https://github.com/apache/spark/commit/44c1b4b76e823e0330cd167cf8a374a751772040).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14083: [SPARK-16406][SQL] Improve performance of LogicalPlan.re...

2016-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/14083
  
**[Test build #65231 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65231/consoleFull)**
 for PR 14083 at commit 
[`a1e5312`](https://github.com/apache/spark/commit/a1e5312c601a63f9feddc8980c795db236e6c735).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14083: [SPARK-16406][SQL] Improve performance of LogicalPlan.re...

2016-09-11 Thread hvanhovell
Github user hvanhovell commented on the issue:

https://github.com/apache/spark/pull/14083
  
@JoshRosen I have moved the implementation into `AttributeSeq`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15045: [Spark Core][MINOR] fix "default partitioner cannot part...

2016-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15045
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65227/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15045: [Spark Core][MINOR] fix "default partitioner cannot part...

2016-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15045
  
**[Test build #65227 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65227/consoleFull)**
 for PR 15045 at commit 
[`25f1f8c`](https://github.com/apache/spark/commit/25f1f8c0f7d507c3cd515d132726ea50e018c5e5).
 * This patch **fails from timeout after a configured wait of `250m`**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15045: [Spark Core][MINOR] fix "default partitioner cannot part...

2016-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15045
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15052: [SPARK-17500][PySpark]Make DiskBytesSpilled metric in Py...

2016-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15052
  
Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15052: [SPARK-17500][PySpark]Make DiskBytesSpilled metri...

2016-09-11 Thread djvulee
GitHub user djvulee opened a pull request:

https://github.com/apache/spark/pull/15052

[SPARK-17500][PySpark]Make DiskBytesSpilled metric in PySpark shuffle right

## What changes were proposed in this pull request?

The original code increases the DiskBytesSpilled metric by the file size 
during each spill in ExternalMerger and ExternalGroupBy, but we only need the 
last size.

## How was this patch tested?

No extra tests, because this only updates the metric.
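
A minimal sketch of the distinction the PR draws (illustrative numbers, assuming each spill reports the current size of the same on-disk file):

```scala
// Sizes of the spill file observed at each successive spill.
val spillFileSizes = Seq(10L, 25L, 40L)

val overCounted = spillFileSizes.sum   // 75: summing every observation over-counts
val actualBytes = spillFileSizes.last  // 40: the last size is what is on disk
```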

Author: Li Hu 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/djvulee/spark PyDiskSpillMetric

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15052.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15052


commit 1b90b0dd61c22ffba6d578f73cf5aca88629b1be
Author: DjvuLee 
Date:   2016-09-11T19:41:32Z

Make DiskBytesSpilled metric in PySpark shuffle right

The original code increases the DiskBytesSpilled metric by the file
size during each spill in ExternalMerger and ExternalGroupBy, but we only 
need the last size.

No extra tests, because this only updates the metric.

Author: Li Hu 




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14980: [SPARK-17317][SparkR] Add SparkR vignette

2016-09-11 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/14980
  
I think having another PR on branch-2.0 is a good idea.
Also, should we have forward-looking statements like 
[this](https://github.com/apache/spark/pull/14980/files#r5442) in the 
version for 2.0.0 that references 2.1.0?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14980: [SPARK-17317][SparkR] Add SparkR vignette

2016-09-11 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14980#discussion_r78301288
  
--- Diff: R/pkg/vignettes/sparkr-vignettes.Rmd ---
@@ -0,0 +1,853 @@
+---
+title: "SparkR - Practical Guide"
+output:
+  html_document:
+theme: united
+toc: true
+toc_depth: 4
+toc_float: true
+highlight: textmate
+---
+
+## Overview
+
+SparkR is an R package that provides a light-weight frontend to use Apache 
Spark from R. In Spark 2.0.0, SparkR provides a distributed data frame 
implementation that supports data processing operations like selection, 
filtering, aggregation etc. and distributed machine learning using 
[MLlib](http://spark.apache.org/mllib/).
+
+## Getting Started
+
+We begin with an example running on the local machine and provide an 
overview of the use of SparkR: data ingestion, data processing and machine 
learning.
+
+First, let's load and attach the package.
+```{r, message=FALSE}
+library(SparkR)
+```
+
+`SparkSession` is the entry point into SparkR which connects your R 
program to a Spark cluster. You can create a `SparkSession` using 
`sparkR.session` and pass in options such as the application name, any Spark 
packages depended on, etc.
+
+We use default settings in which it runs in local mode. It auto downloads 
Spark package in the background if no previous installation is found. For more 
details about setup, see [Spark Session](#SetupSparkSession).
+
+```{r, message=FALSE, warning=FALSE}
+sparkR.session()
+```
+
+The operations in SparkR are centered around an R class called 
`SparkDataFrame`. It is a distributed collection of data organized into named 
columns, which is conceptually equivalent to a table in a relational database 
or a data frame in R, but with richer optimizations under the hood.
+
+`SparkDataFrame` can be constructed from a wide array of sources such as: 
structured data files, tables in Hive, external databases, or existing local R 
data frames. For example, we create a `SparkDataFrame` from a local R data 
frame,
+
+```{r}
+cars <- cbind(model = rownames(mtcars), mtcars)
+carsDF <- createDataFrame(cars)
+```
+
+We can view the first few rows of the `SparkDataFrame` by `showDF` or 
`head` function.
+```{r}
+showDF(carsDF)
+```
+
+Common data processing operations such as `filter`, `select` are supported 
on the `SparkDataFrame`.
+```{r}
+carsSubDF <- select(carsDF, "model", "mpg", "hp")
+carsSubDF <- filter(carsSubDF, carsSubDF$hp >= 200)
+showDF(carsSubDF)
+```
+
+SparkR can use many common aggregation functions after grouping.
+
+```{r}
+carsGPDF <- summarize(groupBy(carsDF, carsDF$gear), count = n(carsDF$gear))
+showDF(carsGPDF)
+```
+
+The results `carsDF` and `carsSubDF` are `SparkDataFrame` objects. To 
convert back to R `data.frame`, we can use `collect`.
+```{r}
+carsGP <- collect(carsGPDF)
+class(carsGP)
+```
+
+SparkR supports a number of commonly used machine learning algorithms. 
Under the hood, SparkR uses MLlib to train the model. Users can call `summary` 
to print a summary of the fitted model, `predict` to make predictions on new 
data, and `write.ml`/`read.ml` to save/load fitted models.
+
+SparkR supports a subset of R formula operators for model fitting, 
including ‘~’, ‘.’, ‘:’, ‘+’, and ‘-‘. We use linear 
regression as an example.
+```{r}
+model <- spark.glm(carsDF, mpg ~ wt + cyl)
+```
+
+```{r}
+summary(model)
+```
+
+The model can be saved by `write.ml` and loaded back using `read.ml`.
+```{r, eval=FALSE}
+write.ml(model, path = "/HOME/tmp/mlModel/glmModel")
+```
+
+In the end, we can stop Spark Session by running
+```{r, eval=FALSE}
+sparkR.session.stop()
+```
+
+## Setup
+
+### Installation
+
+Different from many other R packages, to use SparkR, you need an 
additional installation of Apache Spark. The Spark installation will be used to 
run a backend process that will compile and execute SparkR programs.
+
+If you don't have Spark installed on the computer, you may download it 
from [Apache Spark Website](http://spark.apache.org/downloads.html). 
Alternatively, we provide an easy-to-use function `install.spark` to complete 
this process.
+
+```{r, eval=FALSE}
+install.spark()
+```
+
+If you already have Spark installed, you don't have to install again and 
can pass the `sparkHome` argument to `sparkR.session` to let SparkR know where 
the Spark installation is.
+
+```{r, eval=FALSE}
+sparkR.session(sparkHome = "/HOME/spark")
+```
+
+### Spark Session {#SetupSparkSession}
+
+**For Windows users**: Due to different file 

[GitHub] spark pull request #14980: [SPARK-17317][SparkR] Add SparkR vignette

2016-09-11 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14980#discussion_r78301238
  
--- Diff: R/pkg/vignettes/sparkr-vignettes.Rmd ---
@@ -0,0 +1,853 @@
+---
+title: "SparkR - Practical Guide"
+output:
+  html_document:
+theme: united
+toc: true
+toc_depth: 4
+toc_float: true
+highlight: textmate
+---
+
+## Overview
+
+SparkR is an R package that provides a light-weight frontend to use Apache 
Spark from R. In Spark 2.0.0, SparkR provides a distributed data frame 
implementation that supports data processing operations like selection, 
filtering, aggregation etc. and distributed machine learning using 
[MLlib](http://spark.apache.org/mllib/).
+
+## Getting Started
+
+We begin with an example running on the local machine and provide an 
overview of the use of SparkR: data ingestion, data processing and machine 
learning.
+
+First, let's load and attach the package.
+```{r, message=FALSE}
+library(SparkR)
+```
+
+`SparkSession` is the entry point into SparkR which connects your R 
program to a Spark cluster. You can create a `SparkSession` using 
`sparkR.session` and pass in options such as the application name, any Spark 
packages depended on, etc.
+
+We use default settings in which it runs in local mode. It auto downloads 
Spark package in the background if no previous installation is found. For more 
details about setup, see [Spark Session](#SetupSparkSession).
+
+```{r, message=FALSE, warning=FALSE}
+sparkR.session()
+```
+
+The operations in SparkR are centered around an R class called 
`SparkDataFrame`. It is a distributed collection of data organized into named 
columns, which is conceptually equivalent to a table in a relational database 
or a data frame in R, but with richer optimizations under the hood.
+
+`SparkDataFrame` can be constructed from a wide array of sources such as: 
structured data files, tables in Hive, external databases, or existing local R 
data frames. For example, we create a `SparkDataFrame` from a local R data 
frame,
+
+```{r}
+cars <- cbind(model = rownames(mtcars), mtcars)
+carsDF <- createDataFrame(cars)
+```
+
+We can view the first few rows of the `SparkDataFrame` by `showDF` or 
`head` function.
+```{r}
+showDF(carsDF)
+```
+
+Common data processing operations such as `filter`, `select` are supported 
on the `SparkDataFrame`.
+```{r}
+carsSubDF <- select(carsDF, "model", "mpg", "hp")
+carsSubDF <- filter(carsSubDF, carsSubDF$hp >= 200)
+showDF(carsSubDF)
+```
+
+SparkR can use many common aggregation functions after grouping.
+
+```{r}
+carsGPDF <- summarize(groupBy(carsDF, carsDF$gear), count = n(carsDF$gear))
+showDF(carsGPDF)
+```
+
+The results `carsDF` and `carsSubDF` are `SparkDataFrame` objects. To 
convert back to R `data.frame`, we can use `collect`.
+```{r}
+carsGP <- collect(carsGPDF)
+class(carsGP)
+```
+
+SparkR supports a number of commonly used machine learning algorithms. 
Under the hood, SparkR uses MLlib to train the model. Users can call `summary` 
to print a summary of the fitted model, `predict` to make predictions on new 
data, and `write.ml`/`read.ml` to save/load fitted models.
+
+SparkR supports a subset of R formula operators for model fitting, 
including ‘~’, ‘.’, ‘:’, ‘+’, and ‘-‘. We use linear 
regression as an example.
+```{r}
+model <- spark.glm(carsDF, mpg ~ wt + cyl)
--- End diff --

Fair enough. Let's leave this as-is then.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14980: [SPARK-17317][SparkR] Add SparkR vignette

2016-09-11 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/14980#discussion_r78301233
  
--- Diff: R/pkg/vignettes/sparkr-vignettes.Rmd ---
@@ -0,0 +1,853 @@
+---
+title: "SparkR - Practical Guide"
+output:
+  html_document:
+theme: united
+toc: true
+toc_depth: 4
+toc_float: true
+highlight: textmate
+---
+
+## Overview
+
+SparkR is an R package that provides a light-weight frontend to use Apache 
Spark from R. With Spark `r packageVersion("SparkR")`, SparkR provides a 
distributed data frame implementation that supports data processing operations 
like selection, filtering, aggregation etc. and distributed machine learning 
using [MLlib](http://spark.apache.org/mllib/).
+
+## Getting Started
+
+We begin with an example running on the local machine and provide an 
overview of the use of SparkR: data ingestion, data processing and machine 
learning.
+
+First, let's load and attach the package.
+```{r, message=FALSE}
+library(SparkR)
+```
+
+`SparkSession` is the entry point into SparkR which connects your R 
program to a Spark cluster. You can create a `SparkSession` using 
`sparkR.session` and pass in options such as the application name, any Spark 
packages depended on, etc.
+
+We use default settings in which it runs in local mode. It auto downloads 
Spark package in the background if no previous installation is found. For more 
details about setup, see [Spark Session](#SetupSparkSession).
+
+```{r, message=FALSE}
+sparkR.session()
+```
+
+The operations in SparkR are centered around an R class called 
`SparkDataFrame`. It is a distributed collection of data organized into named 
columns, which is conceptually equivalent to a table in a relational database 
or a data frame in R, but with richer optimizations under the hood.
+
+`SparkDataFrame` can be constructed from a wide array of sources such as: 
structured data files, tables in Hive, external databases, or existing local R 
data frames. For example, we create a `SparkDataFrame` from a local R data 
frame,
+
+```{r}
+cars <- cbind(model = rownames(mtcars), mtcars)
+carsDF <- createDataFrame(cars)
+```
+
+We can view the first few rows of the `SparkDataFrame` by `head` or 
`showDF` function.
+```{r}
+head(carsDF)
+```
+
+Common data processing operations such as `filter`, `select` are supported 
on the `SparkDataFrame`.
+```{r}
+carsSubDF <- select(carsDF, "model", "mpg", "hp")
+carsSubDF <- filter(carsSubDF, carsSubDF$hp >= 200)
+head(carsSubDF)
+```
+
+SparkR can use many common aggregation functions after grouping.
+
+```{r}
+carsGPDF <- summarize(groupBy(carsDF, carsDF$gear), count = n(carsDF$gear))
+head(carsGPDF)
+```
+
+The results `carsDF` and `carsSubDF` are `SparkDataFrame` objects. To 
convert back to R `data.frame`, we can use `collect`. **Caution**: This can 
cause the driver to run out of memory, though, because `collect()` fetches the 
entire distributed `DataFrame` to a single machine;
--- End diff --

I think I'd suggest tuning this wording slightly - since the audience 
reading this vignette is likely running SparkR interactively, it might not be 
obvious what the "driver" is, so perhaps something like "This can cause your 
interactive environment to run out of memory because `collect()` fetches the 
entire distributed `DataFrame` to your client, which is acting as a Spark 
driver."


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15051: [SPARK-17499][ML][MLLib] make the default params ...

2016-09-11 Thread shivaram
Github user shivaram commented on a diff in the pull request:

https://github.com/apache/spark/pull/15051#discussion_r78301160
  
--- Diff: R/pkg/R/mllib.R ---
@@ -694,8 +694,8 @@ setMethod("predict", signature(object = "KMeansModel"),
 #' }
 #' @note spark.mlp since 2.1.0
 setMethod("spark.mlp", signature(data = "SparkDataFrame"),
-  function(data, blockSize = 128, layers = c(3, 5, 2), solver = 
"l-bfgs", maxIter = 100,
-   tol = 0.5, stepSize = 1, seed = 1) {
+  function(data, blockSize = 128, layers, solver = "l-bfgs", 
maxIter = 100,
--- End diff --

It's also better not to make an argument required in the middle; i.e., if 
we want to make `layers` a required argument, then we should move it before 
`blockSize`.
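
As a hedged Scala analogue of this point (the PR itself is R code, and these signatures are purely illustrative), a required parameter placed after defaulted ones forces every caller to name it:

```scala
// Required parameter buried after defaulted ones: positional calls break.
def mlpBad(blockSize: Int = 128, layers: Seq[Int], maxIter: Int = 100): Unit = ()
// mlpBad(Seq(3, 5, 2))           // does not compile: Seq is not an Int
// mlpBad(layers = Seq(3, 5, 2))  // callers must always name the argument

// Required parameter first: positional calls work as expected.
def mlpGood(layers: Seq[Int], blockSize: Int = 128, maxIter: Int = 100): Unit = ()
// mlpGood(Seq(3, 5, 2))
```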


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15051: [SPARK-17499][ML][MLLib] make the default params ...

2016-09-11 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/15051#discussion_r78301071
  
--- Diff: R/pkg/R/mllib.R ---
@@ -694,8 +694,8 @@ setMethod("predict", signature(object = "KMeansModel"),
 #' }
 #' @note spark.mlp since 2.1.0
 setMethod("spark.mlp", signature(data = "SparkDataFrame"),
-  function(data, blockSize = 128, layers = c(3, 5, 2), solver = 
"l-bfgs", maxIter = 100,
-   tol = 0.5, stepSize = 1, seed = 1) {
+  function(data, blockSize = 128, layers, solver = "l-bfgs", 
maxIter = 100,
+   tol = 1E-6, stepSize = 0.03, seed = -763139545) {
--- End diff --

It doesn't look like `seed` defaults to this value. Could you point out 
where that is specified?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15051: [SPARK-17499][ML][MLLib] make the default params in spar...

2016-09-11 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/15051
  
Thanks! Could you add some tests that use these default values (especially 
`layers` as `NULL`)?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15051: [SPARK-17499][ML][MLLib] make the default params ...

2016-09-11 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/15051#discussion_r78301008
  
--- Diff: R/pkg/R/mllib.R ---
@@ -694,8 +694,8 @@ setMethod("predict", signature(object = "KMeansModel"),
 #' }
 #' @note spark.mlp since 2.1.0
 setMethod("spark.mlp", signature(data = "SparkDataFrame"),
-  function(data, blockSize = 128, layers = c(3, 5, 2), solver = 
"l-bfgs", maxIter = 100,
-   tol = 0.5, stepSize = 1, seed = 1) {
+  function(data, blockSize = 128, layers, solver = "l-bfgs", 
maxIter = 100,
--- End diff --

I think the preference would be to have `layers = c()`; it helps to show 
that it should be a vector of potentially multiple values.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15045: [Spark Core][MINOR] fix "default partitioner cannot part...

2016-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15045
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15045: [Spark Core][MINOR] fix "default partitioner cannot part...

2016-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15045
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65229/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15045: [Spark Core][MINOR] fix "default partitioner cannot part...

2016-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15045
  
**[Test build #65229 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65229/consoleFull)**
 for PR 15045 at commit 
[`f53ad51`](https://github.com/apache/spark/commit/f53ad51cde74429dcd45505d89459d4a9d3a64cb).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15047: [SPARK-17495] [SQL] Add Hash capability semantica...

2016-09-11 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/15047#discussion_r78299559
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveHash.scala 
---
@@ -0,0 +1,145 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.util.{ArrayData, MapData}
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.Platform
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Simulates Hive's hashing function at
+ * 
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils#hashcode() 
in Hive
+ *
+ * We should use this hash function for both shuffle and bucket of Hive 
tables, so that
+ * we can guarantee shuffle and bucketing have same data distribution
+ *
+ * TODO: Support Decimal and date related types
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(a1, a2, ...) - Returns a hash value of the arguments.")
+case class HiveHash(children: Seq[Expression], seed: Int) extends 
HashExpression[Int] {
--- End diff --

It might be easiest to isolate element hashing for 
`Maps`/`Arrays`/`Rows` in the superclass. Code generation is basically string 
concatenation with some fancy tricks (look in `CodegenContext`). Ping me if you 
need a hand. 
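
To make the string-concatenation point concrete, here is a toy sketch (hedged: this is not Spark's actual `CodegenContext` API) of how per-child hash fragments could be spliced into one generated method body:

```scala
// Each child expression contributes a Java source fragment; the generated
// evaluator is just these fragments concatenated in order.
def hashFragment(childVar: String): String =
  s"result = 31 * result + $childVar;"

val childVars = Seq("value0", "value1", "value2")
val generatedBody =
  ("int result = 0;" +: childVars.map(hashFragment)).mkString("\n")
// In Spark the resulting string is compiled at runtime (via Janino) into
// the body of the generated hash method.
```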


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15048: [SPARK-17409] [SQL] Do Not Optimize Query in CTAS...

2016-09-11 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/15048#discussion_r78299463
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ddl.scala ---
@@ -37,7 +38,9 @@ case class CreateTable(tableDesc: CatalogTable, mode: 
SaveMode, query: Option[Lo
 
   override def output: Seq[Attribute] = Seq.empty[Attribute]
 
-  override def children: Seq[LogicalPlan] = query.toSeq
+  override def children: Seq[LogicalPlan] = Seq.empty[LogicalPlan]
--- End diff --

Yeah. : )


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15047: [SPARK-17495] [SQL] Add Hash capability semantica...

2016-09-11 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/15047#discussion_r78299323
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveHash.scala 
---
@@ -0,0 +1,145 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.util.{ArrayData, MapData}
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.Platform
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Simulates Hive's hashing function at
+ * 
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils#hashcode() 
in Hive
+ *
+ * We should use this hash function for both shuffle and bucket of Hive 
tables, so that
+ * we can guarantee shuffle and bucketing have same data distribution
+ *
+ * TODO: Support Decimal and date related types
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(a1, a2, ...) - Returns a hash value of the arguments.")
+case class HiveHash(children: Seq[Expression], seed: Int) extends 
HashExpression[Int] {
--- End diff --

Good catch. As far as I can see, it will produce incorrect results for string 
and complex data types, and I will have to codegen those: 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala#L292

PS: I have not worked on codegen before. How does one write, verify, and test 
codegen? On the surface it looks like writing code in a string, but I wonder 
if there are established patterns that help with this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15047: [SPARK-17495] [SQL] Add Hash capability semantica...

2016-09-11 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/15047#discussion_r78299210
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveHash.scala 
---
@@ -0,0 +1,145 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.util.{ArrayData, MapData}
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.Platform
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Simulates Hive's hashing function at
+ * 
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils#hashcode() 
in Hive
+ *
+ * We should use this hash function for both shuffle and bucket of Hive 
tables, so that
+ * we can guarantee shuffle and bucketing have same data distribution
+ *
+ * TODO: Support Decimal and date related types
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(a1, a2, ...) - Returns a hash value of the arguments.")
+case class HiveHash(children: Seq[Expression], seed: Int) extends 
HashExpression[Int] {
+  def this(arguments: Seq[Expression]) = this(arguments, 42)
+
+  override def dataType: DataType = IntegerType
+
+  override def prettyName: String = "hive-hash"
+
+  override protected def hasherClassName: String = 
classOf[HiveHash].getName
+
+  override protected def computeHash(value: Any, dataType: DataType, seed: 
Int): Int = {
+HiveHashFunction.hash(value, dataType, seed).toInt
+  }
+}
+
+object HiveHashFunction extends InterpretedHashFunction {
+  override protected def hashInt(i: Int, seed: Long): Long = {
+HiveHasher.hashInt(i, seed.toInt)
+  }
+
+  override protected def hashLong(l: Long, seed: Long): Long = {
+HiveHasher.hashLong(l, seed.toInt)
+  }
+
+  override protected def hashUnsafeBytes(base: AnyRef, offset: Long, len: 
Int, seed: Long): Long = {
+HiveHasher.hashUnsafeBytes(base, offset, len, seed.toInt)
+  }
+
+  override def hash(value: Any, dataType: DataType, seed: Long): Long = {
+value match {
+  case s: UTF8String =>
+val bytes = s.getBytes
+var result: Int = 0
+var i = 0
+while (i < bytes.length) {
+  result = (result * 31) + bytes(i).toInt
+  i += 1
+}
+result
+
+
+  case array: ArrayData =>
+val elementType = dataType match {
+  case udt: UserDefinedType[_] => 
udt.sqlType.asInstanceOf[ArrayType].elementType
--- End diff --

@cloud-fan I think you wrote the initial version. Could you tell us 
what is happening here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15047: [SPARK-17495] [SQL] Add Hash capability semantica...

2016-09-11 Thread tejasapatil
Github user tejasapatil commented on a diff in the pull request:

https://github.com/apache/spark/pull/15047#discussion_r78299106
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveHash.scala 
---
@@ -0,0 +1,145 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.util.{ArrayData, MapData}
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.Platform
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Simulates Hive's hashing function at
+ * 
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils#hashcode() 
in Hive
+ *
+ * We should use this hash function for both shuffle and bucket of Hive 
tables, so that
+ * we can guarantee shuffle and bucketing have same data distribution
+ *
+ * TODO: Support Decimal and date related types
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(a1, a2, ...) - Returns a hash value of the arguments.")
+case class HiveHash(children: Seq[Expression], seed: Int) extends 
HashExpression[Int] {
+  def this(arguments: Seq[Expression]) = this(arguments, 42)
+
+  override def dataType: DataType = IntegerType
+
+  override def prettyName: String = "hive-hash"
+
+  override protected def hasherClassName: String = 
classOf[HiveHash].getName
+
+  override protected def computeHash(value: Any, dataType: DataType, seed: 
Int): Int = {
+HiveHashFunction.hash(value, dataType, seed).toInt
+  }
+}
+
+object HiveHashFunction extends InterpretedHashFunction {
+  override protected def hashInt(i: Int, seed: Long): Long = {
+HiveHasher.hashInt(i, seed.toInt)
+  }
+
+  override protected def hashLong(l: Long, seed: Long): Long = {
+HiveHasher.hashLong(l, seed.toInt)
+  }
+
+  override protected def hashUnsafeBytes(base: AnyRef, offset: Long, len: 
Int, seed: Long): Long = {
+HiveHasher.hashUnsafeBytes(base, offset, len, seed.toInt)
+  }
+
+  override def hash(value: Any, dataType: DataType, seed: Long): Long = {
+value match {
+  case s: UTF8String =>
+val bytes = s.getBytes
+var result: Int = 0
+var i = 0
+while (i < bytes.length) {
+  result = (result * 31) + bytes(i).toInt
+  i += 1
+}
+result
+
+
+  case array: ArrayData =>
+val elementType = dataType match {
+  case udt: UserDefinedType[_] => 
udt.sqlType.asInstanceOf[ArrayType].elementType
--- End diff --

I mimicked exactly what happens in the case of `Murmur3Hash`. See 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala#L388
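
For readers following along, the pattern in question (assuming the Spark SQL types package on the classpath) unwraps a user-defined type to its underlying Catalyst `sqlType` before picking the element type to hash:

```scala
import org.apache.spark.sql.types.{ArrayType, DataType, UserDefinedType}

// A UDT stores its data using an underlying Catalyst sqlType, so an
// array-backed UDT must be unwrapped before its element type is known.
def arrayElementType(dataType: DataType): DataType = dataType match {
  case udt: UserDefinedType[_]   => udt.sqlType.asInstanceOf[ArrayType].elementType
  case ArrayType(elementType, _) => elementType
  case other                     => sys.error(s"not an array type: $other")
}
```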


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15051: [SPARK-17499][ML][MLLib] make the default params in spar...

2016-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15051
  
**[Test build #65228 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65228/consoleFull)**
 for PR 15051 at commit 
[`8a87b86`](https://github.com/apache/spark/commit/8a87b8696f68cc9d11b4b46a3eeef2986f6b9a0a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15051: [SPARK-17499][ML][MLLib] make the default params in spar...

2016-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15051
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65228/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15051: [SPARK-17499][ML][MLLib] make the default params in spar...

2016-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15051
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14644: [MESOS] Enable GPU support with Mesos

2016-09-11 Thread klueska
Github user klueska commented on a diff in the pull request:

https://github.com/apache/spark/pull/14644#discussion_r78298850
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala ---
@@ -103,6 +103,7 @@ private[spark] class MesosCoarseGrainedSchedulerBackend(
   private val stateLock = new ReentrantLock
 
   val extraCoresPerExecutor = conf.getInt("spark.mesos.extra.cores", 0)
+  val maxGpus = conf.getInt("spark.mesos.gpus.max", 0)
--- End diff --

I'm not saying it's not sensible. I'm just trying to figure out what I can do 
to tell it to accept all of the GPUs in an offer (which is what I want in my setup). 
Some offers have more GPUs than others, and it feels weird to just pick a really big 
number to ensure that I get them all.
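
For illustration only, one shape that could take is a negative sentinel meaning
"accept everything offered" (a sketch under assumed names; `getResource` stands in
for a scheduler-utils helper and the sentinel is not part of this patch):

```Scala
// Hypothetical sentinel: maxGpus < 0 means "take every GPU in the offer".
val maxGpus = conf.getInt("spark.mesos.gpus.max", 0)
val offeredGpus = getResource(offer.getResourcesList, "gpus").toInt  // assumed helper
val gpusToTake = if (maxGpus < 0) offeredGpus else math.min(maxGpus, offeredGpus)
```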


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15051: [SPARK-17499][ML][MLLib] make the default params in spar...

2016-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15051
  
**[Test build #65228 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65228/consoleFull)**
 for PR 15051 at commit 
[`8a87b86`](https://github.com/apache/spark/commit/8a87b8696f68cc9d11b4b46a3eeef2986f6b9a0a).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15045: [Spark Core][MINOR] fix "default partitioner cannot part...

2016-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15045
  
**[Test build #65229 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65229/consoleFull)**
 for PR 15045 at commit 
[`f53ad51`](https://github.com/apache/spark/commit/f53ad51cde74429dcd45505d89459d4a9d3a64cb).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14644: [MESOS] Enable GPU support with Mesos

2016-09-11 Thread tnachen
Github user tnachen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14644#discussion_r78298417
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala ---
@@ -103,6 +103,7 @@ private[spark] class MesosCoarseGrainedSchedulerBackend(
   private val stateLock = new ReentrantLock
 
   val extraCoresPerExecutor = conf.getInt("spark.mesos.extra.cores", 0)
+  val maxGpus = conf.getInt("spark.mesos.gpus.max", 0)
--- End diff --

Which sounds sensible to me, since GPUs are not usually required to run a 
Spark job. Also, `cores.max` is an aggregate max, whereas `gpus.max` in the current 
patch is a per-node max. I think I will change this to work the way `cores.max` works, 
but default to 0.
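
A rough sketch of that `cores.max`-style aggregate accounting (field and method
names here are illustrative, not the final patch):

```Scala
// Running total across all executors, capped by spark.mesos.gpus.max
// (which defaults to 0, i.e. no GPUs requested).
private var totalGpusAcquired = 0

private def gpusToTakeFrom(offeredGpus: Int): Int = {
  val remaining = math.max(maxGpus - totalGpusAcquired, 0)
  math.min(remaining, offeredGpus)
}

// At launch time: totalGpusAcquired += gpusToTakeFrom(offeredGpus)
```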


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14980: [SPARK-17317][SparkR] Add SparkR vignette

2016-09-11 Thread shivaram
Github user shivaram commented on the issue:

https://github.com/apache/spark/pull/14980
  
Yeah, I think it'll be good to do a separate PR and make sure it builds 
against the corresponding Scala code in branch-2.0, etc. But let's do it after all the 
comments here are addressed and this is merged.

@felixcheung Could you take one more pass ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15048: [SPARK-17409] [SQL] Do Not Optimize Query in CTAS...

2016-09-11 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/15048#discussion_r78298129
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ddl.scala ---
@@ -37,7 +38,9 @@ case class CreateTable(tableDesc: CatalogTable, mode: SaveMode, query: Option[LogicalPlan]
 
   override def output: Seq[Attribute] = Seq.empty[Attribute]
 
-  override def children: Seq[LogicalPlan] = query.toSeq
+  override def children: Seq[LogicalPlan] = Seq.empty[LogicalPlan]
--- End diff --

extend LeafNode?
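
For context, a sketch of what that suggestion would look like (simplified; assumes
Catalyst's `LeafNode`, which fixes `children` to `Nil`, so the override above becomes
unnecessary):

```Scala
case class CreateTable(
    tableDesc: CatalogTable,
    mode: SaveMode,
    query: Option[LogicalPlan])
  extends LeafNode {

  // LeafNode already defines children as Nil; only output needs overriding.
  override def output: Seq[Attribute] = Seq.empty[Attribute]
}
```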


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15047: [SPARK-17495] [SQL] Add Hash capability semantically equ...

2016-09-11 Thread hvanhovell
Github user hvanhovell commented on the issue:

https://github.com/apache/spark/pull/15047
  
@tejasapatil this looks pretty good overall. I left a few comments.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15047: [SPARK-17495] [SQL] Add Hash capability semantica...

2016-09-11 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/15047#discussion_r78297960
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveHash.scala ---
@@ -0,0 +1,145 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hive
+
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.util.{ArrayData, MapData}
+import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.Platform
+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Simulates Hive's hashing function at
+ * org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils#hashCode()
+ *
+ * We should use this hash function for both shuffle and bucketing of Hive tables, so that
+ * we can guarantee that shuffle and bucketing have the same data distribution.
+ *
+ * TODO: Support Decimal and date related types
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(a1, a2, ...) - Returns a hash value of the arguments.")
+case class HiveHash(children: Seq[Expression], seed: Int) extends HashExpression[Int] {
--- End diff --

This will also produce a code-generated hash. Does the current 
implementation produce a Hive-compatible hash?
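
One way to check both paths at once would be a test along these lines (a sketch;
assumes Catalyst's `ExpressionEvalHelper`, which evaluates an expression through
both the interpreted and the code-generated path, and assumes Hive's 31-polynomial
string hash for the expected value):

```Scala
class HiveHashSuite extends SparkFunSuite with ExpressionEvalHelper {
  test("string hash matches Hive") {
    // 96354 == "abc".hashCode in Java, which Hive's string hashing reproduces.
    checkEvaluation(HiveHash(Seq(Literal("abc")), seed = 0), 96354)
  }
}
```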


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #14971: [SPARK-17410] [SPARK-17284] Move Hive-generated Stats In...

2016-09-11 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/14971
  
... Very surprised about Hive... Any `ALTER TABLE SET/UNSET TBLPROPERTIES` 
statement can invalidate the Hive-generated statistics...

```Scala
hiveClient.runSqlHive(s"ANALYZE TABLE $oldName COMPUTE STATISTICS")
hiveClient.runSqlHive(s"DESCRIBE FORMATTED $oldName").foreach(println)
```
```
Table Parameters:
COLUMN_STATS_ACCURATE   true
numFiles1   
numRows 500 
rawDataSize 5312
spark.sql.statistics.numRows500 
spark.sql.statistics.totalSize  5812
totalSize   5812
transient_lastDdlTime   1473610039  
```
```Scala
hiveClient.runSqlHive(s"ALTER TABLE $oldName SET TBLPROPERTIES ('foofoo' = 
'a')")
hiveClient.runSqlHive(s"DESCRIBE FORMATTED $oldName").foreach(println)
```
```
Table Parameters:
COLUMN_STATS_ACCURATE   false   
foofoo  a   
last_modified_byxiaoli  
last_modified_time  1473610039  
numFiles1   
numRows -1  
rawDataSize -1  
spark.sql.statistics.numRows500 
spark.sql.statistics.totalSize  5812
totalSize   5812
transient_lastDdlTime   1473610039  
```
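
A side note on why the `spark.sql.statistics.*` entries survive: they are ordinary
table properties, so Spark can read them back regardless of what Hive does to its
own counters. A sketch (assuming `CatalogTable.properties` as the property map):

```Scala
// Spark-managed stats live under the spark.sql.statistics. prefix and are
// untouched by Hive setting numRows/rawDataSize to -1.
val props = catalogTable.properties
val numRows = props.get("spark.sql.statistics.numRows").map(BigInt(_))
val totalSize = props.get("spark.sql.statistics.totalSize").map(_.toLong)
```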



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


