[GitHub] spark issue #18488: [SPARK-21255][SQL] Fixed NPE when creating encoder for e...

2017-06-30 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/18488
  
@mike0sv Looks good, thanks. It will make this easier for us to understand in 
the future.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18495: [SPARK-21275][ML] Update GLM test to use supportedFamily...

2017-06-30 Thread yanboliang
Github user yanboliang commented on the issue:

https://github.com/apache/spark/pull/18495
  
LGTM, merged into master. Thanks.





[GitHub] spark issue #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleInvalid t...

2017-06-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18496
  
Merged build finished. Test PASSed.





[GitHub] spark issue #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleInvalid t...

2017-06-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18496
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79012/





[GitHub] spark issue #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleInvalid t...

2017-06-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18496
  
**[Test build #79012 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79012/testReport)** for PR 18496 at commit [`fa7bd4b`](https://github.com/apache/spark/commit/fa7bd4b38cb8335a9a42f53424dad9f78eeae8b2).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16028: [SPARK-18518][ML] HasSolver supports override

2017-06-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16028
  
**[Test build #79013 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79013/testReport)** for PR 16028 at commit [`2e280d5`](https://github.com/apache/spark/commit/2e280d5c2e5b4d05ac57ab4f72b28731d4ef7f9d).





[GitHub] spark issue #18023: [SPARK-12139] [SQL] REGEX Column Specification

2017-06-30 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/18023
  
Let me post a confusing error message.
```Scala
withSQLConf(SQLConf.SUPPORT_QUOTED_REGEX_COLUMN_NAME.key -> "true") {
  Seq((1, 1)).toDF("key", "value").createOrReplaceTempView("test")
  sql("select `key` from test where `key` > 3").show()
}
```

The error message is:
```
Invalid call to dataType on unresolved object, tree: key
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: key
at org.apache.spark.sql.catalyst.analysis.Star.dataType(unresolved.scala:250)
at org.apache.spark.sql.catalyst.expressions.BinaryOperator.checkInputDataTypes(Expression.scala:517)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:92)
```

Users might use quoted attributes anywhere in a query.
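
A rough way to picture the problem, sketched in plain Python rather than Spark. It assumes, hypothetically, that with the flag on a backtick-quoted name is treated as a regular expression matched against each full column name. The helper `expand_quoted` is illustrative only and is not Spark's implementation.

```python
import re


def expand_quoted(pattern, columns):
    """Return the columns whose full name matches the quoted pattern."""
    regex = re.compile(pattern)
    return [c for c in columns if regex.fullmatch(c)]


columns = ["key", "value"]

# In a SELECT list, expanding one quoted pattern to several columns is fine.
selected = expand_quoted("(key|value)", columns)

# In a predicate such as `key` > 3, exactly one attribute is needed, so a
# plain quoted name should resolve to a single column and not stay a
# star-like expansion (the trace above fails in Star.dataType).
predicate_cols = expand_quoted("key", columns)
```

Under this assumption the quoted name in the WHERE clause still identifies exactly one column, which is why the unresolved-object error is confusing to a user.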





[GitHub] spark pull request #18023: [SPARK-12139] [SQL] REGEX Column Specification

2017-06-30 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18023#discussion_r125155245
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -847,6 +847,12 @@ object SQLConf {
   .intConf
   .createWithDefault(UnsafeExternalSorter.DEFAULT_NUM_ELEMENTS_FOR_SPILL_THRESHOLD.toInt)
 
+  val SUPPORT_QUOTED_REGEX_COLUMN_NAME = buildConf("spark.sql.parser.quotedRegexColumnNames")
+.doc("When true, quoted Identifiers (using backticks) in SELECT statement are interpreted" +
--- End diff --

Not only SELECT statements. This can apply to almost any query.





[GitHub] spark issue #17758: [SPARK-20460][SPARK-21144][SQL] Make it more consistent ...

2017-06-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17758
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79011/





[GitHub] spark issue #17758: [SPARK-20460][SPARK-21144][SQL] Make it more consistent ...

2017-06-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17758
  
Merged build finished. Test FAILed.





[GitHub] spark issue #17758: [SPARK-20460][SPARK-21144][SQL] Make it more consistent ...

2017-06-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17758
  
**[Test build #79011 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79011/testReport)** for PR 17758 at commit [`a874fcc`](https://github.com/apache/spark/commit/a874fcc776cc97480e55a92e6ae4193d4c71c72d).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class PreprocessDDLCommands(sparkSession: SparkSession) extends Rule[LogicalPlan] `





[GitHub] spark issue #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleInvalid t...

2017-06-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18496
  
**[Test build #79012 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79012/testReport)** for PR 18496 at commit [`fa7bd4b`](https://github.com/apache/spark/commit/fa7bd4b38cb8335a9a42f53424dad9f78eeae8b2).





[GitHub] spark pull request #16028: [SPARK-18518][ML] HasSolver supports override

2017-06-30 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16028#discussion_r125155063
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
@@ -53,7 +53,23 @@ import org.apache.spark.storage.StorageLevel
 private[regression] trait LinearRegressionParams extends PredictorParams
 with HasRegParam with HasElasticNetParam with HasMaxIter with HasTol
 with HasFitIntercept with HasStandardization with HasWeightCol with HasSolver
-with HasAggregationDepth
+with HasAggregationDepth {
+
+  import LinearRegression._
+
+  /**
+   * The solver algorithm for optimization.
+   * Supported options: "l-bfgs", "normal" and "auto".
+   * Default: "auto"
+   *
+   * @group expertParam
+   */
+  @Since("2.3.0")
--- End diff --

```expertParam``` -> ```param```





[GitHub] spark pull request #16028: [SPARK-18518][ML] HasSolver supports override

2017-06-30 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16028#discussion_r125155049
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala ---
@@ -75,17 +78,13 @@ private[classification] trait MultilayerPerceptronParams extends PredictorParams
* Supported options: "gd" (minibatch gradient descent) or "l-bfgs".
* Default: "l-bfgs"
*
-   * @group expertParam
+   * @group param
--- End diff --

This should be ```expertParam```.





[GitHub] spark issue #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleInvalid t...

2017-06-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18496
  
**[Test build #79009 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79009/testReport)** for PR 18496 at commit [`a2cdf51`](https://github.com/apache/spark/commit/a2cdf511f6ad346efcb81d51f3b805a34063fa0f).
 * This patch **fails SparkR unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleInvalid t...

2017-06-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18496
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79009/





[GitHub] spark issue #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleInvalid t...

2017-06-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18496
  
Merged build finished. Test FAILed.





[GitHub] spark pull request #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleIn...

2017-06-30 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18496#discussion_r125154756
  
--- Diff: R/pkg/R/mllib_tree.R ---
@@ -409,7 +413,7 @@ setMethod("spark.randomForest", signature(data = "SparkDataFrame", formula = "fo
maxDepth = 5, maxBins = 32, numTrees = 20, impurity = NULL,
featureSubsetStrategy = "auto", seed = NULL, subsamplingRate = 1.0,
minInstancesPerNode = 1, minInfoGain = 0.0, checkpointInterval = 10,
-   maxMemoryInMB = 256, cacheNodeIds = FALSE) {
+   maxMemoryInMB = 256, cacheNodeIds = FALSE, handleInvalid = "error") {
--- End diff --

Let me check how to use match.arg().





[GitHub] spark pull request #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleIn...

2017-06-30 Thread wangmiao1981
Github user wangmiao1981 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18496#discussion_r125154735
  
--- Diff: R/pkg/R/mllib_tree.R ---
@@ -374,6 +374,10 @@ setMethod("write.ml", signature(object = "GBTClassificationModel", path = "chara
 #' nodes. If TRUE, the algorithm will cache node IDs for each instance. Caching
 #' can speed up training of deeper trees. Users can set how often should the
 #' cache be checkpointed or disable it by setting checkpointInterval.
+#' @param handleInvalid How to handle invalid data (unseen labels or NULL values) in classification model.
--- End diff --

I think `labels` here means the string labels of a categorical feature 
(e.g., `white`, `black`).
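
To make the "unseen label of a categorical feature" case concrete, here is a minimal stand-alone sketch in plain Python. It is not Spark code: `fit_index` and `transform` are hypothetical helpers, and the option names `"error"`, `"skip"` and `"keep"` are assumed to mirror the string-indexing options (the diff itself only shows `"error"` as the default).

```python
from collections import Counter


def fit_index(values):
    """Map each label to an index, most frequent label first.

    Ties are broken alphabetically here, which is an assumption of this sketch.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {label: i for i, label in enumerate(ordered)}


def transform(values, index, handle_invalid="error"):
    """Index the values, treating labels unseen at fit time as invalid."""
    out = []
    for v in values:
        if v in index:
            out.append(index[v])
        elif handle_invalid == "skip":
            continue  # drop the row with the unseen label
        elif handle_invalid == "keep":
            out.append(len(index))  # put all unseen labels in one extra bucket
        else:
            raise ValueError(f"Unseen label: {v}")
    return out


index = fit_index(["white", "black", "white"])
```

With this index, a test row whose colour is `red` either fails, is dropped, or lands in an extra bucket, depending on the option.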





[GitHub] spark pull request #18481: [SPARK-20889][SparkR] Grouped documentation for W...

2017-06-30 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18481#discussion_r125154492
  
--- Diff: R/pkg/R/functions.R ---
@@ -2871,10 +2874,10 @@ setMethod("ifelse",
 
 #' @details
 #' \code{cume_dist}: Returns the cumulative distribution of values within a window partition,
-#' i.e. the fraction of rows that are below the current row.
-#' N = total number of rows in the partition
-#' cume_dist(x) = number of values before (and including) x / N
+#' i.e. the fraction of rows that are below the current row:
+#' number of values before (and including) x / total number of rows in the partition.
--- End diff --

With that many words, the fact that this is a formula gets lost. How about 
formatting it like the other formula:
```
(number of values before and including x) / (total number of rows in the partition)
```
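
As a quick check of the formula, here is a small plain-Python sketch (not SparkR). The function name `cume_dist` is reused only for illustration, and it treats the whole list as a single window partition ordered by value.

```python
def cume_dist(values):
    """For each x: (number of values before and including x) / (total rows)."""
    n = len(values)
    return [sum(1 for v in values if v <= x) / n for x in values]


result = cume_dist([10, 20, 20, 30])
```

Tied values share the same result, since both rows with value 20 count each other as "before and including".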





[GitHub] spark pull request #18481: [SPARK-20889][SparkR] Grouped documentation for W...

2017-06-30 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18481#discussion_r125154702
  
--- Diff: R/pkg/R/functions.R ---
@@ -2844,27 +2872,16 @@ setMethod("ifelse",
 
 ## Window functions##
 
-#' cume_dist
-#'
-#' Window function: returns the cumulative distribution of values within a window partition,
-#' i.e. the fraction of rows that are below the current row.
-#'
-#'   N = total number of rows in the partition
-#'   cume_dist(x) = number of values before (and including) x / N
-#'
+#' @details
+#' \code{cume_dist}: Returns the cumulative distribution of values within a window partition,
+#' i.e. the fraction of rows that are below the current row:
+#' number of values before (and including) x / total number of rows in the partition.
 #' This is equivalent to the \code{CUME_DIST} function in SQL.
+#' This should be used with no argument.
--- End diff --

In the earlier PR, you wrote `The method should be used with no argument.`





[GitHub] spark pull request #18485: [SPARK-21267][SS][DOCS] Update Structured Streami...

2017-06-30 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18485#discussion_r125154655
  
--- Diff: docs/structured-streaming-programming-guide.md ---
@@ -758,6 +763,16 @@ count(groupBy(df, "deviceType"))
 
 
 
+You can also register a streaming DataFrame/Dataset as a temporary view and then apply SQL commands on it.
+
+{% highlight scala %}
--- End diff --

enclose this in
```

 

```
?
or, add example in java, python, r too?





[GitHub] spark pull request #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleIn...

2017-06-30 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18496#discussion_r125154606
  
--- Diff: R/pkg/R/mllib_tree.R ---
@@ -374,6 +374,10 @@ setMethod("write.ml", signature(object = "GBTClassificationModel", path = "chara
 #' nodes. If TRUE, the algorithm will cache node IDs for each instance. Caching
 #' can speed up training of deeper trees. Users can set how often should the
 #' cache be checkpointed or disable it by setting checkpointInterval.
+#' @param handleInvalid How to handle invalid data (unseen labels or NULL values) in classification model.
--- End diff --

is this on "features" or "labels"? it seems it's only set on RFormula.terms 
which are features





[GitHub] spark pull request #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleIn...

2017-06-30 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/18496#discussion_r125154616
  
--- Diff: R/pkg/R/mllib_tree.R ---
@@ -409,7 +413,7 @@ setMethod("spark.randomForest", signature(data = "SparkDataFrame", formula = "fo
maxDepth = 5, maxBins = 32, numTrees = 20, impurity = NULL,
featureSubsetStrategy = "auto", seed = NULL, subsamplingRate = 1.0,
minInstancesPerNode = 1, minInfoGain = 0.0, checkpointInterval = 10,
-   maxMemoryInMB = 256, cacheNodeIds = FALSE) {
+   maxMemoryInMB = 256, cacheNodeIds = FALSE, handleInvalid = "error") {
--- End diff --

use match.arg(), and then no need to as.character(handleInvalid)

also, perhaps handleInvalid is a bit generic? maybe something that says it 
has to do with labels? or label string indexing?
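
For readers less familiar with R's `match.arg()`, the behaviour being asked for is roughly the following, sketched as a Python analog. The helper `match_arg` and the option list are assumptions for illustration; the diff only shows `"error"` as the default.

```python
def match_arg(value, choices):
    """Rough analog of R's match.arg().

    None selects the first choice, an exact or unique-prefix match selects
    that choice, and anything else is rejected.
    """
    if value is None:
        return choices[0]
    if value in choices:
        return value
    hits = [c for c in choices if c.startswith(value)]
    if len(hits) == 1:
        return hits[0]
    raise ValueError(f"'arg' should be one of {choices}")


# Assumed option set; listing the choices in the signature makes the first
# one the default and removes the need for a separate string conversion.
HANDLE_INVALID = ["error", "keep", "skip"]

default_choice = match_arg(None, HANDLE_INVALID)
partial_choice = match_arg("sk", HANDLE_INVALID)
```

The point of the suggestion is that validation, defaulting and type normalization happen in one call, so an unsupported value fails early with a clear message.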





[GitHub] spark issue #18462: [Docs] Removed invalid joinTypes from javadoc of Dataset...

2017-06-30 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/18462
  
how about checking if we have tests for these two types (as not supported)?





[GitHub] spark issue #18307: [SPARK-21100][SQL] describe should give quartiles simila...

2017-06-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18307
  
Merged build finished. Test PASSed.





[GitHub] spark issue #18307: [SPARK-21100][SQL] describe should give quartiles simila...

2017-06-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18307
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79007/





[GitHub] spark issue #18307: [SPARK-21100][SQL] describe should give quartiles simila...

2017-06-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18307
  
**[Test build #79007 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79007/testReport)** for PR 18307 at commit [`cba1b0e`](https://github.com/apache/spark/commit/cba1b0e6c3ac32f7cb327ead54f6e8307aed00ac).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #17280: [SPARK-19939] [ML] Add support for association ru...

2017-06-30 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/17280#discussion_r125154331
  
--- Diff: python/pyspark/ml/fpm.py ---
@@ -186,29 +186,29 @@ class FPGrowth(JavaEstimator, HasItemsCol, HasPredictionCol,
 |[z]                     |
 |[x, z, y, r, q, t, p]   |
 +------------------------+
->>> fp = FPGrowth(minSupport=0.2, minConfidence=0.7)
+>>> fp = FPGrowth(minSupport=0.4, minConfidence=0.7)
 >>> fpm = fp.fit(data)
 >>> fpm.freqItemsets.show(5)
-+---------+----+
-|    items|freq|
-+---------+----+
-|      [s]|   3|
-|   [s, x]|   3|
-|[s, x, z]|   2|
-|   [s, z]|   2|
-|      [r]|   3|
-+---------+----+
++------+----+
+| items|freq|
++------+----+
+|   [s]|   3|
+|[s, x]|   3|
+|   [r]|   3|
+|   [y]|   3|
+|[y, x]|   3|
++------+----+
--- End diff --

this seems to change the result quite a bit, is this expected?
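
One way to see why the output shifts, as a plain-Python sketch rather than Spark code: it assumes the minimum count is `ceil(minSupport * numTransactions)` and that the doctest data has six transactions (the diff shows only the last rows, so the count is an assumption). The frequencies are the five rows from the old output.

```python
import math


def min_count(min_support, num_transactions):
    """Smallest frequency an itemset needs to be reported."""
    return math.ceil(min_support * num_transactions)


def frequent(freq, min_support, num_transactions):
    """Keep the itemsets whose frequency reaches the minimum count."""
    threshold = min_count(min_support, num_transactions)
    return [items for items, count in freq.items() if count >= threshold]


# Frequencies taken from the removed lines of the diff.
freq = {("s",): 3, ("s", "x"): 3, ("s", "x", "z"): 2, ("s", "z"): 2, ("r",): 3}

old = frequent(freq, 0.2, 6)  # threshold 2: all five itemsets survive
new = frequent(freq, 0.4, 6)  # threshold 3: itemsets with freq 2 drop out
```

Under these assumptions, raising `minSupport` from 0.2 to 0.4 removes `[s, x, z]` and `[s, z]`, so `show(5)` moves on to later itemsets such as `[y]` and `[y, x]`.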





[GitHub] spark pull request #17280: [SPARK-19939] [ML] Add support for association ru...

2017-06-30 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/17280#discussion_r125154338
  
--- Diff: python/pyspark/ml/fpm.py ---
@@ -186,29 +186,29 @@ class FPGrowth(JavaEstimator, HasItemsCol, HasPredictionCol,
 |[z]                     |
 |[x, z, y, r, q, t, p]   |
 +------------------------+
->>> fp = FPGrowth(minSupport=0.2, minConfidence=0.7)
+>>> fp = FPGrowth(minSupport=0.4, minConfidence=0.7)
 >>> fpm = fp.fit(data)
 >>> fpm.freqItemsets.show(5)
-+---------+----+
-|    items|freq|
-+---------+----+
-|      [s]|   3|
-|   [s, x]|   3|
-|[s, x, z]|   2|
-|   [s, z]|   2|
-|      [r]|   3|
-+---------+----+
++------+----+
+| items|freq|
++------+----+
+|   [s]|   3|
+|[s, x]|   3|
+|   [r]|   3|
+|   [y]|   3|
+|[y, x]|   3|
++------+----+
--- End diff --

or rather, why is 
https://github.com/apache/spark/pull/17280/files#diff-b6dbf16870bd2cca9b4140df8aebd681L189
 changed?





[GitHub] spark issue #14431: [SPARK-16258][SparkR] Automatically append the grouping ...

2017-06-30 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/14431
  
btw, if the key is the very first column, that sounds like prepending rather 
than appending?
perhaps `return.data.frame.key.column` = `FALSE`?

and about your comment, do you mean `key` in `function(key, x) { x }`?
IMO it's quite helpful to know which group (i.e. key) the UDF is processing.






[GitHub] spark issue #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...

2017-06-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18444
  
Merged build finished. Test FAILed.





[GitHub] spark issue #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...

2017-06-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18444
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79008/
Test FAILed.





[GitHub] spark pull request #18334: [SPARK-21127] [SQL] Update statistics after data ...

2017-06-30 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18334#discussion_r125154201
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala
 ---
@@ -97,6 +106,10 @@ object CommandUtils extends Logging {
   0L
   }
 }.getOrElse(0L)
+val durationInMs = (System.nanoTime() - startTime) / (1000 * 1000)
+logInfo(s"It took $durationInMs ms to calculate the total file size 
under path $locationUri.")
--- End diff --

Actually, the log message already contains a timestamp, so it does not strictly 
need to calculate the total time, but I think it is fine here.
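
For readers outside the Spark codebase, the elapsed-time pattern in the quoted diff can be sketched as below. This is only an illustration in Python, not the Scala code under review; the helper name `timed_ms` is hypothetical.

```python
import time


def timed_ms(fn, *args, **kwargs):
    """Run fn and return (result, elapsed time in whole milliseconds).

    Hypothetical helper: it mirrors the nanosecond-difference pattern
    from the quoted diff, using Python's monotonic nanosecond clock.
    """
    start = time.perf_counter_ns()
    result = fn(*args, **kwargs)
    duration_ms = (time.perf_counter_ns() - start) // (1000 * 1000)
    return result, duration_ms


total, duration_ms = timed_ms(sum, range(1000))
print(f"It took {duration_ms} ms to calculate the total.")
```

A monotonic clock is used here because wall-clock timestamps in log lines can jump when the system time is adjusted.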





[GitHub] spark issue #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...

2017-06-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18444
  
**[Test build #79008 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79008/testReport)**
 for PR 18444 at commit 
[`930d16b`](https://github.com/apache/spark/commit/930d16bc26128584bcd6e1194f1340f1bba86fc9).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #18479: [SPARK-21273][SQL] Propagate logical plan stats u...

2017-06-30 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/18479





[GitHub] spark issue #17640: [SPARK-17608][SPARKR]:Long type has incorrect serializat...

2017-06-30 Thread wangmiao1981
Github user wangmiao1981 commented on the issue:

https://github.com/apache/spark/pull/17640
  
@jiangxb1987 The original PR has some issues that are not correctly 
handled. I will open a new PR when I figure out the right fix. I intended to 
close this PR. Thanks for closing it.





[GitHub] spark issue #18479: [SPARK-21273][SQL] Propagate logical plan stats using vi...

2017-06-30 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/18479
  
Merging to master.





[GitHub] spark issue #18479: [SPARK-21273][SQL] Propagate logical plan stats using vi...

2017-06-30 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/18479
  
LGTM 





[GitHub] spark issue #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...

2017-06-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18444
  
**[Test build #79010 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79010/testReport)**
 for PR 18444 at commit 
[`1b1c419`](https://github.com/apache/spark/commit/1b1c419ff73117508c65a424839063902ce70d21).
 * This patch **fails Python style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...

2017-06-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18444
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79010/
Test FAILed.





[GitHub] spark issue #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...

2017-06-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18444
  
Merged build finished. Test FAILed.





[GitHub] spark issue #18465: [SPARK-21093][R] Terminate R's worker processes in the p...

2017-06-30 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/18465
  
I see, quite possibly it is bubbled up more because of that change. 





[GitHub] spark pull request #18479: [SPARK-21273][SQL] Propagate logical plan stats u...

2017-06-30 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18479#discussion_r125154131
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/BasicStatsEstimationSuite.scala
 ---
@@ -77,37 +77,6 @@ class BasicStatsEstimationSuite extends 
StatsEstimationTestBase {
 checkStats(globalLimit, stats)
   }
 
-  test("sample estimation") {
-val sample = Sample(0.0, 0.5, withReplacement = false, (math.random * 
1000).toLong, plan)
-checkStats(sample, Statistics(sizeInBytes = 60, rowCount = Some(5)))
-
-// Child doesn't have rowCount in stats
-val childStats = Statistics(sizeInBytes = 120)
-val childPlan = DummyLogicalPlan(childStats, childStats)
-val sample2 =
-  Sample(0.0, 0.11, withReplacement = false, (math.random * 
1000).toLong, childPlan)
-checkStats(sample2, Statistics(sizeInBytes = 14))
-  }
-
-  test("estimate statistics when the conf changes") {
-val expectedDefaultStats =
-  Statistics(
-sizeInBytes = 40,
-rowCount = Some(10),
-attributeStats = AttributeMap(Seq(
-  AttributeReference("c1", IntegerType)() -> ColumnStat(10, 
Some(1), Some(10), 0, 4, 4
-val expectedCboStats =
-  Statistics(
-sizeInBytes = 4,
-rowCount = Some(1),
-attributeStats = AttributeMap(Seq(
-  AttributeReference("c1", IntegerType)() -> ColumnStat(1, 
Some(5), Some(5), 0, 4, 4
-
-val plan = DummyLogicalPlan(defaultStats = expectedDefaultStats, 
cboStats = expectedCboStats)
-checkStats(
-  plan, expectedStatsCboOn = expectedCboStats, expectedStatsCboOff = 
expectedDefaultStats)
-  }
--- End diff --

Need to add them back in the follow-up PR





[GitHub] spark issue #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...

2017-06-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18444
  
**[Test build #79010 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79010/testReport)**
 for PR 18444 at commit 
[`1b1c419`](https://github.com/apache/spark/commit/1b1c419ff73117508c65a424839063902ce70d21).





[GitHub] spark issue #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleInvalid t...

2017-06-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18496
  
**[Test build #79009 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79009/testReport)**
 for PR 18496 at commit 
[`a2cdf51`](https://github.com/apache/spark/commit/a2cdf511f6ad346efcb81d51f3b805a34063fa0f).





[GitHub] spark issue #17758: [SPARK-20460][SPARK-21144][SQL] Make it more consistent ...

2017-06-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17758
  
**[Test build #79011 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79011/testReport)**
 for PR 17758 at commit 
[`a874fcc`](https://github.com/apache/spark/commit/a874fcc776cc97480e55a92e6ae4193d4c71c72d).





[GitHub] spark pull request #18496: [SparkR][SPARK-20307]:SparkR: pass on setHandleIn...

2017-06-30 Thread wangmiao1981
GitHub user wangmiao1981 opened a pull request:

https://github.com/apache/spark/pull/18496

[SparkR][SPARK-20307]:SparkR: pass on setHandleInvalid to spark.mllib 
functions that use StringIndexer

## What changes were proposed in this pull request?

For the randomForest classifier, if the test data contains unseen labels, it will 
throw an error. The StringIndexer already has the handleInvalid logic. This 
patch adds a new method to set the underlying StringIndexer handleInvalid logic.

This patch should also apply to other classifiers. This PR focuses on the 
main logic and the randomForest classifier. I will open follow-up PRs for the 
other classifiers.
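
To make the unseen-label behaviour concrete, here is a minimal pure-Python sketch of the three `handleInvalid` modes (`error`, `skip`, `keep`). `ToyStringIndexer` is a hypothetical stand-in, not Spark's `StringIndexer`: it indexes labels in order of first appearance, which is an assumption made for simplicity and differs from Spark's default frequency ordering.

```python
class ToyStringIndexer:
    """Toy label indexer (hypothetical; not Spark's StringIndexer)."""

    def __init__(self, handle_invalid="error"):
        if handle_invalid not in ("error", "skip", "keep"):
            raise ValueError(f"unsupported mode: {handle_invalid}")
        self.handle_invalid = handle_invalid
        self.index = {}

    def fit(self, labels):
        # Assign indices in order of first appearance (simplifying assumption).
        self.index = {}
        for label in labels:
            self.index.setdefault(label, len(self.index))
        return self

    def transform(self, labels):
        out = []
        for label in labels:
            if label in self.index:
                out.append(self.index[label])
            elif self.handle_invalid == "error":
                raise ValueError(f"unseen label: {label}")
            elif self.handle_invalid == "keep":
                # All unseen labels share one extra bucket after the known ones.
                out.append(len(self.index))
            # "skip": drop the row with the unseen label.
        return out


indexer = ToyStringIndexer(handle_invalid="keep").fit(["a", "b", "a"])
print(indexer.transform(["a", "c", "b"]))
```

With `error`, the same call raises instead, which corresponds to the failure described in the JIRA.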

## How was this patch tested?

Add a new unit test based on the error case in the JIRA.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangmiao1981/spark handle

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18496.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18496


commit a2cdf511f6ad346efcb81d51f3b805a34063fa0f
Author: wangmiao1981 
Date:   2017-07-01T04:00:27Z

handle unseen labels







[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...

2017-06-30 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/17451#discussion_r125154011
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -2869,6 +2871,20 @@ def findSynonyms(self, word, num):
 word = _convert_to_vector(word)
 return self._call_java("findSynonyms", word, num)
 
+@since("2.2.0")
+def findSynonymsArray(self, word, num):
+"""
+Find "num" number of words closest in similarity to "word".
+word can be a string or vector representation.
--- End diff --

can you add a test for the vector representation as well?





[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...

2017-06-30 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/17451#discussion_r125154035
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala 
---
@@ -274,6 +274,29 @@ class Word2VecModel private[ml] (
 wordVectors.findSynonyms(word, num)
   }
 
+  /**
--- End diff --

this seems a little weird, it feels like it would be more natural to call 
the `findSynonymsArray` from python then do the map in Python, but I guess this 
might be a little faster





[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...

2017-06-30 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/17451#discussion_r125154018
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -2869,6 +2871,20 @@ def findSynonyms(self, word, num):
 word = _convert_to_vector(word)
 return self._call_java("findSynonyms", word, num)
 
+@since("2.2.0")
+def findSynonymsArray(self, word, num):
+"""
+Find "num" number of words closest in similarity to "word".
+word can be a string or vector representation.
+Returns an array with two fields word and similarity (which
+gives the cosine similarity).
+"""
+if not isinstance(word, basestring):
+word = _convert_to_vector(word)
+tupleOfArray = self._call_java("findSynonymsTuple", word, num)
+arrayOfTuple = list(zip(tupleOfArray._1(), tupleOfArray._2()))
--- End diff --

I'm glad this approach worked.
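
For context, the last line of the quoted diff turns a pair of parallel arrays into a list of `(word, similarity)` pairs. A standalone sketch of that step, using plain Python lists in place of the JVM-side tuple (the function name and sample values are hypothetical):

```python
def tuple_of_arrays_to_array_of_tuples(words, similarities):
    """Pair each word with its similarity score, preserving order."""
    if len(words) != len(similarities):
        # zip() would silently truncate, so fail loudly instead.
        raise ValueError("words and similarities must have the same length")
    return list(zip(words, similarities))


pairs = tuple_of_arrays_to_array_of_tuples(["b", "c"], [0.25, 0.5])
print(pairs)
```

Wrapping `zip` in `list` matters on Python 3, where `zip` returns a lazy iterator.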





[GitHub] spark pull request #18479: [SPARK-21273][SQL] Propagate logical plan stats u...

2017-06-30 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18479#discussion_r125153313
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlanVisitor.scala
 ---
@@ -0,0 +1,87 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.plans.logical
+
+/**
+ * A visitor pattern for traversing a [[LogicalPlan]] tree and compute 
some properties.
+ */
+trait LogicalPlanVisitor[T] {
+
+  def visit(p: LogicalPlan): T = p match {
+case p: Aggregate => visitAggregate(p)
+case p: Distinct => visitDistinct(p)
+case p: Except => visitExcept(p)
+case p: Expand => visitExpand(p)
+case p: Filter => visitFilter(p)
+case p: Generate => visitGenerate(p)
+case p: GlobalLimit => visitGlobalLimit(p)
+case p: Intersect => visitIntersect(p)
+case p: Join => visitJoin(p)
+case p: LocalLimit => visitLocalLimit(p)
+case p: Pivot => visitPivot(p)
+case p: Project => visitProject(p)
+case p: Range => visitRange(p)
+case p: Repartition => visitRepartition(p)
+case p: RepartitionByExpression => visitRepartitionByExpr(p)
+case p: Sample => visitSample(p)
+case p: ScriptTransformation => visitScriptTransform(p)
+case p: Union => visitUnion(p)
+case p: ResolvedHint => visitHint(p)
--- End diff --

It looks like they are sorted by the names of the logical operators. We can 
adjust the order later.





[GitHub] spark pull request #18479: [SPARK-21273][SQL] Propagate logical plan stats u...

2017-06-30 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/18479#discussion_r125153296
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlanVisitor.scala
 ---
@@ -0,0 +1,87 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.plans.logical
+
+/**
+ * A visitor pattern for traversing a [[LogicalPlan]] tree and compute 
some properties.
+ */
+trait LogicalPlanVisitor[T] {
+
+  def visit(p: LogicalPlan): T = p match {
+case p: Aggregate => visitAggregate(p)
+case p: Distinct => visitDistinct(p)
+case p: Except => visitExcept(p)
+case p: Expand => visitExpand(p)
+case p: Filter => visitFilter(p)
+case p: Generate => visitGenerate(p)
+case p: GlobalLimit => visitGlobalLimit(p)
+case p: Intersect => visitIntersect(p)
+case p: Join => visitJoin(p)
+case p: LocalLimit => visitLocalLimit(p)
+case p: Pivot => visitPivot(p)
+case p: Project => visitProject(p)
+case p: Range => visitRange(p)
+case p: Repartition => visitRepartition(p)
+case p: RepartitionByExpression => visitRepartitionByExpr(p)
+case p: Sample => visitSample(p)
+case p: ScriptTransformation => visitScriptTransform(p)
+case p: Union => visitUnion(p)
+case p: ResolvedHint => visitHint(p)
+case p: LogicalPlan => default(p)
--- End diff --

Since `LogicalPlan` already covers all the other cases, it is fine to cover 
the limited operators in the current stage.
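
The dispatch-with-fallback shape discussed here can be sketched in Python as follows. This is an illustration only: the class and method names are hypothetical, and name-based lookup stands in for Scala's pattern match, with `default` playing the role of the final `case p: LogicalPlan`.

```python
class LogicalPlan:
    """Hypothetical base class for plan nodes."""


class Filter(LogicalPlan):
    pass


class Project(LogicalPlan):
    pass


class Window(LogicalPlan):
    """An operator the visitor has no dedicated method for."""


class PlanVisitor:
    def visit(self, plan):
        # Look up visit_<classname>; fall back to default() when absent.
        name = "visit_" + type(plan).__name__.lower()
        return getattr(self, name, self.default)(plan)

    def default(self, plan):
        raise NotImplementedError(type(plan).__name__)


class DescribeVisitor(PlanVisitor):
    def visit_filter(self, plan):
        return "filter"

    def visit_project(self, plan):
        return "project"

    def default(self, plan):
        return "default:" + type(plan).__name__


visitor = DescribeVisitor()
print([visitor.visit(p) for p in (Filter(), Project(), Window())])
```

Because every unmatched node reaches `default`, new operators can be given dedicated methods later without breaking existing visitors.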





[GitHub] spark issue #18388: [SPARK-21175] Reject OpenBlocks when memory shortage on ...

2017-06-30 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/18388
  
cc @zsxwing how strict are our requirements for shuffle service compatibility?





[GitHub] spark issue #18388: [SPARK-21175] Reject OpenBlocks when memory shortage on ...

2017-06-30 Thread jinxing64
Github user jinxing64 commented on the issue:

https://github.com/apache/spark/pull/18388
  
Yes, there is a change. The server side may return `OpenBlocksFailed` for the 
"open blocks" request, which means that an old client is not compatible with a 
new server. Is that acceptable?





[GitHub] spark issue #18482: [SPARK-21262] Stop sending 'stream request' when shuffle...

2017-06-30 Thread jinxing64
Github user jinxing64 commented on the issue:

https://github.com/apache/spark/pull/18482
  
In the current change, it is fetching a big chunk into memory, then writing it 
to disk, and then releasing the memory. I made this change for the reasons below:
1. The client shouldn't break the old shuffle service, and thus cannot send a 
"stream request" to the server. We have to send `ChunkFetchRequest` and handle 
`ChunkFetchSuccess` as the response.
2. It's hard to make `ChunkFetchSuccess` a stream and read it to 
disk. We would need to implement another `TransportFrameDecoder`, which I think 
costs too much.





[GitHub] spark issue #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...

2017-06-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18444
  
**[Test build #79008 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79008/testReport)**
 for PR 18444 at commit 
[`930d16b`](https://github.com/apache/spark/commit/930d16bc26128584bcd6e1194f1340f1bba86fc9).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18301: [SPARK-21052][SQL] Add hash map metrics to join

2017-06-30 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/18301
  

https://github.com/apache/spark/blob/fd1325522549937232f37215db53d6478f48644c/core/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java#L473

In a probe at L473, if the slot pointed to by the hash code is not empty, 
it's possible that there is a hash collision (equal hash codes, different keys), 
but it's also possible that the slot is occupied by a key with a different hash 
(the if condition at L475 is false). In this case, we continue to look for an 
empty slot by going forward by `step` at L492, and increase the number of probes.
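
The two situations described above can be reproduced with a toy open-addressing map. This is a sketch under stated assumptions, not the `BytesToBytesMap` implementation: `ProbingMap`, its growing-step probe sequence, and the injectable `hash_fn` are all hypothetical choices made to keep the example small.

```python
class ProbingMap:
    """Toy open-addressing hash map that counts probes (hypothetical)."""

    def __init__(self, capacity=8, hash_fn=hash):
        # A power-of-two capacity lets `& mask` wrap positions around.
        if capacity < 2 or capacity & (capacity - 1):
            raise ValueError("capacity must be a power of two >= 2")
        self.mask = capacity - 1
        self.slots = [None] * capacity  # each slot: (hash, key, value)
        self.size = 0
        self.num_probes = 0
        self.hash_fn = hash_fn

    def _find(self, key, h):
        pos, step = h & self.mask, 1
        while True:
            self.num_probes += 1
            slot = self.slots[pos]
            if slot is None or (slot[0] == h and slot[1] == key):
                return pos
            # Occupied by another key: either a true collision (equal
            # hash, different key) or a different hash sharing this slot.
            pos = (pos + step) & self.mask
            step += 1

    def put(self, key, value):
        h = self.hash_fn(key)
        pos = self._find(key, h)
        if self.slots[pos] is None:
            # Keep one slot free so that _find always terminates.
            if self.size + 1 > self.mask:
                raise OverflowError("map is full")
            self.size += 1
        self.slots[pos] = (h, key, value)

    def get(self, key, default=None):
        slot = self.slots[self._find(key, self.hash_fn(key))]
        return default if slot is None else slot[2]


# "a" and "b" collide on hash 0; "c" has hash 8, which maps to slot 0 too.
m = ProbingMap(capacity=8, hash_fn={"a": 0, "b": 0, "c": 8}.get)
for value, key in enumerate("abc", start=1):
    m.put(key, value)
print(m.num_probes)
```

Counting a probe for every slot inspected, including the first, is one possible convention; a metric that counts only the extra steps would report smaller numbers.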





[GitHub] spark issue #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...

2017-06-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18444
  
Merged build finished. Test FAILed.





[GitHub] spark issue #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...

2017-06-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18444
  
**[Test build #79006 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79006/testReport)**
 for PR 18444 at commit 
[`bd8e111`](https://github.com/apache/spark/commit/bd8e11176fb11eb1b7333439fa19ae2f19eaab76).
 * This patch **fails Python style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...

2017-06-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18444
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79006/
Test FAILed.





[GitHub] spark issue #18444: [SPARK-16542][SQL][PYSPARK] Fix bugs about types that re...

2017-06-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18444
  
**[Test build #79006 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79006/testReport)**
 for PR 18444 at commit 
[`bd8e111`](https://github.com/apache/spark/commit/bd8e11176fb11eb1b7333439fa19ae2f19eaab76).





[GitHub] spark issue #18307: [SPARK-21100][SQL] describe should give quartiles simila...

2017-06-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18307
  
**[Test build #79007 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79007/testReport)**
 for PR 18307 at commit 
[`cba1b0e`](https://github.com/apache/spark/commit/cba1b0e6c3ac32f7cb327ead54f6e8307aed00ac).





[GitHub] spark issue #18023: [SPARK-12139] [SQL] REGEX Column Specification

2017-06-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18023
  
Merged build finished. Test FAILed.





[GitHub] spark issue #18023: [SPARK-12139] [SQL] REGEX Column Specification

2017-06-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18023
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79004/
Test FAILed.





[GitHub] spark issue #18023: [SPARK-12139] [SQL] REGEX Column Specification

2017-06-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18023
  
**[Test build #79004 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79004/testReport)**
 for PR 18023 at commit 
[`448c3e2`](https://github.com/apache/spark/commit/448c3e2d200ad9530cfd43e8200afc7b7b7f1469).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #18334: [SPARK-21127] [SQL] Update statistics after data ...

2017-06-30 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/18334





[GitHub] spark issue #18334: [SPARK-21127] [SQL] Update statistics after data changin...

2017-06-30 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/18334
  
thanks, merging to master!





[GitHub] spark pull request #18159: [SPARK-20703][SQL] Associate metrics with data wr...

2017-06-30 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18159#discussion_r125151880
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/commands.scala 
---
@@ -19,24 +19,65 @@ package org.apache.spark.sql.execution.command
 
 import java.util.UUID
 
+import org.apache.spark.SparkContext
 import org.apache.spark.rdd.RDD
 import org.apache.spark.sql.{Row, SparkSession}
 import org.apache.spark.sql.catalyst.{CatalystTypeConverters, InternalRow}
 import org.apache.spark.sql.catalyst.errors.TreeNodeException
 import org.apache.spark.sql.catalyst.expressions.{Attribute, 
AttributeReference}
 import org.apache.spark.sql.catalyst.plans.{logical, QueryPlan}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
-import org.apache.spark.sql.execution.SparkPlan
+import org.apache.spark.sql.execution.{SparkPlan, SQLExecution}
+import org.apache.spark.sql.execution.datasources.ExecutedWriteSummary
 import org.apache.spark.sql.execution.debug._
+import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics}
 import org.apache.spark.sql.execution.streaming.{IncrementalExecution, 
OffsetSeqMetadata}
 import org.apache.spark.sql.streaming.OutputMode
 import org.apache.spark.sql.types._
+import org.apache.spark.util.Utils
 
 /**
  * A logical command that is executed for its side-effects.  
`RunnableCommand`s are
  * wrapped in `ExecutedCommand` during execution.
  */
 trait RunnableCommand extends logical.Command {
+
+  // The map used to record the metrics of running the command. This will 
be passed to
+  // `ExecutedCommand` during query planning.
+  private[sql] lazy val metrics: Map[String, SQLMetric] = Map.empty
+
+  /**
+   * Callback function that update metrics collected from the writing 
operation.
+   */
+  protected def callbackMetricsUpdater(writeSummaries: 
Seq[ExecutedWriteSummary]): Unit = {
--- End diff --

I think it's more reasonable to do this in `InsertIntoHadoopFsRelation`. If 
you are worried about duplicated code, we can create a trait for 
`InsertIntoHadoopFsRelation` and `InsertIntoHive`.
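
A rough sketch of the shared-trait idea, written in Java for illustration (in the Scala code base this would be a trait; a Java interface with a default method plays the same role here). Every name below (`WriteSummarySketch`, `WriteMetricsUpdater`, `FsInsertSketch`, `HiveInsertSketch`, and the metric keys) is hypothetical and not taken from the PR.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a per-task write summary.
final class WriteSummarySketch {
    final long numFiles;
    final long numBytes;

    WriteSummarySketch(long numFiles, long numBytes) {
        this.numFiles = numFiles;
        this.numBytes = numBytes;
    }
}

// Shared behaviour: both insert commands reuse one metrics-updating callback.
interface WriteMetricsUpdater {
    Map<String, Long> metrics();

    // Adds the totals from all summaries to the command's metrics map.
    default void updateWriteMetrics(List<WriteSummarySketch> summaries) {
        long files = 0;
        long bytes = 0;
        for (WriteSummarySketch s : summaries) {
            files += s.numFiles;
            bytes += s.numBytes;
        }
        metrics().merge("numFiles", files, Long::sum);
        metrics().merge("numBytes", bytes, Long::sum);
    }
}

// Two hypothetical commands that mix in the shared behaviour.
final class FsInsertSketch implements WriteMetricsUpdater {
    private final Map<String, Long> metrics = new HashMap<>();

    @Override
    public Map<String, Long> metrics() {
        return metrics;
    }
}

final class HiveInsertSketch implements WriteMetricsUpdater {
    private final Map<String, Long> metrics = new HashMap<>();

    @Override
    public Map<String, Long> metrics() {
        return metrics;
    }
}
```

With this shape the callback lives next to the write commands that need it, and other runnable commands carry no metrics code at all.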





[GitHub] spark pull request #18159: [SPARK-20703][SQL] Associate metrics with data wr...

2017-06-30 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18159#discussion_r125151815
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/commands.scala 
---
@@ -19,24 +19,65 @@ package org.apache.spark.sql.execution.command
 
 import java.util.UUID
 
+import org.apache.spark.SparkContext
 import org.apache.spark.rdd.RDD
 import org.apache.spark.sql.{Row, SparkSession}
 import org.apache.spark.sql.catalyst.{CatalystTypeConverters, InternalRow}
 import org.apache.spark.sql.catalyst.errors.TreeNodeException
 import org.apache.spark.sql.catalyst.expressions.{Attribute, 
AttributeReference}
 import org.apache.spark.sql.catalyst.plans.{logical, QueryPlan}
 import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
-import org.apache.spark.sql.execution.SparkPlan
+import org.apache.spark.sql.execution.{SparkPlan, SQLExecution}
+import org.apache.spark.sql.execution.datasources.ExecutedWriteSummary
 import org.apache.spark.sql.execution.debug._
+import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics}
 import org.apache.spark.sql.execution.streaming.{IncrementalExecution, 
OffsetSeqMetadata}
 import org.apache.spark.sql.streaming.OutputMode
 import org.apache.spark.sql.types._
+import org.apache.spark.util.Utils
 
 /**
  * A logical command that is executed for its side-effects.  
`RunnableCommand`s are
  * wrapped in `ExecutedCommand` during execution.
  */
 trait RunnableCommand extends logical.Command {
+
+  // The map used to record the metrics of running the command. This will 
be passed to
+  // `ExecutedCommand` during query planning.
+  private[sql] lazy val metrics: Map[String, SQLMetric] = Map.empty
--- End diff --

usually we don't need `private[sql]` under the `execution` package.





[GitHub] spark pull request #18483: [SPARK-17528][SQL] data should be copied properly...

2017-06-30 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/18483





[GitHub] spark issue #18483: [SPARK-17528][SQL] data should be copied properly before...

2017-06-30 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/18483
  
thanks for the review, merging to master!





[GitHub] spark issue #18388: [SPARK-21175] Reject OpenBlocks when memory shortage on ...

2017-06-30 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/18388
  
Does this patch require a server-side change?





[GitHub] spark issue #18495: [SPARK-21275][ML] Update GLM test to use supportedFamily...

2017-06-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18495
  
Merged build finished. Test PASSed.





[GitHub] spark issue #18495: [SPARK-21275][ML] Update GLM test to use supportedFamily...

2017-06-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18495
  
**[Test build #79005 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79005/testReport)**
 for PR 18495 at commit 
[`4fe7641`](https://github.com/apache/spark/commit/4fe7641c200dffe416ef6bd84c87f778bba5c799).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #18495: [SPARK-21275][ML] Update GLM test to use supportedFamily...

2017-06-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18495
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79005/
Test PASSed.





[GitHub] spark issue #18482: [SPARK-21262] Stop sending 'stream request' when shuffle...

2017-06-30 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/18482
  
Does this mean we have to fetch big chunks into memory and then write them to 
disk?





[GitHub] spark issue #18480: [SPARK-21052][SQL][Follow-up] Add hash map metrics to jo...

2017-06-30 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/18480
  
Thanks @gatorsmile.





[GitHub] spark issue #18476: [SPARK-20858][DOC][MINOR] Document ListenerBus event que...

2017-06-30 Thread sadikovi
Github user sadikovi commented on the issue:

https://github.com/apache/spark/pull/18476
  
@JoshRosen Thank you for the comment! I updated the config option name to 
reflect the changes in the master branch.





[GitHub] spark issue #18023: [SPARK-12139] [SQL] REGEX Column Specification

2017-06-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18023
  
Merged build finished. Test PASSed.





[GitHub] spark issue #18023: [SPARK-12139] [SQL] REGEX Column Specification

2017-06-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18023
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/79003/
Test PASSed.





[GitHub] spark issue #18023: [SPARK-12139] [SQL] REGEX Column Specification

2017-06-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18023
  
**[Test build #79003 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79003/testReport)**
 for PR 18023 at commit 
[`4e36ed9`](https://github.com/apache/spark/commit/4e36ed903973dcf637348825b5726892f2c13f77).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class UnresolvedRegex(regexPattern: String, table: 
Option[String], caseSensitive: Boolean)`





[GitHub] spark issue #18436: [SPARK-20073][SQL] Prints an explicit warning message in...

2017-06-30 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/18436
  
ok, done





[GitHub] spark issue #18465: [SPARK-21093][R] Terminate R's worker processes in the p...

2017-06-30 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/18465
  
FWIW, I don't know for sure, but I guess it happens randomly in the middle of 
any test. My wild guess is that it is related to triggering many tests (or maybe 
rebasing a lot to trigger the build). When it happened once, I saw it break all 
other builds with -9 in all other PRs. Recently I mapped the unknown codes to be 
printed directly, as above, because I don't know the reason.





[GitHub] spark issue #18495: [SPARK-21275][ML] Update GLM test to use supportedFamily...

2017-06-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18495
  
**[Test build #79005 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79005/testReport)**
 for PR 18495 at commit 
[`4fe7641`](https://github.com/apache/spark/commit/4fe7641c200dffe416ef6bd84c87f778bba5c799).





[GitHub] spark pull request #18476: [SPARK-20858][DOC][MINOR] Document ListenerBus ev...

2017-06-30 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request:

https://github.com/apache/spark/pull/18476#discussion_r125149417
  
--- Diff: docs/configuration.md ---
@@ -1398,6 +1398,15 @@ Apart from these, the following properties are also 
available, and may be useful
   
 
 
+  spark.scheduler.listenerbus.eventqueue.size
--- End diff --

If you're only documenting this in master then please use 
`spark.scheduler.listenerbus.eventqueue.capacity` instead (see definition in 
code and git blame for explanation).





[GitHub] spark pull request #18495: [SPARK-21275][ML] Update GLM test to use supporte...

2017-06-30 Thread actuaryzhang
GitHub user actuaryzhang opened a pull request:

https://github.com/apache/spark/pull/18495

[SPARK-21275][ML] Update GLM test to use supportedFamilyNames

## What changes were proposed in this pull request?
Update GLM test to use supportedFamilyNames as suggested here:
https://github.com/apache/spark/pull/16699#discussion-diff-100574976R855

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/actuaryzhang/spark mlGlmTest2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18495.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18495


commit 4fe7641c200dffe416ef6bd84c87f778bba5c799
Author: actuaryzhang 
Date:   2017-07-01T00:12:55Z

Update GLM test to use supportedFamilyNames







[GitHub] spark issue #18495: [SPARK-21275][ML] Update GLM test to use supportedFamily...

2017-06-30 Thread actuaryzhang
Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/18495
  
@yanboliang 





[GitHub] spark issue #11106: [SPARK-13225] [SQL] Support Intersect All/Distinct [WIP]

2017-06-30 Thread Tagar
Github user Tagar commented on the issue:

https://github.com/apache/spark/pull/11106
  
another possible way to implement INTERSECT ALL
https://issues.apache.org/jira/browse/SPARK-21274





[GitHub] spark issue #18459: [SPARK-13534][PYSPARK] Using Apache Arrow to increase pe...

2017-06-30 Thread BryanCutler
Github user BryanCutler commented on the issue:

https://github.com/apache/spark/pull/18459
  
Great, thanks @shaneknapp! Just let me know what I can do to help

On Jun 30, 2017 4:52 PM, "shane" wrote:

> i won't have time to think about and do something until monday... but i
> have some ideas.
>
> On Fri, Jun 30, 2017 at 4:29 PM, Bryan Cutler wrote:
>
> > Thanks for checking on that Wes! @shaneknapp and @holdenk
> > I definitely don't want you to go through dependency hell... 👎 I'm not
> > too sure how to resolve this since I have no access to Jenkins. At this
> > point would it be better to try setting these tests up through a virtual
> > env again?






[GitHub] spark issue #18023: [SPARK-12139] [SQL] REGEX Column Specification

2017-06-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18023
  
**[Test build #79004 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79004/testReport)**
 for PR 18023 at commit 
[`448c3e2`](https://github.com/apache/spark/commit/448c3e2d200ad9530cfd43e8200afc7b7b7f1469).





[GitHub] spark issue #18459: [SPARK-13534][PYSPARK] Using Apache Arrow to increase pe...

2017-06-30 Thread shaneknapp
Github user shaneknapp commented on the issue:

https://github.com/apache/spark/pull/18459
  
i won't have time to think about and do something until monday...  but i
have some ideas.

On Fri, Jun 30, 2017 at 4:29 PM, Bryan Cutler wrote:

> Thanks for checking on that Wes! @shaneknapp and @holdenk
> I definitely don't want you to go through dependency hell... 👎 I'm not
> too sure how to resolve this since I have no access to Jenkins. At this
> point would it be better to try setting these tests up through a virtual
> env again?






[GitHub] spark issue #14431: [SPARK-16258][SparkR] Automatically append the grouping ...

2017-06-30 Thread NarineK
Github user NarineK commented on the issue:

https://github.com/apache/spark/pull/14431
  
I think @falaki's approach is good. My only concern is that passing the key as 
an argument alongside x as input to the function seems a little superfluous.





[GitHub] spark pull request #18023: [SPARK-12139] [SQL] REGEX Column Specification

2017-06-30 Thread janewangfb
Github user janewangfb commented on a diff in the pull request:

https://github.com/apache/spark/pull/18023#discussion_r125147853
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -847,6 +847,11 @@ object SQLConf {
   .intConf
   .createWithDefault(UnsafeExternalSorter.DEFAULT_NUM_ELEMENTS_FOR_SPILL_THRESHOLD.toInt)
 
+  val SUPPORT_QUOTED_REGEX_COLUMN_NAME = buildConf("spark.sql.parser.quotedRegexColumnNames")
+    .doc("When true, a SELECT statement can take regex-based column specification.")
--- End diff --

updated
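
For readers skimming the thread, the doc string in the diff above says that a SELECT statement can take a regex-based column specification when `spark.sql.parser.quotedRegexColumnNames` is true. The snippet below is only a plain-Scala sketch of that idea and not Spark's parser: `QuotedRegexColumns` is a hypothetical helper, and the assumption that the back-quoted pattern must match the whole column name is mine, not something stated in the PR.

```scala
// Hypothetical helper, NOT Spark's implementation: resolve a back-quoted
// regex against the available column names, keeping their original order.
object QuotedRegexColumns {
  def select(columns: Seq[String], quoted: String): Seq[String] = {
    // Strip the surrounding backticks if they are present.
    val pattern = quoted.stripPrefix("`").stripSuffix("`")
    // Assumption: the whole column name has to match the pattern.
    columns.filter(_.matches(pattern))
  }
}

object QuotedRegexColumnsDemo extends App {
  val columns = Seq("key", "value1", "value2")
  // Picks every column whose name starts with "value".
  println(QuotedRegexColumns.select(columns, "`value.*`"))
}
```

A plain name such as `key` still selects exactly that column in this sketch, because a literal name is also a valid regex that matches itself.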





[GitHub] spark issue #18465: [SPARK-21093][R] Terminate R's worker processes in the p...

2017-06-30 Thread shivaram
Github user shivaram commented on the issue:

https://github.com/apache/spark/pull/18465
  
@felixcheung are these failures coming from the gapply tests? Also, do 
we have a way to map the error code to an error reason?





[GitHub] spark issue #18459: [SPARK-13534][PYSPARK] Using Apache Arrow to increase pe...

2017-06-30 Thread BryanCutler
Github user BryanCutler commented on the issue:

https://github.com/apache/spark/pull/18459
  
Thanks for checking on that Wes!  @shaneknapp and @holdenk I definitely 
don't want you to go through dependency hell... :-1:   I'm not too sure how to 
resolve this since I have no access to Jenkins.  At this point would it be 
better to try setting these tests up through a virtual env again?





[GitHub] spark issue #16158: [SPARK-18724][ML] Add TuningSummary for TrainValidationS...

2017-06-30 Thread hhbyyh
Github user hhbyyh commented on the issue:

https://github.com/apache/spark/pull/16158
  
@MLnick Thanks for your attention. I'm not sure whether SPARK-19053 is still 
active, and it may not be a blocker for this change. If you don't mind, 
I'll extend the JIRA/PR scope to cover CrossValidator as well, so the 
improvement is integrated.





[GitHub] spark pull request #18307: [SPARK-21100][SQL] describe should give quartiles...

2017-06-30 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/18307#discussion_r125146093
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -2205,37 +2205,151 @@ class Dataset[T] private[sql](
*   // max 92.0  192.0
* }}}
*
+   * See also [[summary]]
+   *
+   * @param cols Columns to compute statistics on.
+   *
* @group action
* @since 1.6.0
*/
   @scala.annotation.varargs
-  def describe(cols: String*): DataFrame = withPlan {
+  def describe(cols: String*): DataFrame = {
+    val selected = if (cols.isEmpty) this else select(cols.head, cols.tail: _*)
+    selected.summary("count", "mean", "stddev", "min", "max")
+  }
+
+  /**
+   * Computes specified statistics for numeric and string columns. Available statistics are:
--- End diff --

I'd also give an example of how to compute summary for specific columns.





[GitHub] spark pull request #18307: [SPARK-21100][SQL] describe should give quartiles...

2017-06-30 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/18307#discussion_r125146112
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -2205,37 +2205,151 @@ class Dataset[T] private[sql](
*   // max 92.0  192.0
* }}}
*
+   * See also [[summary]]
+   *
+   * @param cols Columns to compute statistics on.
+   *
* @group action
* @since 1.6.0
*/
   @scala.annotation.varargs
-  def describe(cols: String*): DataFrame = withPlan {
+  def describe(cols: String*): DataFrame = {
+    val selected = if (cols.isEmpty) this else select(cols.head, cols.tail: _*)
+    selected.summary("count", "mean", "stddev", "min", "max")
+  }
+
+  /**
+   * Computes specified statistics for numeric and string columns. Available statistics are:
--- End diff --

and explain the difference between describe and summary (basically summary 
seems easier to use).






[GitHub] spark pull request #18307: [SPARK-21100][SQL] describe should give quartiles...

2017-06-30 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/18307#discussion_r125146063
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -2205,37 +2205,151 @@ class Dataset[T] private[sql](
*   // max 92.0  192.0
* }}}
*
+   * See also [[summary]]
+   *
+   * @param cols Columns to compute statistics on.
+   *
* @group action
* @since 1.6.0
*/
   @scala.annotation.varargs
-  def describe(cols: String*): DataFrame = withPlan {
+  def describe(cols: String*): DataFrame = {
+    val selected = if (cols.isEmpty) this else select(cols.head, cols.tail: _*)
+    selected.summary("count", "mean", "stddev", "min", "max")
+  }
+
+  /**
+   * Computes specified statistics for numeric and string columns. Available statistics are:
+   *
+   * - count
+   * - mean
+   * - stddev
+   * - min
+   * - max
+   * - arbitrary approximate percentiles specified as a percentage (eg, 75%)
+   *
+   * If no statistics are given, this function computes count, mean, stddev, min,
+   * approximate quartiles, and max.
+   *
+   * This function is meant for exploratory data analysis, as we make no guarantee about the
+   * backward compatibility of the schema of the resulting Dataset. If you want to
+   * programmatically compute summary statistics, use the `agg` function instead.
+   *
+   * {{{
+   *   ds.summary().show()
+   *
+   *   // output:
+   *   // summary age   height
+   *   // count   10.0  10.0
+   *   // mean    53.3  178.05
+   *   // stddev  11.6  15.7
+   *   // min     18.0  163.0
+   *   // 25%     24.0  176.0
+   *   // 50%     24.0  176.0
+   *   // 75%     32.0  180.0
+   *   // max     92.0  192.0
+   * }}}
+   *
+   * {{{
+   *   ds.summary("count", "min", "25%", "75%", "max").show()
+   *
+   *   // output:
+   *   // summary age   height
+   *   // count   10.0  10.0
+   *   // min     18.0  163.0
+   *   // 25%     24.0  176.0
+   *   // 75%     32.0  180.0
+   *   // max     92.0  192.0
+   * }}}
+   *
+   * @param statistics Statistics from above list to be computed.
+   *
+   * @group action
+   * @since 2.3.0
+   */
+  @scala.annotation.varargs
+  def summary(statistics: String*): DataFrame = withPlan {
--- End diff --

Can we move the implementation into 
org.apache.spark.sql.execution.stat.StatFunctions? I worry Dataset is getting 
too long. It should mostly be an interface that delegates, with most of the 
implementation living elsewhere.
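
The delegation being asked for here can be sketched in plain Scala. Everything below is illustrative and makes no claim about the eventual Spark code: `StatFunctionsSketch` and `TableSketch` are hypothetical names standing in for `StatFunctions` and `Dataset`, columns are modelled as named sequences of doubles, and `stddev` and percentiles are left out to keep the sketch short. It also shows `describe` as `summary` with a fixed list of statistics over selected columns, mirroring the diff.

```scala
// Hypothetical stand-in for a helper object: the statistics logic lives
// here, outside the user-facing class.
object StatFunctionsSketch {
  // One output row per requested statistic, one value per column.
  def summary(
      data: Seq[(String, Seq[Double])],
      statistics: Seq[String]): Seq[(String, Seq[Double])] = {
    // Assumption for this sketch: default to four basic statistics.
    val stats =
      if (statistics.isEmpty) Seq("count", "mean", "min", "max") else statistics
    stats.map { name =>
      val f: Seq[Double] => Double = name match {
        case "count" => xs => xs.size.toDouble
        case "mean"  => xs => xs.sum / xs.size
        case "min"   => xs => xs.min
        case "max"   => xs => xs.max
        case other =>
          throw new IllegalArgumentException(s"Unknown statistic: $other")
      }
      name -> data.map { case (_, values) => f(values) }
    }
  }
}

// Thin user-facing wrapper: it only selects columns and delegates.
final case class TableSketch(columns: Seq[(String, Seq[Double])]) {
  def select(names: String*): TableSketch =
    TableSketch(names.map { n =>
      columns.find(_._1 == n).getOrElse(
        throw new IllegalArgumentException(s"Unknown column: $n"))
    })

  def summary(statistics: String*): Seq[(String, Seq[Double])] =
    StatFunctionsSketch.summary(columns, statistics)

  // describe is summary with a fixed statistics list over chosen columns.
  def describe(cols: String*): Seq[(String, Seq[Double])] = {
    val selected = if (cols.isEmpty) this else select(cols: _*)
    selected.summary("count", "mean", "min", "max")
  }
}

object DelegationDemo extends App {
  val table = TableSketch(Seq(
    "age" -> Seq(18.0, 24.0, 92.0),
    "height" -> Seq(163.0, 176.0, 192.0)))
  // Summary for a specific column: select first, then summarize.
  println(table.select("age").summary("count", "max"))
  println(table.describe("height"))
}
```

With this split, the wrapper stays a few lines long however many statistics are supported, and selecting columns before calling `summary` is how a summary for specific columns is obtained in the sketch.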





