[GitHub] spark pull request #19188: [SPARK-21973][SQL] Add an new option to filter qu...

2017-09-11 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/19188#discussion_r138009182
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/TPCDSQueryBenchmark.scala
 ---
@@ -113,12 +114,40 @@ object TPCDSQueryBenchmark {
   "q81", "q82", "q83", "q84", "q85", "q86", "q87", "q88", "q89", "q90",
   "q91", "q92", "q93", "q94", "q95", "q96", "q97", "q98", "q99")
 
+val sparkConf = new SparkConf()
+  .setMaster("local[1]")
+  .setAppName("test-sql-context")
+  .set("spark.sql.parquet.compression.codec", "snappy")
+  .set("spark.sql.shuffle.partitions", "4")
+  .set("spark.driver.memory", "3g")
+  .set("spark.executor.memory", "3g")
+  .set("spark.sql.autoBroadcastJoinThreshold", (20 * 1024 * 
1024).toString)
+  .set("spark.sql.crossJoin.enabled", "true")
+
+// If `spark.sql.tpcds.queryFilter` defined, this class filters the 
queries that
--- End diff --

nit: `class` -> `variable`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19147: [WIP][SPARK-21190][SQL][PYTHON] Vectorized UDFs in Pytho...

2017-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19147
  
**[Test build #81635 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81635/testReport)**
 for PR 19147 at commit 
[`803054e`](https://github.com/apache/spark/commit/803054e9f30b057e1b194b5625cb6216d865f1d4).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19131: [MINOR][SQL]remove unuse import class

2017-09-11 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/19131
  
I'm going to merge it as a cleanup but yeah let's not do these often.
I would favor adding a style check for this if one can be found, but don't 
see it in scalastyle.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19147: [WIP][SPARK-21190][SQL][PYTHON] Vectorized UDFs in Pytho...

2017-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19147
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81629/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19169: [SPARK-21957][SQL] Add current_user function

2017-09-11 Thread mgaido91
Github user mgaido91 commented on the issue:

https://github.com/apache/spark/pull/19169
  
On local I solved the build error by running a `mvn clean`. As pointed out 
by @maropu , a PR removed the class and then the incremental compilation fails. 
I am not sure why this is happening on jenkins though...


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19147: [WIP][SPARK-21190][SQL][PYTHON] Vectorized UDFs i...

2017-09-11 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19147#discussion_r138012166
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala
 ---
@@ -62,6 +62,7 @@ import org.apache.spark.util.Utils
  */
 case class BatchEvalPythonExec(udfs: Seq[PythonUDF], output: 
Seq[Attribute], child: SparkPlan)
--- End diff --

Thanks! Let's see if others have any suggestions for a while.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15178: [SPARK-17556][SQL] Executor side broadcast for broadcast...

2017-09-11 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/15178
  
@rxin Do we still consider to incorporate this broadcast on executor 
feature? Thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18337: [SPARK-21131][GraphX] Fix batch gradient bug in SVDPlusP...

2017-09-11 Thread lxmly
Github user lxmly commented on the issue:

https://github.com/apache/spark/pull/18337
  
which dataset? @daniellaah 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

2017-09-11 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/18538#discussion_r138023290
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala 
---
@@ -0,0 +1,437 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.evaluation
+
+import org.apache.spark.SparkContext
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, 
VectorUDT}
+import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
+import org.apache.spark.ml.util.{DefaultParamsReadable, 
DefaultParamsWritable, Identifiable, SchemaUtils}
+import org.apache.spark.sql.{DataFrame, Dataset}
+import org.apache.spark.sql.functions.{avg, col, udf}
+import org.apache.spark.sql.types.IntegerType
+
+/**
+ * :: Experimental ::
+ * Evaluator for clustering results.
+ * The metric computes the Silhouette measure
+ * using the squared Euclidean distance.
+ *
+ * The Silhouette is a measure for the validation
+ * of the consistency within clusters. It ranges
+ * between 1 and -1, where a value close to 1
+ * means that the points in a cluster are close
+ * to the other points in the same cluster and
+ * far from the points of the other clusters.
+ */
+@Experimental
+@Since("2.3.0")
+class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val 
uid: String)
+  extends Evaluator with HasPredictionCol with HasFeaturesCol with 
DefaultParamsWritable {
+
+  @Since("2.3.0")
+  def this() = this(Identifiable.randomUID("cluEval"))
+
+  @Since("2.3.0")
+  override def copy(pMap: ParamMap): ClusteringEvaluator = 
this.defaultCopy(pMap)
+
+  @Since("2.3.0")
+  override def isLargerBetter: Boolean = true
+
+  /** @group setParam */
+  @Since("2.3.0")
+  def setPredictionCol(value: String): this.type = set(predictionCol, 
value)
+
+  /** @group setParam */
+  @Since("2.3.0")
+  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
+
+  /**
+   * param for metric name in evaluation
+   * (supports `"squaredSilhouette"` (default))
+   * @group param
+   */
+  @Since("2.3.0")
+  val metricName: Param[String] = {
+val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
+new Param(
+  this,
+  "metricName",
+  "metric name in evaluation (squaredSilhouette)",
+  allowedParams
+)
+  }
+
+  /** @group getParam */
+  @Since("2.3.0")
+  def getMetricName: String = $(metricName)
+
+  /** @group setParam */
+  @Since("2.3.0")
+  def setMetricName(value: String): this.type = set(metricName, value)
+
+  setDefault(metricName -> "squaredSilhouette")
+
+  @Since("2.3.0")
+  override def evaluate(dataset: Dataset[_]): Double = {
+SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new 
VectorUDT)
+SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), 
IntegerType)
+
+// Silhouette is reasonable only when the number of clusters is grater 
then 1
+assert(dataset.select($(predictionCol)).distinct().count() > 1,
+  "Number of clusters must be greater than one.")
+
+$(metricName) match {
+  case "squaredSilhouette" => 
SquaredEuclideanSilhouette.computeSilhouetteScore(
+dataset,
+$(predictionCol),
+$(featuresCol)
+  )
+}
+  }
+}
+
+
+@Since("2.3.0")
+object ClusteringEvaluator
+  extends DefaultParamsReadable[ClusteringEvaluator] {
+
+  @Since("2.3.0")
+  override def load(path: String): ClusteringEvaluator = super.load(path)
+
+}
+
+
+/**
+ * SquaredEuclideanSilhouette computes the average of 

[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

2017-09-11 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/18538#discussion_r138027427
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala 
---
@@ -0,0 +1,437 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.evaluation
+
+import org.apache.spark.SparkContext
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, 
VectorUDT}
+import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
+import org.apache.spark.ml.util.{DefaultParamsReadable, 
DefaultParamsWritable, Identifiable, SchemaUtils}
+import org.apache.spark.sql.{DataFrame, Dataset}
+import org.apache.spark.sql.functions.{avg, col, udf}
+import org.apache.spark.sql.types.IntegerType
+
+/**
+ * :: Experimental ::
+ * Evaluator for clustering results.
+ * The metric computes the Silhouette measure
+ * using the squared Euclidean distance.
+ *
+ * The Silhouette is a measure for the validation
+ * of the consistency within clusters. It ranges
+ * between 1 and -1, where a value close to 1
+ * means that the points in a cluster are close
+ * to the other points in the same cluster and
+ * far from the points of the other clusters.
+ */
+@Experimental
+@Since("2.3.0")
+class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val 
uid: String)
+  extends Evaluator with HasPredictionCol with HasFeaturesCol with 
DefaultParamsWritable {
+
+  @Since("2.3.0")
+  def this() = this(Identifiable.randomUID("cluEval"))
+
+  @Since("2.3.0")
+  override def copy(pMap: ParamMap): ClusteringEvaluator = 
this.defaultCopy(pMap)
+
+  @Since("2.3.0")
+  override def isLargerBetter: Boolean = true
+
+  /** @group setParam */
+  @Since("2.3.0")
+  def setPredictionCol(value: String): this.type = set(predictionCol, 
value)
+
+  /** @group setParam */
+  @Since("2.3.0")
+  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
+
+  /**
+   * param for metric name in evaluation
+   * (supports `"squaredSilhouette"` (default))
+   * @group param
+   */
+  @Since("2.3.0")
+  val metricName: Param[String] = {
+val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
--- End diff --

I'd suggest the metric name is ```silhouette```, since we may add 
silhouette for other distance, then we can add another param like 
```distance``` to control that. The param ```metricName``` should not bind to 
any distance computation way. There are lots of other metrics for clustering 
algorithms, like 
[these](http://scikit-learn.org/stable/modules/classes.html#clustering-metrics) 
in sklearn. We would not add all of them for MLlib, but we may add part of them 
in the future.
cc @jkbradley @MLnick @WeichenXu123 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

2017-09-11 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/18538#discussion_r138024385
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala 
---
@@ -0,0 +1,437 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.evaluation
+
+import org.apache.spark.SparkContext
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, 
VectorUDT}
+import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
+import org.apache.spark.ml.util.{DefaultParamsReadable, 
DefaultParamsWritable, Identifiable, SchemaUtils}
+import org.apache.spark.sql.{DataFrame, Dataset}
+import org.apache.spark.sql.functions.{avg, col, udf}
+import org.apache.spark.sql.types.IntegerType
+
+/**
+ * :: Experimental ::
+ * Evaluator for clustering results.
+ * The metric computes the Silhouette measure
+ * using the squared Euclidean distance.
+ *
+ * The Silhouette is a measure for the validation
+ * of the consistency within clusters. It ranges
+ * between 1 and -1, where a value close to 1
+ * means that the points in a cluster are close
+ * to the other points in the same cluster and
+ * far from the points of the other clusters.
+ */
+@Experimental
+@Since("2.3.0")
+class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val 
uid: String)
+  extends Evaluator with HasPredictionCol with HasFeaturesCol with 
DefaultParamsWritable {
+
+  @Since("2.3.0")
+  def this() = this(Identifiable.randomUID("cluEval"))
+
+  @Since("2.3.0")
+  override def copy(pMap: ParamMap): ClusteringEvaluator = 
this.defaultCopy(pMap)
+
+  @Since("2.3.0")
+  override def isLargerBetter: Boolean = true
+
+  /** @group setParam */
+  @Since("2.3.0")
+  def setPredictionCol(value: String): this.type = set(predictionCol, 
value)
+
+  /** @group setParam */
+  @Since("2.3.0")
+  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
+
+  /**
+   * param for metric name in evaluation
+   * (supports `"squaredSilhouette"` (default))
+   * @group param
+   */
+  @Since("2.3.0")
+  val metricName: Param[String] = {
+val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
+new Param(
+  this,
+  "metricName",
+  "metric name in evaluation (squaredSilhouette)",
+  allowedParams
+)
+  }
+
+  /** @group getParam */
+  @Since("2.3.0")
+  def getMetricName: String = $(metricName)
+
+  /** @group setParam */
+  @Since("2.3.0")
+  def setMetricName(value: String): this.type = set(metricName, value)
+
+  setDefault(metricName -> "squaredSilhouette")
+
+  @Since("2.3.0")
+  override def evaluate(dataset: Dataset[_]): Double = {
+SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new 
VectorUDT)
+SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), 
IntegerType)
+
+// Silhouette is reasonable only when the number of clusters is grater 
then 1
+assert(dataset.select($(predictionCol)).distinct().count() > 1,
+  "Number of clusters must be greater than one.")
+
+$(metricName) match {
+  case "squaredSilhouette" => 
SquaredEuclideanSilhouette.computeSilhouetteScore(
+dataset,
+$(predictionCol),
+$(featuresCol)
+  )
+}
+  }
+}
+
+
+@Since("2.3.0")
+object ClusteringEvaluator
+  extends DefaultParamsReadable[ClusteringEvaluator] {
+
+  @Since("2.3.0")
+  override def load(path: String): ClusteringEvaluator = super.load(path)
+
+}
+
+
+/**
+ * SquaredEuclideanSilhouette computes the average of 

[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

2017-09-11 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/18538#discussion_r138025184
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala 
---
@@ -0,0 +1,437 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.evaluation
+
+import org.apache.spark.SparkContext
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, 
VectorUDT}
+import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
+import org.apache.spark.ml.util.{DefaultParamsReadable, 
DefaultParamsWritable, Identifiable, SchemaUtils}
+import org.apache.spark.sql.{DataFrame, Dataset}
+import org.apache.spark.sql.functions.{avg, col, udf}
+import org.apache.spark.sql.types.IntegerType
+
+/**
+ * :: Experimental ::
+ * Evaluator for clustering results.
+ * The metric computes the Silhouette measure
+ * using the squared Euclidean distance.
+ *
+ * The Silhouette is a measure for the validation
+ * of the consistency within clusters. It ranges
+ * between 1 and -1, where a value close to 1
+ * means that the points in a cluster are close
+ * to the other points in the same cluster and
+ * far from the points of the other clusters.
+ */
+@Experimental
+@Since("2.3.0")
+class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val 
uid: String)
+  extends Evaluator with HasPredictionCol with HasFeaturesCol with 
DefaultParamsWritable {
+
+  @Since("2.3.0")
+  def this() = this(Identifiable.randomUID("cluEval"))
+
+  @Since("2.3.0")
+  override def copy(pMap: ParamMap): ClusteringEvaluator = 
this.defaultCopy(pMap)
+
+  @Since("2.3.0")
+  override def isLargerBetter: Boolean = true
+
+  /** @group setParam */
+  @Since("2.3.0")
+  def setPredictionCol(value: String): this.type = set(predictionCol, 
value)
+
+  /** @group setParam */
+  @Since("2.3.0")
+  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
+
+  /**
+   * param for metric name in evaluation
+   * (supports `"squaredSilhouette"` (default))
+   * @group param
+   */
+  @Since("2.3.0")
+  val metricName: Param[String] = {
+val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
+new Param(
+  this,
+  "metricName",
+  "metric name in evaluation (squaredSilhouette)",
+  allowedParams
+)
+  }
+
+  /** @group getParam */
+  @Since("2.3.0")
+  def getMetricName: String = $(metricName)
+
+  /** @group setParam */
+  @Since("2.3.0")
+  def setMetricName(value: String): this.type = set(metricName, value)
+
+  setDefault(metricName -> "squaredSilhouette")
+
+  @Since("2.3.0")
+  override def evaluate(dataset: Dataset[_]): Double = {
+SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new 
VectorUDT)
+SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), 
IntegerType)
+
+// Silhouette is reasonable only when the number of clusters is grater 
then 1
+assert(dataset.select($(predictionCol)).distinct().count() > 1,
+  "Number of clusters must be greater than one.")
+
+$(metricName) match {
+  case "squaredSilhouette" => 
SquaredEuclideanSilhouette.computeSilhouetteScore(
+dataset,
+$(predictionCol),
+$(featuresCol)
+  )
+}
+  }
+}
+
+
+@Since("2.3.0")
+object ClusteringEvaluator
+  extends DefaultParamsReadable[ClusteringEvaluator] {
+
+  @Since("2.3.0")
+  override def load(path: String): ClusteringEvaluator = super.load(path)
+
+}
+
+
+/**
+ * SquaredEuclideanSilhouette computes the average of 

[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

2017-09-11 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/18538#discussion_r138025640
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala 
---
@@ -0,0 +1,437 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.evaluation
+
+import org.apache.spark.SparkContext
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, 
VectorUDT}
+import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
+import org.apache.spark.ml.util.{DefaultParamsReadable, 
DefaultParamsWritable, Identifiable, SchemaUtils}
+import org.apache.spark.sql.{DataFrame, Dataset}
+import org.apache.spark.sql.functions.{avg, col, udf}
+import org.apache.spark.sql.types.IntegerType
+
+/**
+ * :: Experimental ::
+ * Evaluator for clustering results.
+ * The metric computes the Silhouette measure
+ * using the squared Euclidean distance.
+ *
+ * The Silhouette is a measure for the validation
+ * of the consistency within clusters. It ranges
+ * between 1 and -1, where a value close to 1
+ * means that the points in a cluster are close
+ * to the other points in the same cluster and
+ * far from the points of the other clusters.
+ */
+@Experimental
+@Since("2.3.0")
+class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val 
uid: String)
+  extends Evaluator with HasPredictionCol with HasFeaturesCol with 
DefaultParamsWritable {
+
+  @Since("2.3.0")
+  def this() = this(Identifiable.randomUID("cluEval"))
+
+  @Since("2.3.0")
+  override def copy(pMap: ParamMap): ClusteringEvaluator = 
this.defaultCopy(pMap)
+
+  @Since("2.3.0")
+  override def isLargerBetter: Boolean = true
+
+  /** @group setParam */
+  @Since("2.3.0")
+  def setPredictionCol(value: String): this.type = set(predictionCol, 
value)
+
+  /** @group setParam */
+  @Since("2.3.0")
+  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
+
+  /**
+   * param for metric name in evaluation
+   * (supports `"squaredSilhouette"` (default))
+   * @group param
+   */
+  @Since("2.3.0")
+  val metricName: Param[String] = {
+val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
+new Param(
+  this,
+  "metricName",
+  "metric name in evaluation (squaredSilhouette)",
+  allowedParams
+)
+  }
+
+  /** @group getParam */
+  @Since("2.3.0")
+  def getMetricName: String = $(metricName)
+
+  /** @group setParam */
+  @Since("2.3.0")
+  def setMetricName(value: String): this.type = set(metricName, value)
+
+  setDefault(metricName -> "squaredSilhouette")
+
+  @Since("2.3.0")
+  override def evaluate(dataset: Dataset[_]): Double = {
+SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new 
VectorUDT)
+SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), 
IntegerType)
+
+// Silhouette is reasonable only when the number of clusters is grater 
then 1
+assert(dataset.select($(predictionCol)).distinct().count() > 1,
--- End diff --

Move this check to L418, in case another unnecessary computation for most 
of the cases(cluster size > 1). See my comment at L418.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

2017-09-11 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/18538#discussion_r138024573
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala 
---
@@ -0,0 +1,437 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.evaluation
+
+import org.apache.spark.SparkContext
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, 
VectorUDT}
+import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
+import org.apache.spark.ml.util.{DefaultParamsReadable, 
DefaultParamsWritable, Identifiable, SchemaUtils}
+import org.apache.spark.sql.{DataFrame, Dataset}
+import org.apache.spark.sql.functions.{avg, col, udf}
+import org.apache.spark.sql.types.IntegerType
+
+/**
+ * :: Experimental ::
+ * Evaluator for clustering results.
+ * The metric computes the Silhouette measure
+ * using the squared Euclidean distance.
+ *
+ * The Silhouette is a measure for the validation
+ * of the consistency within clusters. It ranges
+ * between 1 and -1, where a value close to 1
+ * means that the points in a cluster are close
+ * to the other points in the same cluster and
+ * far from the points of the other clusters.
+ */
+@Experimental
+@Since("2.3.0")
+class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val 
uid: String)
+  extends Evaluator with HasPredictionCol with HasFeaturesCol with 
DefaultParamsWritable {
+
+  @Since("2.3.0")
+  def this() = this(Identifiable.randomUID("cluEval"))
+
+  @Since("2.3.0")
+  override def copy(pMap: ParamMap): ClusteringEvaluator = 
this.defaultCopy(pMap)
+
+  @Since("2.3.0")
+  override def isLargerBetter: Boolean = true
+
+  /** @group setParam */
+  @Since("2.3.0")
+  def setPredictionCol(value: String): this.type = set(predictionCol, 
value)
+
+  /** @group setParam */
+  @Since("2.3.0")
+  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
+
+  /**
+   * param for metric name in evaluation
+   * (supports `"squaredSilhouette"` (default))
+   * @group param
+   */
+  @Since("2.3.0")
+  val metricName: Param[String] = {
+val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
+new Param(
+  this,
+  "metricName",
+  "metric name in evaluation (squaredSilhouette)",
+  allowedParams
+)
+  }
+
+  /** @group getParam */
+  @Since("2.3.0")
+  def getMetricName: String = $(metricName)
+
+  /** @group setParam */
+  @Since("2.3.0")
+  def setMetricName(value: String): this.type = set(metricName, value)
+
+  setDefault(metricName -> "squaredSilhouette")
+
+  @Since("2.3.0")
+  override def evaluate(dataset: Dataset[_]): Double = {
+SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new 
VectorUDT)
+SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), 
IntegerType)
+
+// Silhouette is reasonable only when the number of clusters is grater 
then 1
+assert(dataset.select($(predictionCol)).distinct().count() > 1,
+  "Number of clusters must be greater than one.")
+
+$(metricName) match {
+  case "squaredSilhouette" => 
SquaredEuclideanSilhouette.computeSilhouetteScore(
+dataset,
+$(predictionCol),
+$(featuresCol)
+  )
+}
+  }
+}
+
+
+@Since("2.3.0")
+object ClusteringEvaluator
+  extends DefaultParamsReadable[ClusteringEvaluator] {
+
+  @Since("2.3.0")
+  override def load(path: String): ClusteringEvaluator = super.load(path)
+
+}
+
+
+/**
+ * SquaredEuclideanSilhouette computes the average of 

[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...

2017-09-11 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/18538#discussion_r138021102
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala 
---
@@ -0,0 +1,437 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.evaluation
+
+import org.apache.spark.SparkContext
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, 
VectorUDT}
+import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol}
+import org.apache.spark.ml.util.{DefaultParamsReadable, 
DefaultParamsWritable, Identifiable, SchemaUtils}
+import org.apache.spark.sql.{DataFrame, Dataset}
+import org.apache.spark.sql.functions.{avg, col, udf}
+import org.apache.spark.sql.types.IntegerType
+
+/**
+ * :: Experimental ::
+ * Evaluator for clustering results.
+ * The metric computes the Silhouette measure
+ * using the squared Euclidean distance.
+ *
+ * The Silhouette is a measure for the validation
+ * of the consistency within clusters. It ranges
+ * between 1 and -1, where a value close to 1
+ * means that the points in a cluster are close
+ * to the other points in the same cluster and
+ * far from the points of the other clusters.
+ */
+@Experimental
+@Since("2.3.0")
+class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val 
uid: String)
+  extends Evaluator with HasPredictionCol with HasFeaturesCol with 
DefaultParamsWritable {
+
+  @Since("2.3.0")
+  def this() = this(Identifiable.randomUID("cluEval"))
+
+  @Since("2.3.0")
+  override def copy(pMap: ParamMap): ClusteringEvaluator = 
this.defaultCopy(pMap)
+
+  @Since("2.3.0")
+  override def isLargerBetter: Boolean = true
+
+  /** @group setParam */
+  @Since("2.3.0")
+  def setPredictionCol(value: String): this.type = set(predictionCol, 
value)
+
+  /** @group setParam */
+  @Since("2.3.0")
+  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
+
+  /**
+   * param for metric name in evaluation
+   * (supports `"squaredSilhouette"` (default))
+   * @group param
+   */
+  @Since("2.3.0")
+  val metricName: Param[String] = {
+val allowedParams = ParamValidators.inArray(Array("squaredSilhouette"))
+new Param(
+  this,
+  "metricName",
+  "metric name in evaluation (squaredSilhouette)",
+  allowedParams
+)
+  }
+
+  /** @group getParam */
+  @Since("2.3.0")
+  def getMetricName: String = $(metricName)
+
+  /** @group setParam */
+  @Since("2.3.0")
+  def setMetricName(value: String): this.type = set(metricName, value)
+
+  setDefault(metricName -> "squaredSilhouette")
+
+  @Since("2.3.0")
+  override def evaluate(dataset: Dataset[_]): Double = {
+SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new 
VectorUDT)
+SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), 
IntegerType)
+
+// Silhouette is reasonable only when the number of clusters is grater 
then 1
+assert(dataset.select($(predictionCol)).distinct().count() > 1,
+  "Number of clusters must be greater than one.")
+
+$(metricName) match {
+  case "squaredSilhouette" => 
SquaredEuclideanSilhouette.computeSilhouetteScore(
+dataset,
+$(predictionCol),
+$(featuresCol)
+  )
+}
+  }
+}
+
+
+@Since("2.3.0")
+object ClusteringEvaluator
+  extends DefaultParamsReadable[ClusteringEvaluator] {
+
+  @Since("2.3.0")
+  override def load(path: String): ClusteringEvaluator = super.load(path)
+
+}
+
+
+/**
+ * SquaredEuclideanSilhouette computes the average of 

[GitHub] spark issue #18875: [SPARK-21513][SQL] Allow UDF to_json support converting ...

2017-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18875
  
**[Test build #81632 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81632/testReport)**
 for PR 18875 at commit 
[`8835aca`](https://github.com/apache/spark/commit/8835aca5a5ccf420010b24e43189bfa8f1e8dc2a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18875: [SPARK-21513][SQL] Allow UDF to_json support converting ...

2017-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18875
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...

2017-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16677
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81627/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...

2017-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16677
  
**[Test build #81627 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81627/testReport)**
 for PR 16677 at commit 
[`f2a7aac`](https://github.com/apache/spark/commit/f2a7aacc22caf27c1a8af612c9432586e6a86d17).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18875: [SPARK-21513][SQL] Allow UDF to_json support converting ...

2017-09-11 Thread goldmedal
Github user goldmedal commented on the issue:

https://github.com/apache/spark/pull/18875
  
@HyukjinKwon We have finished the `MapType` and `ArrayType` of `MapType`s 
supporting. Please take a look when you are available. Thanks :)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19147: [WIP][SPARK-21190][SQL][PYTHON] Vectorized UDFs in Pytho...

2017-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19147
  
**[Test build #81629 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81629/testReport)**
 for PR 19147 at commit 
[`dbc6dd2`](https://github.com/apache/spark/commit/dbc6dd2138b427d8436eb0d8bdc4ba134f254e35).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19147: [WIP][SPARK-21190][SQL][PYTHON] Vectorized UDFs in Pytho...

2017-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19147
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19147: [WIP][SPARK-21190][SQL][PYTHON] Vectorized UDFs i...

2017-09-11 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19147#discussion_r138003592
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala
 ---
@@ -62,6 +62,7 @@ import org.apache.spark.util.Utils
  */
 case class BatchEvalPythonExec(udfs: Seq[PythonUDF], output: 
Seq[Attribute], child: SparkPlan)
--- End diff --

How about `BlockedEvalPythonExec` or something?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

2017-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17819
  
**[Test build #81634 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81634/testReport)**
 for PR 17819 at commit 
[`f8dedd1`](https://github.com/apache/spark/commit/f8dedd1c92a8c48358743626b99c2f2192bc09b1).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18875: [SPARK-21513][SQL] Allow UDF to_json support converting ...

2017-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18875
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18875: [SPARK-21513][SQL] Allow UDF to_json support converting ...

2017-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18875
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81631/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19186: [SPARK-21972][ML] Add param handlePersistence

2017-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19186
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

2017-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17819
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81634/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

2017-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17819
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19086: [SPARK-21874][SQL] Support changing database when rename...

2017-09-11 Thread jinxing64
Github user jinxing64 commented on the issue:

https://github.com/apache/spark/pull/19086
  
@gatorsmile
OK and thanks a lot for review :)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19184: [SPARK-21971][CORE] Too many open files in Spark due to ...

2017-09-11 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/19184
  
@rajeshbalamohan Thanks for updating. I think we need a complete fix as 
previous comments from the reviewers @jerryshao @kiszk @jiangxb1987 suggested. 
Can you try to fix this according to the comments? Thanks. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org




[GitHub] spark pull request #19189: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...

2017-09-11 Thread fjh100456
GitHub user fjh100456 opened a pull request:

https://github.com/apache/spark/pull/19189

[SPARK-21786][SQL] The 'spark.sql.parquet.compression.codec' configuration 
doesn't take effect on tables with partition field(s)

## What changes were proposed in this pull request?
Pass the spark configuration ‘spark.sql.parquet.compression.codec’ to 
hive so that it can take effect.

## How was this patch tested?
Manually checked.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/fjh100456/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19189.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19189


commit f91ed07718d4462e487237353bf82c80c5e148f7
Author: fjh100456 
Date:   2017-06-19T08:53:06Z

[SPARK-21135][WEB UI] On history server page,duration of incompleted 
applications should be hidden instead of showing up as 0

## What changes were proposed in this pull request?

Hide duration of incomplete applications.

## How was this patch tested?

manual tests

commit faffde55e38d1db9f54338f29429f98baf73def5
Author: fjh100456 
Date:   2017-06-30T06:15:01Z

Merge pull request #1 from apache/master

更新代码到2017-06-30

commit 44ff604cb6fca37acafd018c3e25888fbe1df1ae
Author: fjh100456 
Date:   2017-09-08T09:02:59Z

Merge pull request #2 from apache/master

Merge spark master




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19189: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...

2017-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19189
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19174: [SPARK-21963][CORE][TEST]Create temp file should be dele...

2017-09-11 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/19174
  
Same, let's not bother with stuff this trivial @heary-cao please. 
If it really makes the code consistent on this one point, I'm not against 
this, other than that it encourages more PRs this small.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19186: [SPARK-21972][ML] Add param handlePersistence

2017-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19186
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81626/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19186: [SPARK-21972][ML] Add param handlePersistence

2017-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19186
  
**[Test build #81626 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81626/testReport)**
 for PR 19186 at commit 
[`f8fa957`](https://github.com/apache/spark/commit/f8fa9573a1b40ff236e9c52cf429e2742c8f2bd0).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...

2017-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16677
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

2017-09-11 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/17819
  
ping @MLnick Can you have time to help review this recently? Thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19147: [WIP][SPARK-21190][SQL][PYTHON] Vectorized UDFs i...

2017-09-11 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19147#discussion_r138010327
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala
 ---
@@ -62,6 +62,7 @@ import org.apache.spark.util.Utils
  */
 case class BatchEvalPythonExec(udfs: Seq[PythonUDF], output: 
Seq[Attribute], child: SparkPlan)
--- End diff --

I feel it is better than `BatchEvalPythonExec`. Don't know if others have 
any suggestion.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19188: [SPARK-21973][SQL] Add an new option to filter qu...

2017-09-11 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/19188#discussion_r138010133
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/TPCDSQueryBenchmark.scala
 ---
@@ -113,12 +114,40 @@ object TPCDSQueryBenchmark {
   "q81", "q82", "q83", "q84", "q85", "q86", "q87", "q88", "q89", "q90",
   "q91", "q92", "q93", "q94", "q95", "q96", "q97", "q98", "q99")
 
+val sparkConf = new SparkConf()
+  .setMaster("local[1]")
+  .setAppName("test-sql-context")
+  .set("spark.sql.parquet.compression.codec", "snappy")
+  .set("spark.sql.shuffle.partitions", "4")
+  .set("spark.driver.memory", "3g")
+  .set("spark.executor.memory", "3g")
+  .set("spark.sql.autoBroadcastJoinThreshold", (20 * 1024 * 
1024).toString)
+  .set("spark.sql.crossJoin.enabled", "true")
+
+// If `spark.sql.tpcds.queryFilter` defined, this class filters the 
queries that
+// this option selects.
+val queryFilter = sparkConf
+  
.getOption("spark.sql.tpcds.queryFilter").map(_.split(",").map(_.trim).toSet)
+  .getOrElse(Set.empty)
+
+val queriesToRun = if (queryFilter.nonEmpty) {
+  val queries = tpcdsAllQueries.filter { case queryName => 
queryFilter.contains(queryName) }
+  if (queries.isEmpty) {
+throw new RuntimeException("Bad query name filter: " + queryFilter)
+  }
+  queries
+} else {
+  tpcdsAllQueries
+}
+
 // In order to run this benchmark, please follow the instructions at
 // https://github.com/databricks/spark-sql-perf/blob/master/README.md 
to generate the TPCDS data
 // locally (preferably with a scale factor of 5 for benchmarking). 
Thereafter, the value of
 // dataLocation below needs to be set to the location where the 
generated data is stored.
 val dataLocation = ""
 
-tpcdsAll(dataLocation, queries = tpcdsQueries)
+val spark = SparkSession.builder.config(sparkConf).getOrCreate()
+val tpcdsQueries = TpcdsQueries(spark, queries = queriesToRun, 
dataLocation)
--- End diff --

nit: Do we need `queries =`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15178: [SPARK-17556][SQL] Executor side broadcast for broadcast...

2017-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15178
  
**[Test build #81636 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81636/testReport)**
 for PR 15178 at commit 
[`93ecabb`](https://github.com/apache/spark/commit/93ecabb7c7c9b2fa42a321e28f834568cadf0272).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19189: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...

2017-09-11 Thread fjh100456
Github user fjh100456 closed the pull request at:

https://github.com/apache/spark/pull/19189


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19172: [SPARK-21856] Add probability and rawPrediction to MLPC ...

2017-09-11 Thread yanboliang
Github user yanboliang commented on the issue:

https://github.com/apache/spark/pull/19172
  
LGTM2, merged into master. Thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19182: [SPARK-21970][Core] Fix Redundant Throws Declarations in...

2017-09-11 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/19182
  
Some of these 'throws' clauses may not be removable because they cause 
callers that catch the checked exception to fail to compile. 

Removing "throws Exception" in tests isn't obviously helpful, because it 
means the signature has to change any time the tested code happens to throw a 
new checked exception. There's no API problem as nothing invokes these methods 
directly.

Private non-test methods, maybe.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...

2017-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18732
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19132: [SPARK-21922] Fix duration always updating when task fai...

2017-09-11 Thread caneGuy
Github user caneGuy commented on the issue:

https://github.com/apache/spark/pull/19132
  
@jerryshao Thanks for your time. IIUC, event log is completed since driver 
has not dropped any event of executor which has problem described above.See 
below,driver only drop two events after shutting down:
`2017-09-11,17:08:57,605 ERROR org.apache.spark.scheduler.LiveListenerBus: 
SparkListenerBus has already stopped! Dropping event 
SparkListenerExecutorMetricsUpdate(300,WrappedArray())
2017-09-11,17:08:57,623 INFO org.apache.spark.deploy.yarn.YarnAllocator: 
Driver requested a total number of 0 executor(s).
2017-09-11,17:08:57,624 INFO 
org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend: Shutting down 
all executors
2017-09-11,17:08:57,624 INFO 
org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend: Asking each 
executor to shut down
2017-09-11,17:08:57,629 ERROR org.apache.spark.scheduler.LiveListenerBus: 
SparkListenerBus has already stopped! Dropping event 
SparkListenerExecutorMetricsUpdate(258,WrappedArray())`

And below showing an executor has this problem:

![running](https://user-images.githubusercontent.com/26762018/30268618-ea27e8bc-9718-11e7-9f40-3071116255eb.png)

![running02](https://user-images.githubusercontent.com/26762018/30268619-ea318c6e-9718-11e7-968d-ba2174e2f079.png)

Case here is a job running today of our cluster.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18875: [SPARK-21513][SQL] Allow UDF to_json support converting ...

2017-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18875
  
**[Test build #81631 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81631/testReport)**
 for PR 18875 at commit 
[`be62e51`](https://github.com/apache/spark/commit/be62e51109c8a438fed8d787fb2986aeb1974f82).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18875: [SPARK-21513][SQL] Allow UDF to_json support converting ...

2017-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18875
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81632/
Test PASSed.


---




[GitHub] spark issue #19184: [SPARK-21971][CORE] Too many open files in Spark due to ...

2017-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19184
  
**[Test build #81628 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81628/testReport)**
 for PR 19184 at commit 
[`ea5f9d9`](https://github.com/apache/spark/commit/ea5f9d9903185690670da6384dbd6d2b08b8177f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19184: [SPARK-21971][CORE] Too many open files in Spark due to ...

2017-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19184
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81628/
Test PASSed.


---




[GitHub] spark issue #19184: [SPARK-21971][CORE] Too many open files in Spark due to ...

2017-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19184
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19147: [WIP][SPARK-21190][SQL][PYTHON] Vectorized UDFs in Pytho...

2017-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19147
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

2017-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18538
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

2017-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18538
  
**[Test build #81639 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81639/testReport)**
 for PR 18538 at commit 
[`b0b7853`](https://github.com/apache/spark/commit/b0b7853d68c1c79bd49d6e290d3c96fe9e3af6ea).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

2017-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18538
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81639/
Test PASSed.


---




[GitHub] spark issue #19190: [SPARK-21976][DOC]

2017-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19190
  
**[Test build #3917 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3917/testReport)**
 for PR 19190 at commit 
[`a95cfc6`](https://github.com/apache/spark/commit/a95cfc6c5e88f44429319b52462b537ae0bc1857).


---




[GitHub] spark pull request #19185: [Spark-21854] Added LogisticRegressionTrainingSum...

2017-09-11 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/19185#discussion_r138047016
  
--- Diff: python/pyspark/ml/classification.py ---
@@ -603,6 +614,112 @@ def featuresCol(self):
 """
 return self._call_java("featuresCol")
 
+@property
+@since("2.3.0")
+def labels(self):
+"""
+Returns the sequence of labels in ascending order. This order 
matches the order used
+in metrics which are specified as arrays over labels, e.g., 
truePositiveRateByLabel.
+
+Note: In most cases, it will be values {0.0, 1.0, ..., 
numClasses-1}, However, if the
+training set is missing a label, then all of the arrays over labels
+(e.g., from truePositiveRateByLabel) will be of length 
numClasses-1 instead of the
+expected numClasses.
+"""
+return self._call_java("labels")
+
+@property
+@since("2.3.0")
+def truePositiveRateByLabel(self):
+"""
+Returns true positive rate for each label (category).
+"""
+return self._call_java("truePositiveRateByLabel")
+
+@property
+@since("2.3.0")
+def falsePositiveRateByLabel(self):
+"""
+Returns false positive rate for each label (category).
+"""
+return self._call_java("falsePositiveRateByLabel")
+
+@property
+@since("2.3.0")
+def precisionByLabel(self):
+"""
+Returns precision for each label (category).
+"""
+return self._call_java("precisionByLabel")
+
+@property
+@since("2.3.0")
+def recallByLabel(self):
+"""
+Returns recall for each label (category).
+"""
+return self._call_java("recallByLabel")
+
+@property
+@since("2.3.0")
+def fMeasureByLabel(self, beta=1.0):
+"""
+Returns f-measure for each label (category).
+"""
+return self._call_java("fMeasureByLabel", beta)
+
+@property
+@since("2.3.0")
+def accuracy(self):
+"""
+Returns accuracy.
+(equals to the total number of correctly classified instances
+out of the total number of instances.)
+"""
+return self._call_java("accuracy")
+
+@property
+@since("2.3.0")
+def weightedTruePositiveRate(self):
+"""
+Returns weighted true positive rate.
+(equals to precision, recall and f-measure)
+"""
+return self._call_java("weightedTruePositiveRate")
+
+@property
+@since("2.3.0")
+def weightedFalsePositiveRate(self):
+"""
+Returns weighted false positive rate.
+"""
+return self._call_java("weightedFalsePositiveRate")
+
+@property
+@since("2.3.0")
+def weightedRecall(self):
+"""
+Returns weighted averaged recall.
+(equals to precision, recall and f-measure)
+"""
+return self._call_java("weightedRecall")
+
+@property
+@since("2.3.0")
+def weightedPrecision(self):
+"""
+Returns weighted averaged precision.
+"""
+return self._call_java("weightedPrecision")
+
+@property
--- End diff --

Remove this annotation.


---




[GitHub] spark pull request #19185: [Spark-21854] Added LogisticRegressionTrainingSum...

2017-09-11 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/19185#discussion_r138048004
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -1478,6 +1478,40 @@ def test_logistic_regression_summary(self):
 sameSummary = model.evaluate(df)
 self.assertAlmostEqual(sameSummary.areaUnderROC, s.areaUnderROC)
 
+def test_multiclass_logistic_regression_summary(self):
+df = self.spark.createDataFrame([(1.0, 2.0, Vectors.dense(1.0)),
+ (0.0, 2.0, Vectors.sparse(1, [], 
[])),
+ (2.0, 2.0, Vectors.dense(2.0)),
+ (2.0, 2.0, Vectors.dense(1.9))],
+["label", "weight", "features"])
+lr = LogisticRegression(maxIter=5, regParam=0.01, 
weightCol="weight", fitIntercept=False)
+model = lr.fit(df)
+self.assertTrue(model.hasSummary)
+s = model.summary
+# test that api is callable and returns expected types
+self.assertTrue(isinstance(s.predictions, DataFrame))
+self.assertEqual(s.probabilityCol, "probability")
+self.assertEqual(s.labelCol, "label")
+self.assertEqual(s.featuresCol, "features")
+self.assertEqual(s.predictionCol, "prediction")
+objHist = s.objectiveHistory
+self.assertTrue(isinstance(objHist, list) and 
isinstance(objHist[0], float))
+self.assertGreater(s.totalIterations, 0)
+self.assertTrue(isinstance(s.labels, list))
+self.assertTrue(isinstance(s.truePositiveRateByLabel, list))
+self.assertTrue(isinstance(s.falsePositiveRateByLabel, list))
+self.assertTrue(isinstance(s.precisionByLabel, list))
+self.assertTrue(isinstance(s.recallByLabel, list))
+self.assertTrue(isinstance(s.fMeasureByLabel, list))
+self.assertAlmostEqual(s.accuracy, 0.75, 2)
+self.assertAlmostEqual(s.weightedTruePositiveRate, 0.75, 2)
+self.assertAlmostEqual(s.weightedFalsePositiveRate, 0.25, 2)
+self.assertAlmostEqual(s.weightedRecall, 0.75, 2)
+self.assertAlmostEqual(s.weightedPrecision, 0.583, 2)
+self.assertAlmostEqual(s.weightedFMeasure, 0.65, 2)
--- End diff --

We need to add these checks to the above 
```test_logistic_regression_summary``` and rename it to 
```test_binary_logistic_regression_summary```, since the binary logistic 
regression summary has these variables as well.


---




[GitHub] spark pull request #19185: [Spark-21854] Added LogisticRegressionTrainingSum...

2017-09-11 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/19185#discussion_r138046547
  
--- Diff: python/pyspark/ml/classification.py ---
@@ -603,6 +614,112 @@ def featuresCol(self):
 """
 return self._call_java("featuresCol")
 
+@property
+@since("2.3.0")
+def labels(self):
+"""
+Returns the sequence of labels in ascending order. This order 
matches the order used
+in metrics which are specified as arrays over labels, e.g., 
truePositiveRateByLabel.
+
+Note: In most cases, it will be values {0.0, 1.0, ..., 
numClasses-1}, However, if the
+training set is missing a label, then all of the arrays over labels
+(e.g., from truePositiveRateByLabel) will be of length 
numClasses-1 instead of the
+expected numClasses.
+"""
+return self._call_java("labels")
+
+@property
+@since("2.3.0")
+def truePositiveRateByLabel(self):
+"""
+Returns true positive rate for each label (category).
+"""
+return self._call_java("truePositiveRateByLabel")
+
+@property
+@since("2.3.0")
+def falsePositiveRateByLabel(self):
+"""
+Returns false positive rate for each label (category).
+"""
+return self._call_java("falsePositiveRateByLabel")
+
+@property
+@since("2.3.0")
+def precisionByLabel(self):
+"""
+Returns precision for each label (category).
+"""
+return self._call_java("precisionByLabel")
+
+@property
+@since("2.3.0")
+def recallByLabel(self):
+"""
+Returns recall for each label (category).
+"""
+return self._call_java("recallByLabel")
+
+@property
--- End diff --

Remove this annotation.


---




[GitHub] spark pull request #19185: [Spark-21854] Added LogisticRegressionTrainingSum...

2017-09-11 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/19185#discussion_r138047555
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -1478,6 +1478,40 @@ def test_logistic_regression_summary(self):
 sameSummary = model.evaluate(df)
 self.assertAlmostEqual(sameSummary.areaUnderROC, s.areaUnderROC)
 
+def test_multiclass_logistic_regression_summary(self):
+df = self.spark.createDataFrame([(1.0, 2.0, Vectors.dense(1.0)),
+ (0.0, 2.0, Vectors.sparse(1, [], 
[])),
+ (2.0, 2.0, Vectors.dense(2.0)),
+ (2.0, 2.0, Vectors.dense(1.9))],
+["label", "weight", "features"])
+lr = LogisticRegression(maxIter=5, regParam=0.01, 
weightCol="weight", fitIntercept=False)
+model = lr.fit(df)
+self.assertTrue(model.hasSummary)
+s = model.summary
+# test that api is callable and returns expected types
+self.assertTrue(isinstance(s.predictions, DataFrame))
+self.assertEqual(s.probabilityCol, "probability")
+self.assertEqual(s.labelCol, "label")
+self.assertEqual(s.featuresCol, "features")
+self.assertEqual(s.predictionCol, "prediction")
+objHist = s.objectiveHistory
+self.assertTrue(isinstance(objHist, list) and 
isinstance(objHist[0], float))
+self.assertGreater(s.totalIterations, 0)
+self.assertTrue(isinstance(s.labels, list))
+self.assertTrue(isinstance(s.truePositiveRateByLabel, list))
+self.assertTrue(isinstance(s.falsePositiveRateByLabel, list))
+self.assertTrue(isinstance(s.precisionByLabel, list))
+self.assertTrue(isinstance(s.recallByLabel, list))
+self.assertTrue(isinstance(s.fMeasureByLabel, list))
+self.assertAlmostEqual(s.accuracy, 0.75, 2)
+self.assertAlmostEqual(s.weightedTruePositiveRate, 0.75, 2)
+self.assertAlmostEqual(s.weightedFalsePositiveRate, 0.25, 2)
+self.assertAlmostEqual(s.weightedRecall, 0.75, 2)
+self.assertAlmostEqual(s.weightedPrecision, 0.583, 2)
+self.assertAlmostEqual(s.weightedFMeasure, 0.65, 2)
+# test evaluation (with training dataset) produces a summary with 
same values
+# one check is enough to verify a summary is returned, Scala 
version runs full test
--- End diff --

Please add test for evaluation like:
```
sameSummary = model.evaluate(df)
self.assertAlmostEqual(sameSummary.accuracy, s.accuracy)
```


---




[GitHub] spark pull request #19185: [Spark-21854] Added LogisticRegressionTrainingSum...

2017-09-11 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/19185#discussion_r138045280
  
--- Diff: python/pyspark/ml/classification.py ---
@@ -529,8 +529,11 @@ def summary(self):
 """
 if self.hasSummary:
 java_blrt_summary = self._call_java("summary")
-# Note: Once multiclass is added, update this to return 
correct summary
-return 
BinaryLogisticRegressionTrainingSummary(java_blrt_summary)
+if (self.numClasses == 2):
+java_blrt_binarysummary = self._call_java("binarySummary")
--- End diff --

Actually this is not necessary; we can just wrap ```java_lrt_summary``` 
with ```BinaryLogisticRegressionTrainingSummary```.


---




[GitHub] spark pull request #19185: [Spark-21854] Added LogisticRegressionTrainingSum...

2017-09-11 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/19185#discussion_r138045342
  
--- Diff: python/pyspark/ml/classification.py ---
@@ -529,8 +529,11 @@ def summary(self):
 """
 if self.hasSummary:
 java_blrt_summary = self._call_java("summary")
-# Note: Once multiclass is added, update this to return 
correct summary
-return 
BinaryLogisticRegressionTrainingSummary(java_blrt_summary)
+if (self.numClasses == 2):
--- End diff --

```if (self.numClasses <= 2)```


---




[GitHub] spark pull request #19185: [Spark-21854] Added LogisticRegressionTrainingSum...

2017-09-11 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/19185#discussion_r138045070
  
--- Diff: python/pyspark/ml/classification.py ---
@@ -529,8 +529,11 @@ def summary(self):
 """
 if self.hasSummary:
 java_blrt_summary = self._call_java("summary")
--- End diff --

Rename this to ```java_lrt_summary```, as it's not always _binary_ logistic 
regression.


---




[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...

2017-09-11 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18875#discussion_r138078894
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala
 ---
@@ -612,6 +612,54 @@ class JsonExpressionsSuite extends SparkFunSuite with 
ExpressionEvalHelper {
 )
   }
 
+  test("SPARK-21513: to_json support map[string, struct] to json") {
+val schema = MapType(StringType, StructType(StructField("a", 
IntegerType) :: Nil))
+val input = Literal.create(ArrayBasedMapData(Map("test" -> 
InternalRow(1))), schema)
+checkEvaluation(
+  StructsToJson(Map.empty, input),
+  """{"test":{"a":1}}"""
+)
+  }
+
+  test("SPARK-21513: to_json support map[struct, struct] to json") {
+val schema = MapType(StructType(StructField("a", IntegerType) :: Nil),
+  StructType(StructField("b", IntegerType) :: Nil))
+val input = Literal.create(ArrayBasedMapData(Map(InternalRow(1) -> 
InternalRow(2))), schema)
+checkEvaluation(
+  StructsToJson(Map.empty, input),
+  """{"[1]":{"b":2}}"""
+)
+  }
+
+  test("SPARK-21513: to_json support map[string, integer] to json") {
+val schema = MapType(StringType, IntegerType)
+val input = Literal.create(ArrayBasedMapData(Map("a" -> 1)), schema)
+checkEvaluation(
+  StructsToJson(Map.empty, input),
+  """{"a":1}"""
+)
+  }
+
+  test("to_json - array with maps") {
+val inputSchema = ArrayType(MapType(StringType, IntegerType))
+val input = new GenericArrayData(ArrayBasedMapData(
+  Map("a" -> 1)) :: ArrayBasedMapData(Map("b" -> 2)) :: Nil)
+val output = """[{"a":1},{"b":2}]"""
+checkEvaluation(
+  StructsToJson(Map.empty, Literal.create(input, inputSchema), gmtId),
+  output)
+  }
+
+  test("to_json - array with single map") {
+val inputSchema = ArrayType(MapType(StringType, IntegerType))
+val input = new GenericArrayData(ArrayBasedMapData(
+  Map("a" -> 1)) :: Nil)
--- End diff --

Let's inline this.


---




[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...

2017-09-11 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18875#discussion_r138079126
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonGenerator.scala
 ---
@@ -193,14 +226,26 @@ private[sql] class JacksonGenerator(
*
* @param row The row to convert
*/
-  def write(row: InternalRow): Unit = writeObject(writeFields(row, schema, 
rootFieldWriters))
+  def write(row: InternalRow): Unit = {
+writeObject(writeFields(row, dataType.asInstanceOf[StructType], 
rootFieldWriters))
+  }
+
 
   /**
-   * Transforms multiple `InternalRow`s to JSON array using Jackson
+   * Transforms multiple `InternalRow`s or `MapData`s to JSON array using 
Jackson
*
-   * @param array The array of rows to convert
+   * @param array The array of rows or maps to convert
*/
   def write(array: ArrayData): Unit = writeArray(writeArrayData(array, 
arrElementWriter))
 
+  /**
+   * Transforms a `MapData` to JSON object using Jackson
--- End diff --

Not a big deal but `` a `MapData` `` -> `` a single `MapData` `` just for 
consistency with ` write(row: InternalRow)`.


---




[GitHub] spark issue #19182: [SPARK-21970][Core] Fix Redundant Throws Declarations in...

2017-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19182
  
**[Test build #3916 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3916/testReport)**
 for PR 19182 at commit 
[`b69a20b`](https://github.com/apache/spark/commit/b69a20bf5a1c52dc69e3bbc61c804d3846c72137).


---




[GitHub] spark issue #19188: [SPARK-21973][SQL] Add an new option to filter queries i...

2017-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19188
  
**[Test build #81637 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81637/testReport)**
 for PR 19188 at commit 
[`322c335`](https://github.com/apache/spark/commit/322c335a027edc8c6705b72fd76f00e6933a987f).


---




[GitHub] spark issue #18853: [SPARK-21646][SQL] CommonType for binary comparison

2017-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18853
  
**[Test build #81638 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81638/testReport)**
 for PR 18853 at commit 
[`3bec6a2`](https://github.com/apache/spark/commit/3bec6a22565c58ad2da18d66e1e285d644f3577a).


---




[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...

2017-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18538
  
**[Test build #81639 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81639/testReport)**
 for PR 18538 at commit 
[`b0b7853`](https://github.com/apache/spark/commit/b0b7853d68c1c79bd49d6e290d3c96fe9e3af6ea).


---




[GitHub] spark issue #19190: [SPARK-21976][DOC]

2017-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19190
  
Can one of the admins verify this patch?


---




[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...

2017-09-11 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18875#discussion_r138058970
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonGenerator.scala
 ---
@@ -26,20 +26,53 @@ import 
org.apache.spark.sql.catalyst.expressions.SpecializedGetters
 import org.apache.spark.sql.catalyst.util.{ArrayData, DateTimeUtils, 
MapData}
 import org.apache.spark.sql.types._
 
+/**
+ * `JacksonGenerator` can only be initialized with a `StructType` or a 
`MapType`.
+ * Once it is initialized with `StructType`, it can be used to write out a 
struct or an array of
+ * struct. Once it is initialized with ``MapType``, it can be used to 
write out a map or an array
+ * of map. An exception will be thrown if trying to write out a struct if 
it is initialized with
+ * a `MapType`, and vice versa.
+ */
 private[sql] class JacksonGenerator(
-schema: StructType,
+dataType: DataType,
 writer: Writer,
 options: JSONOptions) {
   // A `ValueWriter` is responsible for writing a field of an 
`InternalRow` to appropriate
   // JSON data. Here we are using `SpecializedGetters` rather than 
`InternalRow` so that
   // we can directly access data in `ArrayData` without the help of 
`SpecificMutableRow`.
   private type ValueWriter = (SpecializedGetters, Int) => Unit
 
+  // `JacksonGenerator` can only be initialized with a `StructType` or a 
`MapType`.
+  dataType match {
+case _: StructType | _: MapType =>
+case _ => throw new UnsupportedOperationException(
+  s"`JacksonGenerator` only supports to be initialized with a 
`StructType` " +
--- End diff --

This `s` could be removed too. Let's avoid backquoting classes in the error 
messages.


---




[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...

2017-09-11 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18875#discussion_r138073822
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/json/JacksonGeneratorSuite.scala
 ---
@@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.json
+
+import java.io.CharArrayWriter
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.sql.catalyst.InternalRow
+import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, 
DateTimeUtils, GenericArrayData}
+import org.apache.spark.sql.types._
+
+class JacksonGeneratorSuite extends SparkFunSuite {
+
+  val gmtId = DateTimeUtils.TimeZoneGMT.getID
+  val option = new JSONOptions(Map.empty, gmtId)
+
+  test("initial with StructType and write out a row") {
+val dataType = StructType(StructField("a", IntegerType) :: Nil)
+val input = InternalRow(1)
+val writer = new CharArrayWriter()
+val gen = new JacksonGenerator(dataType, writer, option)
+gen.write(input)
+gen.flush()
+assert(writer.toString === """{"a":1}""")
+  }
+
+  test("initial with StructType and write out rows") {
+val dataType = StructType(StructField("a", IntegerType) :: Nil)
+val input = new GenericArrayData(InternalRow(1) :: InternalRow(2) :: 
Nil)
+val writer = new CharArrayWriter()
+val gen = new JacksonGenerator(dataType, writer, option)
+gen.write(input)
+gen.flush()
+assert(writer.toString === """[{"a":1},{"a":2}]""")
+  }
+
+  test("initial with StructType and write out an array with single empty 
row") {
+val dataType = StructType(StructField("a", IntegerType) :: Nil)
+val input = new GenericArrayData(InternalRow(null) :: Nil)
+val writer = new CharArrayWriter()
+val gen = new JacksonGenerator(dataType, writer, option)
+gen.write(input)
+gen.flush()
+assert(writer.toString === """[{}]""")
+  }
+
+  test("initial with StructType and write out an empty array") {
+val dataType = StructType(StructField("a", IntegerType) :: Nil)
+val input = new GenericArrayData(Nil)
+val writer = new CharArrayWriter()
+val gen = new JacksonGenerator(dataType, writer, option)
+gen.write(input)
+gen.flush()
+assert(writer.toString === """[]""")
+  }
+
+  test("initial with Map and write out a map data") {
+val dataType = MapType(StringType, IntegerType)
+val input = ArrayBasedMapData(Map("a" -> 1))
+val writer = new CharArrayWriter()
+val gen = new JacksonGenerator(dataType, writer, option)
+gen.write(input)
+gen.flush()
+assert(writer.toString === """{"a":1}""")
+  }
+
+  test("initial with Map and write out an array of maps") {
+val dataType = MapType(StringType, IntegerType)
+val input = new GenericArrayData(
+  ArrayBasedMapData(Map("a" -> 1)) :: ArrayBasedMapData(Map("b" -> 2)) 
:: Nil)
+val writer = new CharArrayWriter()
+val gen = new JacksonGenerator(dataType, writer, option)
+gen.write(input)
+gen.flush()
+assert(writer.toString === """[{"a":1},{"b":2}]""")
+  }
+
+  test("error handling: initial with StructType but error calling write a 
map") {
+val dataType = StructType(StructField("a", IntegerType) :: Nil)
+val input = ArrayBasedMapData(Map("a" -> 1))
+val writer = new CharArrayWriter()
+val gen = new JacksonGenerator(dataType, writer, option)
+intercept[Exception] {
+  gen.write(input)
+}
+  }
+
+  test("error handling: initial with StructType but error calling write an 
array of maps") {
+val dataType = StructType(StructField("a", IntegerType) :: Nil)
+val input = new GenericArrayData(
+  ArrayBasedMapData(Map("a" -> 1)) :: ArrayBasedMapData(Map("b" -> 2)) 
:: Nil)
+val writer = new 

[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...

2017-09-11 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18875#discussion_r138077808
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala ---
@@ -180,10 +180,30 @@ class JsonFunctionsSuite extends QueryTest with 
SharedSQLContext {
 
   test("to_json - array") {
 val df = Seq(Tuple1(Tuple1(1) :: Nil)).toDF("a")
+val df2 = Seq(Tuple1(Map("a" -> 1) :: Nil)).toDF("a")
 
 checkAnswer(
   df.select(to_json($"a")),
   Row("""[{"_1":1}]""") :: Nil)
+checkAnswer(
+  df2.select(to_json($"a")),
+  Row("""[{"a":1}]""") :: Nil)
+  }
+
+  test("to_json - map") {
+val df1 = Seq(Map("a" -> Tuple1(1))).toDF("a")
+val df2 = Seq(Map(Tuple1(1) -> Tuple1(1))).toDF("a")
+val df3 = Seq(Map("a" -> 1)).toDF("a")
+
+checkAnswer(
+  df1.select(to_json($"a")),
+  Row("""{"a":{"_1":1}}""") :: Nil)
+checkAnswer(
+  df2.select(to_json($"a")),
+  Row("""{"[0,1]":{"_1":1}}""") :: Nil)
--- End diff --

This case looks rather like a string representation from `UnsafeRow`, and a bug. 
Let's remove this test case for now and fix it later together in another PR.


---




[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...

2017-09-11 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18875#discussion_r138057979
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
 ---
@@ -677,14 +696,25 @@ case class StructsToJson(
   override def checkInputDataTypes(): TypeCheckResult = child.dataType 
match {
 case _: StructType | ArrayType(_: StructType, _) =>
   try {
-JacksonUtils.verifySchema(rowSchema)
+JacksonUtils.verifySchema(rowSchema.asInstanceOf[StructType])
+TypeCheckResult.TypeCheckSuccess
+  } catch {
+case e: UnsupportedOperationException =>
+  TypeCheckResult.TypeCheckFailure(e.getMessage)
+  }
+case _: MapType | ArrayType(_: MapType, _) =>
+  // TODO: let `JacksonUtils.verifySchema` verify a `MapType`
+  try {
+val st = StructType(StructField("a", 
rowSchema.asInstanceOf[MapType]) :: Nil)
+JacksonUtils.verifySchema(st)
 TypeCheckResult.TypeCheckSuccess
   } catch {
 case e: UnsupportedOperationException =>
   TypeCheckResult.TypeCheckFailure(e.getMessage)
   }
 case _ => TypeCheckResult.TypeCheckFailure(
-  s"Input type ${child.dataType.simpleString} must be a struct or 
array of structs.")
+  s"Input type ${child.dataType.simpleString} must be a struct, array 
of structs or " +
+  s"a map or array of map.")
--- End diff --

Looks `s` could be removed.


---




[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...

2017-09-11 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18875#discussion_r138058695
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonGenerator.scala
 ---
@@ -26,20 +26,53 @@ import 
org.apache.spark.sql.catalyst.expressions.SpecializedGetters
 import org.apache.spark.sql.catalyst.util.{ArrayData, DateTimeUtils, 
MapData}
 import org.apache.spark.sql.types._
 
+/**
+ * `JacksonGenerator` can only be initialized with a `StructType` or a 
`MapType`.
+ * Once it is initialized with `StructType`, it can be used to write out a 
struct or an array of
+ * struct. Once it is initialized with ``MapType``, it can be used to 
write out a map or an array
+ * of map. An exception will be thrown if trying to write out a struct if 
it is initialized with
+ * a `MapType`, and vice versa.
+ */
 private[sql] class JacksonGenerator(
-schema: StructType,
+dataType: DataType,
 writer: Writer,
 options: JSONOptions) {
   // A `ValueWriter` is responsible for writing a field of an 
`InternalRow` to appropriate
   // JSON data. Here we are using `SpecializedGetters` rather than 
`InternalRow` so that
   // we can directly access data in `ArrayData` without the help of 
`SpecificMutableRow`.
   private type ValueWriter = (SpecializedGetters, Int) => Unit
 
+  // `JacksonGenerator` can only be initialized with a `StructType` or a 
`MapType`.
+  dataType match {
--- End diff --

I'd just do a `require(... .isInstanceOf... || ..., "...")` instead.
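Roughly, a minimal sketch of that shape (the class wrapper, the name 
`JacksonGeneratorSketch`, and the exact message wording are illustrative 
only, not from the PR):

```scala
import org.apache.spark.sql.types.{DataType, MapType, StructType}

// Hypothetical sketch of the suggested precondition: fail fast with a plain,
// un-backquoted message when the generator is given an unsupported type.
class JacksonGeneratorSketch(dataType: DataType) {
  require(dataType.isInstanceOf[StructType] || dataType.isInstanceOf[MapType],
    s"JacksonGenerator only supports a StructType or MapType but got " +
      s"${dataType.simpleString}")
}
```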


---




[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...

2017-09-11 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/18875#discussion_r138058166
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonGenerator.scala
 ---
@@ -26,20 +26,53 @@ import 
org.apache.spark.sql.catalyst.expressions.SpecializedGetters
 import org.apache.spark.sql.catalyst.util.{ArrayData, DateTimeUtils, 
MapData}
 import org.apache.spark.sql.types._
 
+/**
+ * `JacksonGenerator` can only be initialized with a `StructType` or a 
`MapType`.
+ * Once it is initialized with `StructType`, it can be used to write out a 
struct or an array of
+ * struct. Once it is initialized with ``MapType``, it can be used to 
write out a map or an array
--- End diff --

nit: ``` ``MapType`` ``` -> `` `MapType` ``


---




[GitHub] spark issue #19147: [WIP][SPARK-21190][SQL][PYTHON] Vectorized UDFs in Pytho...

2017-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19147
  
**[Test build #81635 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81635/testReport)**
 for PR 19147 at commit 
[`803054e`](https://github.com/apache/spark/commit/803054e9f30b057e1b194b5625cb6216d865f1d4).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #15178: [SPARK-17556][SQL] Executor side broadcast for broadcast...

2017-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15178
  
**[Test build #81636 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81636/testReport)**
 for PR 15178 at commit 
[`93ecabb`](https://github.com/apache/spark/commit/93ecabb7c7c9b2fa42a321e28f834568cadf0272).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19147: [WIP][SPARK-21190][SQL][PYTHON] Vectorized UDFs in Pytho...

2017-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19147
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81635/
Test PASSed.


---




[GitHub] spark issue #15178: [SPARK-17556][SQL] Executor side broadcast for broadcast...

2017-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15178
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81636/
Test PASSed.


---




[GitHub] spark issue #15178: [SPARK-17556][SQL] Executor side broadcast for broadcast...

2017-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15178
  
Merged build finished. Test PASSed.


---




[GitHub] spark pull request #19190: [SPARK-21976][DOC]

2017-09-11 Thread FavioVazquez
GitHub user FavioVazquez opened a pull request:

https://github.com/apache/spark/pull/19190

[SPARK-21976][DOC]

## What changes were proposed in this pull request?

Fixed wrong documentation for Mean Absolute Error.

Even though the code is correct for the MAE:

```scala
@Since("1.2.0")
  def meanAbsoluteError: Double = {
summary.normL1(1) / summary.count
  }
```
In the documentation, the division by N is missing.
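
For reference, the formula the docs should show (with `N` the number of 
instances, i.e. `summary.count` in the snippet above) is:

```latex
\text{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|
```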

## How was this patch tested?

All of spark tests were run.

Please review http://spark.apache.org/contributing.html before opening a 
pull request.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/FavioVazquez/spark mae-fix

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19190.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19190


commit b2e2f8caef4f43d62cb48fc09a0fe17e71f3f4dd
Author: FavioVazquez 
Date:   2015-04-30T19:46:40Z

Merge remote-tracking branch 'upstream/master'

commit edab1ef07312cf44e26fccb7fca8f4a4977ad3ee
Author: FavioVazquez 
Date:   2015-05-05T14:16:02Z

Merge remote-tracking branch 'upstream/master'

commit 9af7074235b6c13001924e037772195b640115b8
Author: FavioVazquez 
Date:   2015-05-15T13:58:04Z

Merge remote-tracking branch 'upstream/master'

commit f27a20b9f53b643d5c963729f56ee548d0c8e263
Author: FavioVazquez 
Date:   2015-06-04T16:10:00Z

Merge remote-tracking branch 'upstream/master'

commit ad882a378e24458832b961fd97eb4b7662203ef9
Author: FavioVazquez 
Date:   2015-06-09T12:47:59Z

Merge remote-tracking branch 'upstream/master'

commit 424a92853f44c34f310e0a9e8dd927d246bd9171
Author: FavioVazquez 
Date:   2015-06-17T14:56:35Z

Merge remote-tracking branch 'upstream/master'

commit 5311719db61454eee5e1715f44507c294509ec1c
Author: FavioVazquez 
Date:   2015-09-21T06:43:29Z

Merge remote-tracking branch 'upstream/master'

commit d6b551b1d25cef07096b7e1fc22b659ed753d9dc
Author: faviovazquez 
Date:   2017-09-11T13:36:27Z

Merge remote-tracking branch 'upstream/master'

commit a95cfc6c5e88f44429319b52462b537ae0bc1857
Author: Favio André Vázquez 
Date:   2017-09-11T13:49:03Z

Fix doc for MAE




---




[GitHub] spark issue #19182: [SPARK-21970][Core] Fix Redundant Throws Declarations in...

2017-09-11 Thread original-brownbear
Github user original-brownbear commented on the issue:

https://github.com/apache/spark/pull/19182
  
@srowen makes perfect sense => rolled back all changes to tests and publicly 
exposed methods (the package-private ones that were adjusted are on non-public classes).


---




[GitHub] spark issue #19147: [WIP][SPARK-21190][SQL][PYTHON] Vectorized UDFs in Pytho...

2017-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19147
  
**[Test build #81621 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81621/testReport)**
 for PR 19147 at commit 
[`dbc6dd2`](https://github.com/apache/spark/commit/dbc6dd2138b427d8436eb0d8bdc4ba134f254e35).
 * This patch **fails due to an unknown error code, -9**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #18875: [SPARK-21513][SQL] Allow UDF to_json support converting ...

2017-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18875
  
**[Test build #81624 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81624/testReport)**
 for PR 18875 at commit 
[`f567679`](https://github.com/apache/spark/commit/f567679ec724052266587e35cdf25868587de750).
 * This patch **fails due to an unknown error code, -9**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19186: [SPARK-21972][ML] Add param handlePersistence

2017-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19186
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #18875: [SPARK-21513][SQL] Allow UDF to_json support converting ...

2017-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18875
  
**[Test build #81620 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81620/testReport)**
 for PR 18875 at commit 
[`b8ad442`](https://github.com/apache/spark/commit/b8ad442437390611c63d8b0705d02c3ca3ba3c9c).
 * This patch **fails due to an unknown error code, -9**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19186: [SPARK-21972][ML] Add param handlePersistence

2017-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19186
  
**[Test build #81623 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81623/testReport)**
 for PR 19186 at commit 
[`f8fa957`](https://github.com/apache/spark/commit/f8fa9573a1b40ff236e9c52cf429e2742c8f2bd0).
 * This patch **fails due to an unknown error code, -9**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #18875: [SPARK-21513][SQL] Allow UDF to_json support converting ...

2017-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18875
  
**[Test build #81625 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81625/testReport)**
 for PR 18875 at commit 
[`b71b56a`](https://github.com/apache/spark/commit/b71b56ab49ba048f397459a18f946422d9440b29).
 * This patch **fails due to an unknown error code, -9**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #18875: [SPARK-21513][SQL] Allow UDF to_json support converting ...

2017-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18875
  
**[Test build #81630 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81630/testReport)**
 for PR 18875 at commit 
[`b71b56a`](https://github.com/apache/spark/commit/b71b56ab49ba048f397459a18f946422d9440b29).


---




[GitHub] spark issue #19187: Branch 2.1

2017-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19187
  
Can one of the admins verify this patch?


---




[GitHub] spark issue #19147: [WIP][SPARK-21190][SQL][PYTHON] Vectorized UDFs in Pytho...

2017-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19147
  
**[Test build #81629 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81629/testReport)**
 for PR 19147 at commit 
[`dbc6dd2`](https://github.com/apache/spark/commit/dbc6dd2138b427d8436eb0d8bdc4ba134f254e35).


---




[GitHub] spark issue #19184: [SPARK-21971][CORE] Too many open files in Spark due to ...

2017-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19184
  
**[Test build #81628 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81628/testReport)**
 for PR 19184 at commit 
[`ea5f9d9`](https://github.com/apache/spark/commit/ea5f9d9903185690670da6384dbd6d2b08b8177f).


---




[GitHub] spark issue #19184: [SPARK-21971][CORE] Too many open files in Spark due to ...

2017-09-11 Thread rajeshbalamohan
Github user rajeshbalamohan commented on the issue:

https://github.com/apache/spark/pull/19184
  
Thanks @viirya . I have updated the patch to address your comments.

This fixes the "too many open files" issue for queries that involve window 
functions (e.g. Q67, Q72, Q14), but the issue still needs to be addressed 
for the merger. Agreed that this would be a partial patch.


---




[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...

2017-09-11 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18875#discussion_r137986264
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala
 ---
@@ -677,14 +696,25 @@ case class StructsToJson(
   override def checkInputDataTypes(): TypeCheckResult = child.dataType 
match {
 case _: StructType | ArrayType(_: StructType, _) =>
   try {
-JacksonUtils.verifySchema(rowSchema)
+JacksonUtils.verifySchema(rowSchema.asInstanceOf[StructType])
+TypeCheckResult.TypeCheckSuccess
+  } catch {
+case e: UnsupportedOperationException =>
+  TypeCheckResult.TypeCheckFailure(e.getMessage)
+  }
+case _: MapType | ArrayType(_: MapType, _) =>
+  // TODO: let `JacksonUtils.verifySchema` verify a `MapType`
+  try {
+val st = StructType(StructField("a", 
rowSchema.asInstanceOf[MapType]) :: Nil)
+JacksonUtils.verifySchema(st)
 TypeCheckResult.TypeCheckSuccess
   } catch {
 case e: UnsupportedOperationException =>
   TypeCheckResult.TypeCheckFailure(e.getMessage)
   }
 case _ => TypeCheckResult.TypeCheckFailure(
-  s"Input type ${child.dataType.simpleString} must be a struct or 
array of structs.")
+  s"Input type ${child.dataType.simpleString} must be a struct, array 
of structs or " +
+  s"an arbitrary map.")
--- End diff --

`an arbitrary map` -> `a map or array of map.`


---




[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...

2017-09-11 Thread goldmedal
Github user goldmedal commented on a diff in the pull request:

https://github.com/apache/spark/pull/18875#discussion_r137987843
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonGenerator.scala
 ---
@@ -193,14 +228,32 @@ private[sql] class JacksonGenerator(
*
* @param row The row to convert
*/
-  def write(row: InternalRow): Unit = writeObject(writeFields(row, schema, 
rootFieldWriters))
+  def write(row: InternalRow): Unit = dataType match {
+case st: StructType =>
+  writeObject(writeFields(row, st, rootFieldWriters))
+case _ => throw new UnsupportedOperationException(
--- End diff --

It will throw `ClassCastException`.
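
A minimal sketch of that failure mode (assuming the generator was 
initialized with a `MapType`; the object name is illustrative, not from 
the PR):

```scala
import org.apache.spark.sql.types.{IntegerType, MapType, StringType, StructType}

object CastSketch extends App {
  val dataType = MapType(StringType, IntegerType)
  // Without the explicit match on dataType, the plain cast fails first,
  // so the intended UnsupportedOperationException is never reached:
  dataType.asInstanceOf[StructType] // throws java.lang.ClassCastException
}
```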


---




[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...

2017-09-11 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16677
  
retest this please.


---




[GitHub] spark issue #18875: [SPARK-21513][SQL] Allow UDF to_json support converting ...

2017-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18875
  
**[Test build #81625 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81625/testReport)**
 for PR 18875 at commit 
[`b71b56a`](https://github.com/apache/spark/commit/b71b56ab49ba048f397459a18f946422d9440b29).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #18875: [SPARK-21513][SQL] Allow UDF to_json support converting ...

2017-09-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18875
  
**[Test build #81631 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81631/testReport)**
 for PR 18875 at commit 
[`be62e51`](https://github.com/apache/spark/commit/be62e51109c8a438fed8d787fb2986aeb1974f82).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19188: [SPARK-21973][SQL] Add an new option to filter queries i...

2017-09-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19188
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81633/
Test FAILed.


---



