[GitHub] spark pull request #19188: [SPARK-21973][SQL] Add an new option to filter qu...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/19188#discussion_r138009182 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/TPCDSQueryBenchmark.scala --- @@ -113,12 +114,40 @@ object TPCDSQueryBenchmark { "q81", "q82", "q83", "q84", "q85", "q86", "q87", "q88", "q89", "q90", "q91", "q92", "q93", "q94", "q95", "q96", "q97", "q98", "q99") +val sparkConf = new SparkConf() + .setMaster("local[1]") + .setAppName("test-sql-context") + .set("spark.sql.parquet.compression.codec", "snappy") + .set("spark.sql.shuffle.partitions", "4") + .set("spark.driver.memory", "3g") + .set("spark.executor.memory", "3g") + .set("spark.sql.autoBroadcastJoinThreshold", (20 * 1024 * 1024).toString) + .set("spark.sql.crossJoin.enabled", "true") + +// If `spark.sql.tpcds.queryFilter` defined, this class filters the queries that --- End diff -- nit: `class` -> `variable`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19147: [WIP][SPARK-21190][SQL][PYTHON] Vectorized UDFs in Pytho...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19147 **[Test build #81635 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81635/testReport)** for PR 19147 at commit [`803054e`](https://github.com/apache/spark/commit/803054e9f30b057e1b194b5625cb6216d865f1d4). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19131: [MINOR][SQL]remove unuse import class
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19131 I'm going to merge it as a cleanup but yeah let's not do these often. I would favor adding a style check for this if one can be found, but don't see it in scalastyle. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19147: [WIP][SPARK-21190][SQL][PYTHON] Vectorized UDFs in Pytho...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19147 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81629/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19169: [SPARK-21957][SQL] Add current_user function
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19169 On local I solved the build error by running a `mvn clean`. As pointed out by @maropu , a PR removed the class and then the incremental compilation fails. I am not sure why this is happening on jenkins though... --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19147: [WIP][SPARK-21190][SQL][PYTHON] Vectorized UDFs i...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/19147#discussion_r138012166 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala --- @@ -62,6 +62,7 @@ import org.apache.spark.util.Utils */ case class BatchEvalPythonExec(udfs: Seq[PythonUDF], output: Seq[Attribute], child: SparkPlan) --- End diff -- Thanks! Let's see if others have any suggestions for a while. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15178: [SPARK-17556][SQL] Executor side broadcast for broadcast...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/15178 @rxin Do we still consider to incorporate this broadcast on executor feature? Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18337: [SPARK-21131][GraphX] Fix batch gradient bug in SVDPlusP...
Github user lxmly commented on the issue: https://github.com/apache/spark/pull/18337 which dataset? @daniellaah --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/18538#discussion_r138023290 --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala --- @@ -0,0 +1,437 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.evaluation + +import org.apache.spark.SparkContext +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.broadcast.Broadcast +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT} +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators} +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol} +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils} +import org.apache.spark.sql.{DataFrame, Dataset} +import org.apache.spark.sql.functions.{avg, col, udf} +import org.apache.spark.sql.types.IntegerType + +/** + * :: Experimental :: + * Evaluator for clustering results. + * The metric computes the Silhouette measure + * using the squared Euclidean distance. + * + * The Silhouette is a measure for the validation + * of the consistency within clusters. It ranges + * between 1 and -1, where a value close to 1 + * means that the points in a cluster are close + * to the other points in the same cluster and + * far from the points of the other clusters. + */ +@Experimental +@Since("2.3.0") +class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: String) + extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable { + + @Since("2.3.0") + def this() = this(Identifiable.randomUID("cluEval")) + + @Since("2.3.0") + override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap) + + @Since("2.3.0") + override def isLargerBetter: Boolean = true + + /** @group setParam */ + @Since("2.3.0") + def setPredictionCol(value: String): this.type = set(predictionCol, value) + + /** @group setParam */ + @Since("2.3.0") + def setFeaturesCol(value: String): this.type = set(featuresCol, value) + + /** + * param for metric name in evaluation + * (supports `"squaredSilhouette"` (default)) + * @group param + */ + @Since("2.3.0") + val metricName: Param[String] = { +val allowedParams = ParamValidators.inArray(Array("squaredSilhouette")) +new Param( + this, + "metricName", + "metric name in evaluation (squaredSilhouette)", + allowedParams +) + } + + /** @group getParam */ + @Since("2.3.0") + def getMetricName: String = $(metricName) + + /** @group setParam */ + @Since("2.3.0") + def setMetricName(value: String): this.type = set(metricName, value) + + setDefault(metricName -> "squaredSilhouette") + + @Since("2.3.0") + override def evaluate(dataset: Dataset[_]): Double = { +SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT) +SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType) + +// Silhouette is reasonable only when the number of clusters is grater then 1 +assert(dataset.select($(predictionCol)).distinct().count() > 1, + "Number of clusters must be greater than one.") + +$(metricName) match { + case "squaredSilhouette" => SquaredEuclideanSilhouette.computeSilhouetteScore( +dataset, +$(predictionCol), +$(featuresCol) + ) +} + } +} + + +@Since("2.3.0") +object ClusteringEvaluator + extends DefaultParamsReadable[ClusteringEvaluator] { + + @Since("2.3.0") + override def load(path: String): ClusteringEvaluator = super.load(path) + +} + + +/** + * SquaredEuclideanSilhouette computes the average of
[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/18538#discussion_r138027427 --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala --- @@ -0,0 +1,437 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.evaluation + +import org.apache.spark.SparkContext +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.broadcast.Broadcast +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT} +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators} +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol} +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils} +import org.apache.spark.sql.{DataFrame, Dataset} +import org.apache.spark.sql.functions.{avg, col, udf} +import org.apache.spark.sql.types.IntegerType + +/** + * :: Experimental :: + * Evaluator for clustering results. + * The metric computes the Silhouette measure + * using the squared Euclidean distance. + * + * The Silhouette is a measure for the validation + * of the consistency within clusters. It ranges + * between 1 and -1, where a value close to 1 + * means that the points in a cluster are close + * to the other points in the same cluster and + * far from the points of the other clusters. + */ +@Experimental +@Since("2.3.0") +class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: String) + extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable { + + @Since("2.3.0") + def this() = this(Identifiable.randomUID("cluEval")) + + @Since("2.3.0") + override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap) + + @Since("2.3.0") + override def isLargerBetter: Boolean = true + + /** @group setParam */ + @Since("2.3.0") + def setPredictionCol(value: String): this.type = set(predictionCol, value) + + /** @group setParam */ + @Since("2.3.0") + def setFeaturesCol(value: String): this.type = set(featuresCol, value) + + /** + * param for metric name in evaluation + * (supports `"squaredSilhouette"` (default)) + * @group param + */ + @Since("2.3.0") + val metricName: Param[String] = { +val allowedParams = ParamValidators.inArray(Array("squaredSilhouette")) --- End diff -- I'd suggest the metric name is ```silhouette```, since we may add silhouette for other distance, then we can add another param like ```distance``` to control that. The param ```metricName``` should not bind to any distance computation way. There are lots of other metrics for clustering algorithms, like [these](http://scikit-learn.org/stable/modules/classes.html#clustering-metrics) in sklearn. We would not add all of them for MLlib, but we may add part of them in the future. cc @jkbradley @MLnick @WeichenXu123 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/18538#discussion_r138024385 --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala --- @@ -0,0 +1,437 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.evaluation + +import org.apache.spark.SparkContext +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.broadcast.Broadcast +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT} +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators} +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol} +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils} +import org.apache.spark.sql.{DataFrame, Dataset} +import org.apache.spark.sql.functions.{avg, col, udf} +import org.apache.spark.sql.types.IntegerType + +/** + * :: Experimental :: + * Evaluator for clustering results. + * The metric computes the Silhouette measure + * using the squared Euclidean distance. + * + * The Silhouette is a measure for the validation + * of the consistency within clusters. It ranges + * between 1 and -1, where a value close to 1 + * means that the points in a cluster are close + * to the other points in the same cluster and + * far from the points of the other clusters. + */ +@Experimental +@Since("2.3.0") +class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: String) + extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable { + + @Since("2.3.0") + def this() = this(Identifiable.randomUID("cluEval")) + + @Since("2.3.0") + override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap) + + @Since("2.3.0") + override def isLargerBetter: Boolean = true + + /** @group setParam */ + @Since("2.3.0") + def setPredictionCol(value: String): this.type = set(predictionCol, value) + + /** @group setParam */ + @Since("2.3.0") + def setFeaturesCol(value: String): this.type = set(featuresCol, value) + + /** + * param for metric name in evaluation + * (supports `"squaredSilhouette"` (default)) + * @group param + */ + @Since("2.3.0") + val metricName: Param[String] = { +val allowedParams = ParamValidators.inArray(Array("squaredSilhouette")) +new Param( + this, + "metricName", + "metric name in evaluation (squaredSilhouette)", + allowedParams +) + } + + /** @group getParam */ + @Since("2.3.0") + def getMetricName: String = $(metricName) + + /** @group setParam */ + @Since("2.3.0") + def setMetricName(value: String): this.type = set(metricName, value) + + setDefault(metricName -> "squaredSilhouette") + + @Since("2.3.0") + override def evaluate(dataset: Dataset[_]): Double = { +SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT) +SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType) + +// Silhouette is reasonable only when the number of clusters is grater then 1 +assert(dataset.select($(predictionCol)).distinct().count() > 1, + "Number of clusters must be greater than one.") + +$(metricName) match { + case "squaredSilhouette" => SquaredEuclideanSilhouette.computeSilhouetteScore( +dataset, +$(predictionCol), +$(featuresCol) + ) +} + } +} + + +@Since("2.3.0") +object ClusteringEvaluator + extends DefaultParamsReadable[ClusteringEvaluator] { + + @Since("2.3.0") + override def load(path: String): ClusteringEvaluator = super.load(path) + +} + + +/** + * SquaredEuclideanSilhouette computes the average of
[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/18538#discussion_r138025184 --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala --- @@ -0,0 +1,437 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.evaluation + +import org.apache.spark.SparkContext +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.broadcast.Broadcast +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT} +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators} +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol} +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils} +import org.apache.spark.sql.{DataFrame, Dataset} +import org.apache.spark.sql.functions.{avg, col, udf} +import org.apache.spark.sql.types.IntegerType + +/** + * :: Experimental :: + * Evaluator for clustering results. + * The metric computes the Silhouette measure + * using the squared Euclidean distance. + * + * The Silhouette is a measure for the validation + * of the consistency within clusters. It ranges + * between 1 and -1, where a value close to 1 + * means that the points in a cluster are close + * to the other points in the same cluster and + * far from the points of the other clusters. + */ +@Experimental +@Since("2.3.0") +class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: String) + extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable { + + @Since("2.3.0") + def this() = this(Identifiable.randomUID("cluEval")) + + @Since("2.3.0") + override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap) + + @Since("2.3.0") + override def isLargerBetter: Boolean = true + + /** @group setParam */ + @Since("2.3.0") + def setPredictionCol(value: String): this.type = set(predictionCol, value) + + /** @group setParam */ + @Since("2.3.0") + def setFeaturesCol(value: String): this.type = set(featuresCol, value) + + /** + * param for metric name in evaluation + * (supports `"squaredSilhouette"` (default)) + * @group param + */ + @Since("2.3.0") + val metricName: Param[String] = { +val allowedParams = ParamValidators.inArray(Array("squaredSilhouette")) +new Param( + this, + "metricName", + "metric name in evaluation (squaredSilhouette)", + allowedParams +) + } + + /** @group getParam */ + @Since("2.3.0") + def getMetricName: String = $(metricName) + + /** @group setParam */ + @Since("2.3.0") + def setMetricName(value: String): this.type = set(metricName, value) + + setDefault(metricName -> "squaredSilhouette") + + @Since("2.3.0") + override def evaluate(dataset: Dataset[_]): Double = { +SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT) +SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType) + +// Silhouette is reasonable only when the number of clusters is grater then 1 +assert(dataset.select($(predictionCol)).distinct().count() > 1, + "Number of clusters must be greater than one.") + +$(metricName) match { + case "squaredSilhouette" => SquaredEuclideanSilhouette.computeSilhouetteScore( +dataset, +$(predictionCol), +$(featuresCol) + ) +} + } +} + + +@Since("2.3.0") +object ClusteringEvaluator + extends DefaultParamsReadable[ClusteringEvaluator] { + + @Since("2.3.0") + override def load(path: String): ClusteringEvaluator = super.load(path) + +} + + +/** + * SquaredEuclideanSilhouette computes the average of
[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/18538#discussion_r138025640 --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala --- @@ -0,0 +1,437 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.evaluation + +import org.apache.spark.SparkContext +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.broadcast.Broadcast +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT} +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators} +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol} +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils} +import org.apache.spark.sql.{DataFrame, Dataset} +import org.apache.spark.sql.functions.{avg, col, udf} +import org.apache.spark.sql.types.IntegerType + +/** + * :: Experimental :: + * Evaluator for clustering results. + * The metric computes the Silhouette measure + * using the squared Euclidean distance. + * + * The Silhouette is a measure for the validation + * of the consistency within clusters. It ranges + * between 1 and -1, where a value close to 1 + * means that the points in a cluster are close + * to the other points in the same cluster and + * far from the points of the other clusters. + */ +@Experimental +@Since("2.3.0") +class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: String) + extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable { + + @Since("2.3.0") + def this() = this(Identifiable.randomUID("cluEval")) + + @Since("2.3.0") + override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap) + + @Since("2.3.0") + override def isLargerBetter: Boolean = true + + /** @group setParam */ + @Since("2.3.0") + def setPredictionCol(value: String): this.type = set(predictionCol, value) + + /** @group setParam */ + @Since("2.3.0") + def setFeaturesCol(value: String): this.type = set(featuresCol, value) + + /** + * param for metric name in evaluation + * (supports `"squaredSilhouette"` (default)) + * @group param + */ + @Since("2.3.0") + val metricName: Param[String] = { +val allowedParams = ParamValidators.inArray(Array("squaredSilhouette")) +new Param( + this, + "metricName", + "metric name in evaluation (squaredSilhouette)", + allowedParams +) + } + + /** @group getParam */ + @Since("2.3.0") + def getMetricName: String = $(metricName) + + /** @group setParam */ + @Since("2.3.0") + def setMetricName(value: String): this.type = set(metricName, value) + + setDefault(metricName -> "squaredSilhouette") + + @Since("2.3.0") + override def evaluate(dataset: Dataset[_]): Double = { +SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT) +SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType) + +// Silhouette is reasonable only when the number of clusters is grater then 1 +assert(dataset.select($(predictionCol)).distinct().count() > 1, --- End diff -- Move this check to L418, in case another unnecessary computation for most of the cases(cluster size > 1). See my comment at L418. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/18538#discussion_r138024573 --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala --- @@ -0,0 +1,437 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.evaluation + +import org.apache.spark.SparkContext +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.broadcast.Broadcast +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT} +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators} +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol} +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils} +import org.apache.spark.sql.{DataFrame, Dataset} +import org.apache.spark.sql.functions.{avg, col, udf} +import org.apache.spark.sql.types.IntegerType + +/** + * :: Experimental :: + * Evaluator for clustering results. + * The metric computes the Silhouette measure + * using the squared Euclidean distance. + * + * The Silhouette is a measure for the validation + * of the consistency within clusters. It ranges + * between 1 and -1, where a value close to 1 + * means that the points in a cluster are close + * to the other points in the same cluster and + * far from the points of the other clusters. + */ +@Experimental +@Since("2.3.0") +class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: String) + extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable { + + @Since("2.3.0") + def this() = this(Identifiable.randomUID("cluEval")) + + @Since("2.3.0") + override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap) + + @Since("2.3.0") + override def isLargerBetter: Boolean = true + + /** @group setParam */ + @Since("2.3.0") + def setPredictionCol(value: String): this.type = set(predictionCol, value) + + /** @group setParam */ + @Since("2.3.0") + def setFeaturesCol(value: String): this.type = set(featuresCol, value) + + /** + * param for metric name in evaluation + * (supports `"squaredSilhouette"` (default)) + * @group param + */ + @Since("2.3.0") + val metricName: Param[String] = { +val allowedParams = ParamValidators.inArray(Array("squaredSilhouette")) +new Param( + this, + "metricName", + "metric name in evaluation (squaredSilhouette)", + allowedParams +) + } + + /** @group getParam */ + @Since("2.3.0") + def getMetricName: String = $(metricName) + + /** @group setParam */ + @Since("2.3.0") + def setMetricName(value: String): this.type = set(metricName, value) + + setDefault(metricName -> "squaredSilhouette") + + @Since("2.3.0") + override def evaluate(dataset: Dataset[_]): Double = { +SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT) +SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType) + +// Silhouette is reasonable only when the number of clusters is grater then 1 +assert(dataset.select($(predictionCol)).distinct().count() > 1, + "Number of clusters must be greater than one.") + +$(metricName) match { + case "squaredSilhouette" => SquaredEuclideanSilhouette.computeSilhouetteScore( +dataset, +$(predictionCol), +$(featuresCol) + ) +} + } +} + + +@Since("2.3.0") +object ClusteringEvaluator + extends DefaultParamsReadable[ClusteringEvaluator] { + + @Since("2.3.0") + override def load(path: String): ClusteringEvaluator = super.load(path) + +} + + +/** + * SquaredEuclideanSilhouette computes the average of
[GitHub] spark pull request #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/18538#discussion_r138021102 --- Diff: mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala --- @@ -0,0 +1,437 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.evaluation + +import org.apache.spark.SparkContext +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.broadcast.Broadcast +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors, VectorUDT} +import org.apache.spark.ml.param.{Param, ParamMap, ParamValidators} +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasPredictionCol} +import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable, SchemaUtils} +import org.apache.spark.sql.{DataFrame, Dataset} +import org.apache.spark.sql.functions.{avg, col, udf} +import org.apache.spark.sql.types.IntegerType + +/** + * :: Experimental :: + * Evaluator for clustering results. + * The metric computes the Silhouette measure + * using the squared Euclidean distance. + * + * The Silhouette is a measure for the validation + * of the consistency within clusters. It ranges + * between 1 and -1, where a value close to 1 + * means that the points in a cluster are close + * to the other points in the same cluster and + * far from the points of the other clusters. + */ +@Experimental +@Since("2.3.0") +class ClusteringEvaluator @Since("2.3.0") (@Since("2.3.0") override val uid: String) + extends Evaluator with HasPredictionCol with HasFeaturesCol with DefaultParamsWritable { + + @Since("2.3.0") + def this() = this(Identifiable.randomUID("cluEval")) + + @Since("2.3.0") + override def copy(pMap: ParamMap): ClusteringEvaluator = this.defaultCopy(pMap) + + @Since("2.3.0") + override def isLargerBetter: Boolean = true + + /** @group setParam */ + @Since("2.3.0") + def setPredictionCol(value: String): this.type = set(predictionCol, value) + + /** @group setParam */ + @Since("2.3.0") + def setFeaturesCol(value: String): this.type = set(featuresCol, value) + + /** + * param for metric name in evaluation + * (supports `"squaredSilhouette"` (default)) + * @group param + */ + @Since("2.3.0") + val metricName: Param[String] = { +val allowedParams = ParamValidators.inArray(Array("squaredSilhouette")) +new Param( + this, + "metricName", + "metric name in evaluation (squaredSilhouette)", + allowedParams +) + } + + /** @group getParam */ + @Since("2.3.0") + def getMetricName: String = $(metricName) + + /** @group setParam */ + @Since("2.3.0") + def setMetricName(value: String): this.type = set(metricName, value) + + setDefault(metricName -> "squaredSilhouette") + + @Since("2.3.0") + override def evaluate(dataset: Dataset[_]): Double = { +SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT) +SchemaUtils.checkColumnType(dataset.schema, $(predictionCol), IntegerType) + +// Silhouette is reasonable only when the number of clusters is grater then 1 +assert(dataset.select($(predictionCol)).distinct().count() > 1, + "Number of clusters must be greater than one.") + +$(metricName) match { + case "squaredSilhouette" => SquaredEuclideanSilhouette.computeSilhouetteScore( +dataset, +$(predictionCol), +$(featuresCol) + ) +} + } +} + + +@Since("2.3.0") +object ClusteringEvaluator + extends DefaultParamsReadable[ClusteringEvaluator] { + + @Since("2.3.0") + override def load(path: String): ClusteringEvaluator = super.load(path) + +} + + +/** + * SquaredEuclideanSilhouette computes the average of
[GitHub] spark issue #18875: [SPARK-21513][SQL] Allow UDF to_json support converting ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18875 **[Test build #81632 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81632/testReport)** for PR 18875 at commit [`8835aca`](https://github.com/apache/spark/commit/8835aca5a5ccf420010b24e43189bfa8f1e8dc2a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18875: [SPARK-21513][SQL] Allow UDF to_json support converting ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18875 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16677 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81627/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16677 **[Test build #81627 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81627/testReport)** for PR 16677 at commit [`f2a7aac`](https://github.com/apache/spark/commit/f2a7aacc22caf27c1a8af612c9432586e6a86d17). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18875: [SPARK-21513][SQL] Allow UDF to_json support converting ...
Github user goldmedal commented on the issue: https://github.com/apache/spark/pull/18875 @HyukjinKwon We have finished the `MapType` and `ArrayType` of `MapType`s supporting. Please take a look when you are available. Thanks :) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19147: [WIP][SPARK-21190][SQL][PYTHON] Vectorized UDFs in Pytho...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19147 **[Test build #81629 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81629/testReport)** for PR 19147 at commit [`dbc6dd2`](https://github.com/apache/spark/commit/dbc6dd2138b427d8436eb0d8bdc4ba134f254e35). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19147: [WIP][SPARK-21190][SQL][PYTHON] Vectorized UDFs in Pytho...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19147 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19147: [WIP][SPARK-21190][SQL][PYTHON] Vectorized UDFs i...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/19147#discussion_r138003592 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala --- @@ -62,6 +62,7 @@ import org.apache.spark.util.Utils */ case class BatchEvalPythonExec(udfs: Seq[PythonUDF], output: Seq[Attribute], child: SparkPlan) --- End diff -- How about `BlockedEvalPythonExec` or something? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/17819 **[Test build #81634 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81634/testReport)** for PR 17819 at commit [`f8dedd1`](https://github.com/apache/spark/commit/f8dedd1c92a8c48358743626b99c2f2192bc09b1). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18875: [SPARK-21513][SQL] Allow UDF to_json support converting ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18875 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18875: [SPARK-21513][SQL] Allow UDF to_json support converting ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18875 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81631/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19186: [SPARK-21972][ML] Add param handlePersistence
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19186 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17819 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81634/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/17819 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19086: [SPARK-21874][SQL] Support changing database when rename...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/19086 @gatorsmile OK and thanks a lot for review :) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19184: [SPARK-21971][CORE] Too many open files in Spark due to ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19184 @rajeshbalamohan Thanks for updating. I think we need a complete fix as previous comments from the reviewers @jerryshao @kiszk @jiangxb1987 suggested. Can you try to fix this according to the comments? Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19189: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...
GitHub user fjh100456 opened a pull request: https://github.com/apache/spark/pull/19189 [SPARK-21786][SQL] The 'spark.sql.parquet.compression.codec' configuration doesn't take effect on tables with partition field(s) ## What changes were proposed in this pull request? Pass the spark configuration âspark.sql.parquet.compression.codecâ to hive so that it can take effect. ## How was this patch tested? Manually checked. You can merge this pull request into a Git repository by running: $ git pull https://github.com/fjh100456/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19189.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19189 commit f91ed07718d4462e487237353bf82c80c5e148f7 Author: fjh100456Date: 2017-06-19T08:53:06Z [SPARK-21135][WEB UI] On history server pageï¼duration of incompleted applications should be hidden instead of showing up as 0 ## What changes were proposed in this pull request? Hide duration of incomplete applications. ## How was this patch tested? manual tests commit faffde55e38d1db9f54338f29429f98baf73def5 Author: fjh100456 Date: 2017-06-30T06:15:01Z Merge pull request #1 from apache/master æ´æ°ä»£ç å°2017-06-30 commit 44ff604cb6fca37acafd018c3e25888fbe1df1ae Author: fjh100456 Date: 2017-09-08T09:02:59Z Merge pull request #2 from apache/master Merge spark master --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19189: [SPARK-21786][SQL] The 'spark.sql.parquet.compression.co...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19189 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19174: [SPARK-21963][CORE][TEST]Create temp file should be dele...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19174 Same, let's not bother with stuff this trivial @heary-cao please. If it really makes the code consistent on this one point, I'm not against this, other than that it encourages more PRs this small. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19186: [SPARK-21972][ML] Add param handlePersistence
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19186 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81626/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19186: [SPARK-21972][ML] Add param handlePersistence
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19186 **[Test build #81626 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81626/testReport)** for PR 19186 at commit [`f8fa957`](https://github.com/apache/spark/commit/f8fa9573a1b40ff236e9c52cf429e2742c8f2bd0). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16677 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/17819 ping @MLnick Can you have time to help review this recently? Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19147: [WIP][SPARK-21190][SQL][PYTHON] Vectorized UDFs i...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19147#discussion_r138010327 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala --- @@ -62,6 +62,7 @@ import org.apache.spark.util.Utils */ case class BatchEvalPythonExec(udfs: Seq[PythonUDF], output: Seq[Attribute], child: SparkPlan) --- End diff -- I feel it is better than `BatchEvalPythonExec`. Don't know if others have any suggestion. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19188: [SPARK-21973][SQL] Add an new option to filter qu...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/19188#discussion_r138010133 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/TPCDSQueryBenchmark.scala --- @@ -113,12 +114,40 @@ object TPCDSQueryBenchmark { "q81", "q82", "q83", "q84", "q85", "q86", "q87", "q88", "q89", "q90", "q91", "q92", "q93", "q94", "q95", "q96", "q97", "q98", "q99") +val sparkConf = new SparkConf() + .setMaster("local[1]") + .setAppName("test-sql-context") + .set("spark.sql.parquet.compression.codec", "snappy") + .set("spark.sql.shuffle.partitions", "4") + .set("spark.driver.memory", "3g") + .set("spark.executor.memory", "3g") + .set("spark.sql.autoBroadcastJoinThreshold", (20 * 1024 * 1024).toString) + .set("spark.sql.crossJoin.enabled", "true") + +// If `spark.sql.tpcds.queryFilter` defined, this class filters the queries that +// this option selects. +val queryFilter = sparkConf + .getOption("spark.sql.tpcds.queryFilter").map(_.split(",").map(_.trim).toSet) + .getOrElse(Set.empty) + +val queriesToRun = if (queryFilter.nonEmpty) { + val queries = tpcdsAllQueries.filter { case queryName => queryFilter.contains(queryName) } + if (queries.isEmpty) { +throw new RuntimeException("Bad query name filter: " + queryFilter) + } + queries +} else { + tpcdsAllQueries +} + // In order to run this benchmark, please follow the instructions at // https://github.com/databricks/spark-sql-perf/blob/master/README.md to generate the TPCDS data // locally (preferably with a scale factor of 5 for benchmarking). Thereafter, the value of // dataLocation below needs to be set to the location where the generated data is stored. val dataLocation = "" -tpcdsAll(dataLocation, queries = tpcdsQueries) +val spark = SparkSession.builder.config(sparkConf).getOrCreate() +val tpcdsQueries = TpcdsQueries(spark, queries = queriesToRun, dataLocation) --- End diff -- nit: Do we need `queries =`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15178: [SPARK-17556][SQL] Executor side broadcast for broadcast...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15178 **[Test build #81636 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81636/testReport)** for PR 15178 at commit [`93ecabb`](https://github.com/apache/spark/commit/93ecabb7c7c9b2fa42a321e28f834568cadf0272). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19189: [SPARK-21786][SQL] The 'spark.sql.parquet.compres...
Github user fjh100456 closed the pull request at: https://github.com/apache/spark/pull/19189 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19172: [SPARK-21856] Add probability and rawPrediction to MLPC ...
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/19172 LGTM2, merged into master. Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19182: [SPARK-21970][Core] Fix Redundant Throws Declarations in...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/19182 Some of these 'throws' clauses may not be removable because they cause callers that catch the checked exception to fail to compile. Removing "throws Exception" in tests isn't obviously helpful, because it means the signature has to change any time the tested code happens to throw a new checked exception. There's no API problem as nothing invokes these methods directly. Private non-test methods, maybe. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18732: [SPARK-20396][SQL][PySpark] groupby().apply() with panda...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18732 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19132: [SPARK-21922] Fix duration always updating when task fai...
Github user caneGuy commented on the issue: https://github.com/apache/spark/pull/19132 @jerryshao Thanks for your time. IIUC, event log is completed since driver has not dropped any event of executor which has problem described above.See below,driver only drop two events after shutting down: `2017-09-11,17:08:57,605 ERROR org.apache.spark.scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(300,WrappedArray()) 2017-09-11,17:08:57,623 INFO org.apache.spark.deploy.yarn.YarnAllocator: Driver requested a total number of 0 executor(s). 2017-09-11,17:08:57,624 INFO org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend: Shutting down all executors 2017-09-11,17:08:57,624 INFO org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend: Asking each executor to shut down 2017-09-11,17:08:57,629 ERROR org.apache.spark.scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(258,WrappedArray())` And below showing an executor has this problem: ![running](https://user-images.githubusercontent.com/26762018/30268618-ea27e8bc-9718-11e7-9f40-3071116255eb.png) ![running02](https://user-images.githubusercontent.com/26762018/30268619-ea318c6e-9718-11e7-968d-ba2174e2f079.png) Case here is a job running today of our cluster. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18875: [SPARK-21513][SQL] Allow UDF to_json support converting ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18875 **[Test build #81631 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81631/testReport)** for PR 18875 at commit [`be62e51`](https://github.com/apache/spark/commit/be62e51109c8a438fed8d787fb2986aeb1974f82). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18875: [SPARK-21513][SQL] Allow UDF to_json support converting ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18875 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81632/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19184: [SPARK-21971][CORE] Too many open files in Spark due to ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19184 **[Test build #81628 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81628/testReport)** for PR 19184 at commit [`ea5f9d9`](https://github.com/apache/spark/commit/ea5f9d9903185690670da6384dbd6d2b08b8177f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19184: [SPARK-21971][CORE] Too many open files in Spark due to ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19184 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81628/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19184: [SPARK-21971][CORE] Too many open files in Spark due to ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19184 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19147: [WIP][SPARK-21190][SQL][PYTHON] Vectorized UDFs in Pytho...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19147 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18538 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18538 **[Test build #81639 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81639/testReport)** for PR 18538 at commit [`b0b7853`](https://github.com/apache/spark/commit/b0b7853d68c1c79bd49d6e290d3c96fe9e3af6ea). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/18538 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81639/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19190: [SPARK-21976][DOC]
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19190 **[Test build #3917 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3917/testReport)** for PR 19190 at commit [`a95cfc6`](https://github.com/apache/spark/commit/a95cfc6c5e88f44429319b52462b537ae0bc1857). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19185: [Spark-21854] Added LogisticRegressionTrainingSum...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/19185#discussion_r138047016 --- Diff: python/pyspark/ml/classification.py --- @@ -603,6 +614,112 @@ def featuresCol(self): """ return self._call_java("featuresCol") +@property +@since("2.3.0") +def labels(self): +""" +Returns the sequence of labels in ascending order. This order matches the order used +in metrics which are specified as arrays over labels, e.g., truePositiveRateByLabel. + +Note: In most cases, it will be values {0.0, 1.0, ..., numClasses-1}, However, if the +training set is missing a label, then all of the arrays over labels +(e.g., from truePositiveRateByLabel) will be of length numClasses-1 instead of the +expected numClasses. +""" +return self._call_java("labels") + +@property +@since("2.3.0") +def truePositiveRateByLabel(self): +""" +Returns true positive rate for each label (category). +""" +return self._call_java("truePositiveRateByLabel") + +@property +@since("2.3.0") +def falsePositiveRateByLabel(self): +""" +Returns false positive rate for each label (category). +""" +return self._call_java("falsePositiveRateByLabel") + +@property +@since("2.3.0") +def precisionByLabel(self): +""" +Returns precision for each label (category). +""" +return self._call_java("precisionByLabel") + +@property +@since("2.3.0") +def recallByLabel(self): +""" +Returns recall for each label (category). +""" +return self._call_java("recallByLabel") + +@property +@since("2.3.0") +def fMeasureByLabel(self, beta=1.0): +""" +Returns f-measure for each label (category). +""" +return self._call_java("fMeasureByLabel", beta) + +@property +@since("2.3.0") +def accuracy(self): +""" +Returns accuracy. +(equals to the total number of correctly classified instances +out of the total number of instances.) +""" +return self._call_java("accuracy") + +@property +@since("2.3.0") +def weightedTruePositiveRate(self): +""" +Returns weighted true positive rate. +(equals to precision, recall and f-measure) +""" +return self._call_java("weightedTruePositiveRate") + +@property +@since("2.3.0") +def weightedFalsePositiveRate(self): +""" +Returns weighted false positive rate. +""" +return self._call_java("weightedFalsePositiveRate") + +@property +@since("2.3.0") +def weightedRecall(self): +""" +Returns weighted averaged recall. +(equals to precision, recall and f-measure) +""" +return self._call_java("weightedRecall") + +@property +@since("2.3.0") +def weightedPrecision(self): +""" +Returns weighted averaged precision. +""" +return self._call_java("weightedPrecision") + +@property --- End diff -- Remove this annotation. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19185: [Spark-21854] Added LogisticRegressionTrainingSum...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/19185#discussion_r138048004 --- Diff: python/pyspark/ml/tests.py --- @@ -1478,6 +1478,40 @@ def test_logistic_regression_summary(self): sameSummary = model.evaluate(df) self.assertAlmostEqual(sameSummary.areaUnderROC, s.areaUnderROC) +def test_multiclass_logistic_regression_summary(self): +df = self.spark.createDataFrame([(1.0, 2.0, Vectors.dense(1.0)), + (0.0, 2.0, Vectors.sparse(1, [], [])), + (2.0, 2.0, Vectors.dense(2.0)), + (2.0, 2.0, Vectors.dense(1.9))], +["label", "weight", "features"]) +lr = LogisticRegression(maxIter=5, regParam=0.01, weightCol="weight", fitIntercept=False) +model = lr.fit(df) +self.assertTrue(model.hasSummary) +s = model.summary +# test that api is callable and returns expected types +self.assertTrue(isinstance(s.predictions, DataFrame)) +self.assertEqual(s.probabilityCol, "probability") +self.assertEqual(s.labelCol, "label") +self.assertEqual(s.featuresCol, "features") +self.assertEqual(s.predictionCol, "prediction") +objHist = s.objectiveHistory +self.assertTrue(isinstance(objHist, list) and isinstance(objHist[0], float)) +self.assertGreater(s.totalIterations, 0) +self.assertTrue(isinstance(s.labels, list)) +self.assertTrue(isinstance(s.truePositiveRateByLabel, list)) +self.assertTrue(isinstance(s.falsePositiveRateByLabel, list)) +self.assertTrue(isinstance(s.precisionByLabel, list)) +self.assertTrue(isinstance(s.recallByLabel, list)) +self.assertTrue(isinstance(s.fMeasureByLabel, list)) +self.assertAlmostEqual(s.accuracy, 0.75, 2) +self.assertAlmostEqual(s.weightedTruePositiveRate, 0.75, 2) +self.assertAlmostEqual(s.weightedFalsePositiveRate, 0.25, 2) +self.assertAlmostEqual(s.weightedRecall, 0.75, 2) +self.assertAlmostEqual(s.weightedPrecision, 0.583, 2) +self.assertAlmostEqual(s.weightedFMeasure, 0.65, 2) --- End diff -- We need to add these check for the above ```test_logistic_regression_summary``` and rename it to ```test_binary_logistic_regression_summary```, since binary logistic regression summary has these variables as well. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19185: [Spark-21854] Added LogisticRegressionTrainingSum...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/19185#discussion_r138046547 --- Diff: python/pyspark/ml/classification.py --- @@ -603,6 +614,112 @@ def featuresCol(self): """ return self._call_java("featuresCol") +@property +@since("2.3.0") +def labels(self): +""" +Returns the sequence of labels in ascending order. This order matches the order used +in metrics which are specified as arrays over labels, e.g., truePositiveRateByLabel. + +Note: In most cases, it will be values {0.0, 1.0, ..., numClasses-1}, However, if the +training set is missing a label, then all of the arrays over labels +(e.g., from truePositiveRateByLabel) will be of length numClasses-1 instead of the +expected numClasses. +""" +return self._call_java("labels") + +@property +@since("2.3.0") +def truePositiveRateByLabel(self): +""" +Returns true positive rate for each label (category). +""" +return self._call_java("truePositiveRateByLabel") + +@property +@since("2.3.0") +def falsePositiveRateByLabel(self): +""" +Returns false positive rate for each label (category). +""" +return self._call_java("falsePositiveRateByLabel") + +@property +@since("2.3.0") +def precisionByLabel(self): +""" +Returns precision for each label (category). +""" +return self._call_java("precisionByLabel") + +@property +@since("2.3.0") +def recallByLabel(self): +""" +Returns recall for each label (category). +""" +return self._call_java("recallByLabel") + +@property --- End diff -- Remove this annotation. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19185: [Spark-21854] Added LogisticRegressionTrainingSum...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/19185#discussion_r138047555 --- Diff: python/pyspark/ml/tests.py --- @@ -1478,6 +1478,40 @@ def test_logistic_regression_summary(self): sameSummary = model.evaluate(df) self.assertAlmostEqual(sameSummary.areaUnderROC, s.areaUnderROC) +def test_multiclass_logistic_regression_summary(self): +df = self.spark.createDataFrame([(1.0, 2.0, Vectors.dense(1.0)), + (0.0, 2.0, Vectors.sparse(1, [], [])), + (2.0, 2.0, Vectors.dense(2.0)), + (2.0, 2.0, Vectors.dense(1.9))], +["label", "weight", "features"]) +lr = LogisticRegression(maxIter=5, regParam=0.01, weightCol="weight", fitIntercept=False) +model = lr.fit(df) +self.assertTrue(model.hasSummary) +s = model.summary +# test that api is callable and returns expected types +self.assertTrue(isinstance(s.predictions, DataFrame)) +self.assertEqual(s.probabilityCol, "probability") +self.assertEqual(s.labelCol, "label") +self.assertEqual(s.featuresCol, "features") +self.assertEqual(s.predictionCol, "prediction") +objHist = s.objectiveHistory +self.assertTrue(isinstance(objHist, list) and isinstance(objHist[0], float)) +self.assertGreater(s.totalIterations, 0) +self.assertTrue(isinstance(s.labels, list)) +self.assertTrue(isinstance(s.truePositiveRateByLabel, list)) +self.assertTrue(isinstance(s.falsePositiveRateByLabel, list)) +self.assertTrue(isinstance(s.precisionByLabel, list)) +self.assertTrue(isinstance(s.recallByLabel, list)) +self.assertTrue(isinstance(s.fMeasureByLabel, list)) +self.assertAlmostEqual(s.accuracy, 0.75, 2) +self.assertAlmostEqual(s.weightedTruePositiveRate, 0.75, 2) +self.assertAlmostEqual(s.weightedFalsePositiveRate, 0.25, 2) +self.assertAlmostEqual(s.weightedRecall, 0.75, 2) +self.assertAlmostEqual(s.weightedPrecision, 0.583, 2) +self.assertAlmostEqual(s.weightedFMeasure, 0.65, 2) +# test evaluation (with training dataset) produces a summary with same values +# one check is enough to verify a summary is returned, Scala version runs full test --- End diff -- Please add test for evaluation like: ``` sameSummary = model.evaluate(df) self.assertAlmostEqual(sameSummary.accuracy, s.accuracy) ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19185: [Spark-21854] Added LogisticRegressionTrainingSum...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/19185#discussion_r138045280 --- Diff: python/pyspark/ml/classification.py --- @@ -529,8 +529,11 @@ def summary(self): """ if self.hasSummary: java_blrt_summary = self._call_java("summary") -# Note: Once multiclass is added, update this to return correct summary -return BinaryLogisticRegressionTrainingSummary(java_blrt_summary) +if (self.numClasses == 2): +java_blrt_binarysummary = self._call_java("binarySummary") --- End diff -- Actually this is not necessary, we can just wrap ```java_lrt_summary``` with ```BinaryLogisticRegressionTrainingSummary```. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19185: [Spark-21854] Added LogisticRegressionTrainingSum...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/19185#discussion_r138045342 --- Diff: python/pyspark/ml/classification.py --- @@ -529,8 +529,11 @@ def summary(self): """ if self.hasSummary: java_blrt_summary = self._call_java("summary") -# Note: Once multiclass is added, update this to return correct summary -return BinaryLogisticRegressionTrainingSummary(java_blrt_summary) +if (self.numClasses == 2): --- End diff -- ```if (self.numClasses <= 2)``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19185: [Spark-21854] Added LogisticRegressionTrainingSum...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/19185#discussion_r138045070 --- Diff: python/pyspark/ml/classification.py --- @@ -529,8 +529,11 @@ def summary(self): """ if self.hasSummary: java_blrt_summary = self._call_java("summary") --- End diff -- Rename this to ```java_lrt_summary```, as it's not always _binary_ logistic regression. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18875#discussion_r138078894 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala --- @@ -612,6 +612,54 @@ class JsonExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper { ) } + test("SPARK-21513: to_json support map[string, struct] to json") { +val schema = MapType(StringType, StructType(StructField("a", IntegerType) :: Nil)) +val input = Literal.create(ArrayBasedMapData(Map("test" -> InternalRow(1))), schema) +checkEvaluation( + StructsToJson(Map.empty, input), + """{"test":{"a":1}}""" +) + } + + test("SPARK-21513: to_json support map[struct, struct] to json") { +val schema = MapType(StructType(StructField("a", IntegerType) :: Nil), + StructType(StructField("b", IntegerType) :: Nil)) +val input = Literal.create(ArrayBasedMapData(Map(InternalRow(1) -> InternalRow(2))), schema) +checkEvaluation( + StructsToJson(Map.empty, input), + """{"[1]":{"b":2}}""" +) + } + + test("SPARK-21513: to_json support map[string, integer] to json") { +val schema = MapType(StringType, IntegerType) +val input = Literal.create(ArrayBasedMapData(Map("a" -> 1)), schema) +checkEvaluation( + StructsToJson(Map.empty, input), + """{"a":1}""" +) + } + + test("to_json - array with maps") { +val inputSchema = ArrayType(MapType(StringType, IntegerType)) +val input = new GenericArrayData(ArrayBasedMapData( + Map("a" -> 1)) :: ArrayBasedMapData(Map("b" -> 2)) :: Nil) +val output = """[{"a":1},{"b":2}]""" +checkEvaluation( + StructsToJson(Map.empty, Literal.create(input, inputSchema), gmtId), + output) + } + + test("to_json - array with single map") { +val inputSchema = ArrayType(MapType(StringType, IntegerType)) +val input = new GenericArrayData(ArrayBasedMapData( + Map("a" -> 1)) :: Nil) --- End diff -- Let's make this inlined. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18875#discussion_r138079126 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonGenerator.scala --- @@ -193,14 +226,26 @@ private[sql] class JacksonGenerator( * * @param row The row to convert */ - def write(row: InternalRow): Unit = writeObject(writeFields(row, schema, rootFieldWriters)) + def write(row: InternalRow): Unit = { +writeObject(writeFields(row, dataType.asInstanceOf[StructType], rootFieldWriters)) + } + /** - * Transforms multiple `InternalRow`s to JSON array using Jackson + * Transforms multiple `InternalRow`s or `MapData`s to JSON array using Jackson * - * @param array The array of rows to convert + * @param array The array of rows or maps to convert */ def write(array: ArrayData): Unit = writeArray(writeArrayData(array, arrElementWriter)) + /** + * Transforms a `MapData` to JSON object using Jackson --- End diff -- Not a big deal but `` a `MapData` `` -> `` a single `MapData` `` just for consistency with ` write(row: InternalRow)`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19182: [SPARK-21970][Core] Fix Redundant Throws Declarations in...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19182 **[Test build #3916 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3916/testReport)** for PR 19182 at commit [`b69a20b`](https://github.com/apache/spark/commit/b69a20bf5a1c52dc69e3bbc61c804d3846c72137). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19188: [SPARK-21973][SQL] Add an new option to filter queries i...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19188 **[Test build #81637 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81637/testReport)** for PR 19188 at commit [`322c335`](https://github.com/apache/spark/commit/322c335a027edc8c6705b72fd76f00e6933a987f). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18853: [SPARK-21646][SQL] CommonType for binary comparison
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18853 **[Test build #81638 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81638/testReport)** for PR 18853 at commit [`3bec6a2`](https://github.com/apache/spark/commit/3bec6a22565c58ad2da18d66e1e285d644f3577a). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18538: [SPARK-14516][ML] Adding ClusteringEvaluator with the im...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18538 **[Test build #81639 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81639/testReport)** for PR 18538 at commit [`b0b7853`](https://github.com/apache/spark/commit/b0b7853d68c1c79bd49d6e290d3c96fe9e3af6ea). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19190: [SPARK-21976][DOC]
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19190 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18875#discussion_r138058970 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonGenerator.scala --- @@ -26,20 +26,53 @@ import org.apache.spark.sql.catalyst.expressions.SpecializedGetters import org.apache.spark.sql.catalyst.util.{ArrayData, DateTimeUtils, MapData} import org.apache.spark.sql.types._ +/** + * `JackGenerator` can only be initialized with a `StructType` or a `MapType`. + * Once it is initialized with `StructType`, it can be used to write out a struct or an array of + * struct. Once it is initialized with ``MapType``, it can be used to write out a map or an array + * of map. An exception will be thrown if trying to write out a struct if it is initialized with + * a `MapType`, and vice verse. + */ private[sql] class JacksonGenerator( -schema: StructType, +dataType: DataType, writer: Writer, options: JSONOptions) { // A `ValueWriter` is responsible for writing a field of an `InternalRow` to appropriate // JSON data. Here we are using `SpecializedGetters` rather than `InternalRow` so that // we can directly access data in `ArrayData` without the help of `SpecificMutableRow`. private type ValueWriter = (SpecializedGetters, Int) => Unit + // `JackGenerator` can only be initialized with a `StructType` or a `MapType`. + dataType match { +case _: StructType | _: MapType => +case _ => throw new UnsupportedOperationException( + s"`JacksonGenerator` only supports to be initialized with a `StructType` " + --- End diff -- This `s` could be removed too. Let's avoid backquoting classes in the error messages. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18875#discussion_r138073822 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/json/JacksonGeneratorSuite.scala --- @@ -0,0 +1,133 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.json + +import java.io.CharArrayWriter + +import org.apache.spark.SparkFunSuite +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, DateTimeUtils, GenericArrayData} +import org.apache.spark.sql.types._ + +class JacksonGeneratorSuite extends SparkFunSuite { + + val gmtId = DateTimeUtils.TimeZoneGMT.getID + val option = new JSONOptions(Map.empty, gmtId) + + test("initial with StructType and write out a row") { +val dataType = StructType(StructField("a", IntegerType) :: Nil) +val input = InternalRow(1) +val writer = new CharArrayWriter() +val gen = new JacksonGenerator(dataType, writer, option) +gen.write(input) +gen.flush() +assert(writer.toString === """{"a":1}""") + } + + test("initial with StructType and write out rows") { +val dataType = StructType(StructField("a", IntegerType) :: Nil) +val input = new GenericArrayData(InternalRow(1) :: InternalRow(2) :: Nil) +val writer = new CharArrayWriter() +val gen = new JacksonGenerator(dataType, writer, option) +gen.write(input) +gen.flush() +assert(writer.toString === """[{"a":1},{"a":2}]""") + } + + test("initial with StructType and write out an array with single empty row") { +val dataType = StructType(StructField("a", IntegerType) :: Nil) +val input = new GenericArrayData(InternalRow(null) :: Nil) +val writer = new CharArrayWriter() +val gen = new JacksonGenerator(dataType, writer, option) +gen.write(input) +gen.flush() +assert(writer.toString === """[{}]""") + } + + test("initial with StructType and write out an empty array") { +val dataType = StructType(StructField("a", IntegerType) :: Nil) +val input = new GenericArrayData(Nil) +val writer = new CharArrayWriter() +val gen = new JacksonGenerator(dataType, writer, option) +gen.write(input) +gen.flush() +assert(writer.toString === """[]""") + } + + test("initial with Map and write out a map data") { +val dataType = MapType(StringType, IntegerType) +val input = ArrayBasedMapData(Map("a" -> 1)) +val writer = new CharArrayWriter() +val gen = new JacksonGenerator(dataType, writer, option) +gen.write(input) +gen.flush() +assert(writer.toString === """{"a":1}""") + } + + test("initial with Map and write out an array of maps") { +val dataType = MapType(StringType, IntegerType) +val input = new GenericArrayData( + ArrayBasedMapData(Map("a" -> 1)) :: ArrayBasedMapData(Map("b" -> 2)) :: Nil) +val writer = new CharArrayWriter() +val gen = new JacksonGenerator(dataType, writer, option) +gen.write(input) +gen.flush() +assert(writer.toString === """[{"a":1},{"b":2}]""") + } + + test("error handling: initial with StructType but error calling write a map") { +val dataType = StructType(StructField("a", IntegerType) :: Nil) +val input = ArrayBasedMapData(Map("a" -> 1)) +val writer = new CharArrayWriter() +val gen = new JacksonGenerator(dataType, writer, option) +intercept[Exception] { + gen.write(input) +} + } + + test("error handling: initial with StructType but error calling write an array of maps") { +val dataType = StructType(StructField("a", IntegerType) :: Nil) +val input = new GenericArrayData( + ArrayBasedMapData(Map("a" -> 1)) :: ArrayBasedMapData(Map("b" -> 2)) :: Nil) +val writer = new
[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18875#discussion_r138077808 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala --- @@ -180,10 +180,30 @@ class JsonFunctionsSuite extends QueryTest with SharedSQLContext { test("to_json - array") { val df = Seq(Tuple1(Tuple1(1) :: Nil)).toDF("a") +val df2 = Seq(Tuple1(Map("a" -> 1) :: Nil)).toDF("a") checkAnswer( df.select(to_json($"a")), Row("""[{"_1":1}]""") :: Nil) +checkAnswer( + df2.select(to_json($"a")), + Row("""[{"a":1}]""") :: Nil) + } + + test("to_json - map") { +val df1 = Seq(Map("a" -> Tuple1(1))).toDF("a") +val df2 = Seq(Map(Tuple1(1) -> Tuple1(1))).toDF("a") +val df3 = Seq(Map("a" -> 1)).toDF("a") + +checkAnswer( + df1.select(to_json($"a")), + Row("""{"a":{"_1":1}}""") :: Nil) +checkAnswer( + df2.select(to_json($"a")), + Row("""{"[0,1]":{"_1":1}}""") :: Nil) --- End diff -- This case looks rather a string representation from `UnsafeRow` and a bug. Let's remove this test case for now and fix it later together in another PR. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18875#discussion_r138057979 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala --- @@ -677,14 +696,25 @@ case class StructsToJson( override def checkInputDataTypes(): TypeCheckResult = child.dataType match { case _: StructType | ArrayType(_: StructType, _) => try { -JacksonUtils.verifySchema(rowSchema) +JacksonUtils.verifySchema(rowSchema.asInstanceOf[StructType]) +TypeCheckResult.TypeCheckSuccess + } catch { +case e: UnsupportedOperationException => + TypeCheckResult.TypeCheckFailure(e.getMessage) + } +case _: MapType | ArrayType(_: MapType, _) => + // TODO: let `JacksonUtils.verifySchema` verify a `MapType` + try { +val st = StructType(StructField("a", rowSchema.asInstanceOf[MapType]) :: Nil) +JacksonUtils.verifySchema(st) TypeCheckResult.TypeCheckSuccess } catch { case e: UnsupportedOperationException => TypeCheckResult.TypeCheckFailure(e.getMessage) } case _ => TypeCheckResult.TypeCheckFailure( - s"Input type ${child.dataType.simpleString} must be a struct or array of structs.") + s"Input type ${child.dataType.simpleString} must be a struct, array of structs or " + + s"a map or array of map.") --- End diff -- Looks `s` could be removed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18875#discussion_r138058695 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonGenerator.scala --- @@ -26,20 +26,53 @@ import org.apache.spark.sql.catalyst.expressions.SpecializedGetters import org.apache.spark.sql.catalyst.util.{ArrayData, DateTimeUtils, MapData} import org.apache.spark.sql.types._ +/** + * `JackGenerator` can only be initialized with a `StructType` or a `MapType`. + * Once it is initialized with `StructType`, it can be used to write out a struct or an array of + * struct. Once it is initialized with ``MapType``, it can be used to write out a map or an array + * of map. An exception will be thrown if trying to write out a struct if it is initialized with + * a `MapType`, and vice verse. + */ private[sql] class JacksonGenerator( -schema: StructType, +dataType: DataType, writer: Writer, options: JSONOptions) { // A `ValueWriter` is responsible for writing a field of an `InternalRow` to appropriate // JSON data. Here we are using `SpecializedGetters` rather than `InternalRow` so that // we can directly access data in `ArrayData` without the help of `SpecificMutableRow`. private type ValueWriter = (SpecializedGetters, Int) => Unit + // `JackGenerator` can only be initialized with a `StructType` or a `MapType`. + dataType match { --- End diff -- I'd just do a `require(... .isInstanceOf... || ..., "...")` instead. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18875#discussion_r138058166 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonGenerator.scala --- @@ -26,20 +26,53 @@ import org.apache.spark.sql.catalyst.expressions.SpecializedGetters import org.apache.spark.sql.catalyst.util.{ArrayData, DateTimeUtils, MapData} import org.apache.spark.sql.types._ +/** + * `JackGenerator` can only be initialized with a `StructType` or a `MapType`. + * Once it is initialized with `StructType`, it can be used to write out a struct or an array of + * struct. Once it is initialized with ``MapType``, it can be used to write out a map or an array --- End diff -- nit: ``` ``MapType`` ``` -> `` `MapType` `` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19147: [WIP][SPARK-21190][SQL][PYTHON] Vectorized UDFs in Pytho...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19147 **[Test build #81635 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81635/testReport)** for PR 19147 at commit [`803054e`](https://github.com/apache/spark/commit/803054e9f30b057e1b194b5625cb6216d865f1d4). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15178: [SPARK-17556][SQL] Executor side broadcast for broadcast...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15178 **[Test build #81636 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81636/testReport)** for PR 15178 at commit [`93ecabb`](https://github.com/apache/spark/commit/93ecabb7c7c9b2fa42a321e28f834568cadf0272). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19147: [WIP][SPARK-21190][SQL][PYTHON] Vectorized UDFs in Pytho...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19147 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81635/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15178: [SPARK-17556][SQL] Executor side broadcast for broadcast...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15178 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81636/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15178: [SPARK-17556][SQL] Executor side broadcast for broadcast...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15178 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19190: [SPARK-21976][DOC]
GitHub user FavioVazquez opened a pull request: https://github.com/apache/spark/pull/19190 [SPARK-21976][DOC] ## What changes were proposed in this pull request? Fixed wrong documentation for Mean Absolute Error. Even though the code is correct for the MAE: ```scala @Since("1.2.0") def meanAbsoluteError: Double = { summary.normL1(1) / summary.count } ``` In the documentation the division by N is missing. ## How was this patch tested? All of spark tests were run. Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/FavioVazquez/spark mae-fix Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19190.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19190 commit b2e2f8caef4f43d62cb48fc09a0fe17e71f3f4dd Author: FavioVazquezDate: 2015-04-30T19:46:40Z Merge remote-tracking branch 'upstream/master' commit edab1ef07312cf44e26fccb7fca8f4a4977ad3ee Author: FavioVazquez Date: 2015-05-05T14:16:02Z Merge remote-tracking branch 'upstream/master' commit 9af7074235b6c13001924e037772195b640115b8 Author: FavioVazquez Date: 2015-05-15T13:58:04Z Merge remote-tracking branch 'upstream/master' commit f27a20b9f53b643d5c963729f56ee548d0c8e263 Author: FavioVazquez Date: 2015-06-04T16:10:00Z Merge remote-tracking branch 'upstream/master' commit ad882a378e24458832b961fd97eb4b7662203ef9 Author: FavioVazquez Date: 2015-06-09T12:47:59Z Merge remote-tracking branch 'upstream/master' commit 424a92853f44c34f310e0a9e8dd927d246bd9171 Author: FavioVazquez Date: 2015-06-17T14:56:35Z Merge remote-tracking branch 'upstream/master' commit 5311719db61454eee5e1715f44507c294509ec1c Author: FavioVazquez Date: 2015-09-21T06:43:29Z Merge remote-tracking branch 'upstream/master' commit d6b551b1d25cef07096b7e1fc22b659ed753d9dc Author: faviovazquez Date: 2017-09-11T13:36:27Z Merge remote-tracking branch 'upstream/master' commit a95cfc6c5e88f44429319b52462b537ae0bc1857 Author: Favio André Vázquez Date: 2017-09-11T13:49:03Z Fix doc for MAE --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19182: [SPARK-21970][Core] Fix Redundant Throws Declarations in...
Github user original-brownbear commented on the issue: https://github.com/apache/spark/pull/19182 @srowen makes perfect sense => rolled back all changes to tests + publicly exposed methods (those package private ones adjusted are on non-public classes). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19147: [WIP][SPARK-21190][SQL][PYTHON] Vectorized UDFs in Pytho...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19147 **[Test build #81621 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81621/testReport)** for PR 19147 at commit [`dbc6dd2`](https://github.com/apache/spark/commit/dbc6dd2138b427d8436eb0d8bdc4ba134f254e35). * This patch **fails due to an unknown error code, -9**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18875: [SPARK-21513][SQL] Allow UDF to_json support converting ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18875 **[Test build #81624 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81624/testReport)** for PR 18875 at commit [`f567679`](https://github.com/apache/spark/commit/f567679ec724052266587e35cdf25868587de750). * This patch **fails due to an unknown error code, -9**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19186: [SPARK-21972][ML] Add param handlePersistence
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19186 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18875: [SPARK-21513][SQL] Allow UDF to_json support converting ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18875 **[Test build #81620 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81620/testReport)** for PR 18875 at commit [`b8ad442`](https://github.com/apache/spark/commit/b8ad442437390611c63d8b0705d02c3ca3ba3c9c). * This patch **fails due to an unknown error code, -9**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19186: [SPARK-21972][ML] Add param handlePersistence
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19186 **[Test build #81623 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81623/testReport)** for PR 19186 at commit [`f8fa957`](https://github.com/apache/spark/commit/f8fa9573a1b40ff236e9c52cf429e2742c8f2bd0). * This patch **fails due to an unknown error code, -9**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18875: [SPARK-21513][SQL] Allow UDF to_json support converting ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18875 **[Test build #81625 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81625/testReport)** for PR 18875 at commit [`b71b56a`](https://github.com/apache/spark/commit/b71b56ab49ba048f397459a18f946422d9440b29). * This patch **fails due to an unknown error code, -9**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18875: [SPARK-21513][SQL] Allow UDF to_json support converting ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18875 **[Test build #81630 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81630/testReport)** for PR 18875 at commit [`b71b56a`](https://github.com/apache/spark/commit/b71b56ab49ba048f397459a18f946422d9440b29). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19187: Branch 2.1
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19187 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19147: [WIP][SPARK-21190][SQL][PYTHON] Vectorized UDFs in Pytho...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19147 **[Test build #81629 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81629/testReport)** for PR 19147 at commit [`dbc6dd2`](https://github.com/apache/spark/commit/dbc6dd2138b427d8436eb0d8bdc4ba134f254e35). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19184: [SPARK-21971][CORE] Too many open files in Spark due to ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19184 **[Test build #81628 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81628/testReport)** for PR 19184 at commit [`ea5f9d9`](https://github.com/apache/spark/commit/ea5f9d9903185690670da6384dbd6d2b08b8177f). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19184: [SPARK-21971][CORE] Too many open files in Spark due to ...
Github user rajeshbalamohan commented on the issue: https://github.com/apache/spark/pull/19184 Thanks @viirya . I have updated the patch to address your comments. This fixes the "too many files open" issue for (e.g Q67, Q72, Q14 etc) which involves window functions; but for the merger the issue needs to be addressed still. Agreed that this would be partial patch. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/18875#discussion_r137986264 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala --- @@ -677,14 +696,25 @@ case class StructsToJson( override def checkInputDataTypes(): TypeCheckResult = child.dataType match { case _: StructType | ArrayType(_: StructType, _) => try { -JacksonUtils.verifySchema(rowSchema) +JacksonUtils.verifySchema(rowSchema.asInstanceOf[StructType]) +TypeCheckResult.TypeCheckSuccess + } catch { +case e: UnsupportedOperationException => + TypeCheckResult.TypeCheckFailure(e.getMessage) + } +case _: MapType | ArrayType(_: MapType, _) => + // TODO: let `JacksonUtils.verifySchema` verify a `MapType` + try { +val st = StructType(StructField("a", rowSchema.asInstanceOf[MapType]) :: Nil) +JacksonUtils.verifySchema(st) TypeCheckResult.TypeCheckSuccess } catch { case e: UnsupportedOperationException => TypeCheckResult.TypeCheckFailure(e.getMessage) } case _ => TypeCheckResult.TypeCheckFailure( - s"Input type ${child.dataType.simpleString} must be a struct or array of structs.") + s"Input type ${child.dataType.simpleString} must be a struct, array of structs or " + + s"an arbitrary map.") --- End diff -- `an arbitrary map` -> `a map or array of map.` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18875: [SPARK-21513][SQL] Allow UDF to_json support conv...
Github user goldmedal commented on a diff in the pull request: https://github.com/apache/spark/pull/18875#discussion_r137987843 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonGenerator.scala --- @@ -193,14 +228,32 @@ private[sql] class JacksonGenerator( * * @param row The row to convert */ - def write(row: InternalRow): Unit = writeObject(writeFields(row, schema, rootFieldWriters)) + def write(row: InternalRow): Unit = dataType match { +case st: StructType => + writeObject(writeFields(row, st, rootFieldWriters)) +case _ => throw new UnsupportedOperationException( --- End diff -- It will throw `ClassCastException`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16677: [SPARK-19355][SQL] Use map output statistics to improve ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16677 retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18875: [SPARK-21513][SQL] Allow UDF to_json support converting ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18875 **[Test build #81625 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81625/testReport)** for PR 18875 at commit [`b71b56a`](https://github.com/apache/spark/commit/b71b56ab49ba048f397459a18f946422d9440b29). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #18875: [SPARK-21513][SQL] Allow UDF to_json support converting ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18875 **[Test build #81631 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/81631/testReport)** for PR 18875 at commit [`be62e51`](https://github.com/apache/spark/commit/be62e51109c8a438fed8d787fb2986aeb1974f82). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19188: [SPARK-21973][SQL] Add an new option to filter queries i...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19188 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/81633/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org