[GitHub] spark pull request #20204: [SPARK-7721][PYTHON][TESTS] Adds PySpark coverage...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/20204#discussion_r161679707

--- Diff: python/run-tests-with-coverage ---
@@ -0,0 +1,69 @@
+#!/usr/bin/env bash
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+set -o pipefail
+set -e
+
+# This variable indicates which coverage executable to run to combine coverages
+# and generate HTMLs, for example, 'coverage3' in Python 3.
+COV_EXEC="${COV_EXEC:-coverage}"
+FWDIR="$(cd "`dirname $0`"; pwd)"
+pushd "$FWDIR" > /dev/null
--- End diff --

I see, no problem at all. I just wanted to confirm. Thanks!
[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20153

**[Test build #86160 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86160/testReport)** for PR 20153 at commit [`d666110`](https://github.com/apache/spark/commit/d6661104f314c88ff84057fd4830e7a5fbe964d9).
[GitHub] spark pull request #17280: [SPARK-19939] [ML] Add support for association ru...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/17280#discussion_r161679593

--- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala ---
@@ -319,9 +323,11 @@ object FPGrowthModel extends MLReadable[FPGrowthModel] {

     override def load(path: String): FPGrowthModel = {
       val metadata = DefaultParamsReader.loadMetadata(path, sc, className)
+      implicit val format = DefaultFormats
+      val numTrainingRecords = (metadata.metadata \ "numTrainingRecords").extract[Long]
--- End diff --

Does this break backward compatibility for loading?
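For context on the compatibility question: json4s' `extract[Long]` throws when the field is absent, which is exactly what happens with metadata written before `numTrainingRecords` existed. A minimal sketch of the usual guard, with an illustrative JSON literal and an assumed default of `0L` (not necessarily the fix the PR chose):

```scala
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

implicit val format: Formats = DefaultFormats

// Metadata written by an older release, before "numTrainingRecords" existed.
val oldMetadata = parse("""{"class":"org.apache.spark.ml.fpm.FPGrowthModel"}""")

// `extract[Long]` throws a MappingException on the old JSON; `extractOpt`
// returns None instead, so a default keeps loading backward compatible.
val numTrainingRecords: Long =
  (oldMetadata \ "numTrainingRecords").extractOpt[Long].getOrElse(0L)

println(numTrainingRecords)  // 0
```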
[GitHub] spark issue #20153: [SPARK-22392][SQL] data source v2 columnar batch reader
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20153

retest this please
[GitHub] spark pull request #20204: [SPARK-7721][PYTHON][TESTS] Adds PySpark coverage...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/20204#discussion_r161677975

--- Diff: python/run-tests-with-coverage ---
[...]
+pushd "$FWDIR" > /dev/null
--- End diff --

my 2c: I think it's ok, and I'd prefer it; it might be useful in the future when more `cd`s are added.
[GitHub] spark issue #20232: [SPARK-23042][ML] Use OneHotEncoderModel to encode label...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20232

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86154/ Test PASSed.
[GitHub] spark issue #20232: [SPARK-23042][ML] Use OneHotEncoderModel to encode label...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20232

Merged build finished. Test PASSed.
[GitHub] spark issue #20232: [SPARK-23042][ML] Use OneHotEncoderModel to encode label...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20232

**[Test build #86154 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86154/testReport)** for PR 20232 at commit [`20bbf64`](https://github.com/apache/spark/commit/20bbf64e0ce99538f80e5b6f360a69de93f4d9fc).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `class OneHotEncoderEstimator(JavaEstimator, HasInputCols, HasOutputCols, HasHandleInvalid,`
   * `class OneHotEncoderModel(JavaModel, JavaMLReadable, JavaMLWritable):`
   * `class HasOutputCols(Params):`
   * `sealed trait Distribution`
   * `case class HashClusteredDistribution(expressions: Seq[Expression]) extends Distribution`
   * `case class BroadcastDistribution(mode: BroadcastMode) extends Distribution`
   * `case class UnknownPartitioning(numPartitions: Int) extends Partitioning`
   * `case class RoundRobinPartitioning(numPartitions: Int) extends Partitioning`
   * `public class OrcColumnVector extends org.apache.spark.sql.vectorized.ColumnVector`
   * `public class OrcColumnarBatchReader extends RecordReader`
   * `class StreamingDataSourceV2Relation(`
   * `case class RateStreamPartitionOffset(`
   * `class RateStreamContinuousReader(options: DataSourceV2Options)`
   * `case class RateStreamContinuousReadTask(`
   * `class RateStreamContinuousDataReader(`
   * `class RateSourceProviderV2 extends DataSourceV2 with MicroBatchReadSupport with DataSourceRegister`
   * `class RateStreamMicroBatchReader(options: DataSourceV2Options)`
[GitHub] spark pull request #20163: [SPARK-22966][PYTHON][SQL] Python UDFs with retur...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/20163#discussion_r161677327

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala ---
@@ -144,6 +145,7 @@ object EvaluatePython {
     }

     case StringType => (obj: Any) => nullSafeConvert(obj) {
+      case _: Calendar => null
       case _ => UTF8String.fromString(obj.toString)
--- End diff --

btw, the array case seems a bit weird?
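For readers outside the Spark codebase, a self-contained sketch of the null-safe conversion pattern the diff relies on. `nullSafeConvert` here is a simplified stand-in for Spark's private helper of the same name, and plain `toString` stands in for `UTF8String.fromString`:

```scala
import java.util.Calendar

// Nulls pass through, a partial function handles known types, and anything
// the partial function rejects becomes null rather than a bogus string.
def nullSafeConvert(input: Any)(f: PartialFunction[Any, Any]): Any =
  if (input == null) null else f.applyOrElse(input, (_: Any) => null)

val toStringValue: Any => Any = obj => nullSafeConvert(obj) {
  case _: Calendar => null       // reject Calendar instead of calling obj.toString
  case other => other.toString   // stand-in for UTF8String.fromString
}

println(toStringValue("abc"))                   // abc
println(toStringValue(Calendar.getInstance()))  // null
println(toStringValue(null))                    // null
```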
[GitHub] spark issue #20267: [SPARK-23068][BUILD][RELEASE] doc build error from jekyl...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20267

I see. Thanks for the info! So, is it ready to go anyway, @felixcheung?
[GitHub] spark issue #20265: [SPARK-21783][SQL] Turn on ORC filter push-down by defau...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20265

I'll update the PR tomorrow.
[GitHub] spark issue #20275: [SPARK-23085][ML] API parity for mllib.linalg.Vectors.sp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20275

Merged build finished. Test PASSed.
[GitHub] spark issue #20275: [SPARK-23085][ML] API parity for mllib.linalg.Vectors.sp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20275

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86158/ Test PASSed.
[GitHub] spark issue #20275: [SPARK-23085][ML] API parity for mllib.linalg.Vectors.sp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20275

**[Test build #86158 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86158/testReport)** for PR 20275 at commit [`1a3cd3a`](https://github.com/apache/spark/commit/1a3cd3aab355a00a73993979896624a8684a9aad).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #20168: [SPARK-22730][ML] Add ImageSchema support for non-intege...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20168

Overall looks good to me. Just some minor comments regarding code comments and naming.
[GitHub] spark issue #20168: [SPARK-22730][ML] Add ImageSchema support for non-intege...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20168

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86156/ Test PASSed.
[GitHub] spark issue #20168: [SPARK-22730][ML] Add ImageSchema support for non-intege...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20168

Merged build finished. Test PASSed.
[GitHub] spark issue #20168: [SPARK-22730][ML] Add ImageSchema support for non-intege...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20168

**[Test build #86156 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86156/testReport)** for PR 20168 at commit [`896ccc2`](https://github.com/apache/spark/commit/896ccc21582f1610e38dc91a67eca90c8914e2e5).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request #20265: [SPARK-21783][SQL] Turn on ORC filter push-down b...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20265#discussion_r161679707

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala ---
@@ -0,0 +1,195 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import java.io.File
+
+import scala.util.{Random, Try}
+
+import org.apache.spark.SparkConf
+import org.apache.spark.sql.functions.monotonically_increasing_id
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.util.{Benchmark, Utils}
+
+
+/**
+ * Benchmark to measure read performance with Filter pushdown.
+ */
+// scalastyle:off line.size.limit
+object FilterPushdownBenchmark {
+  val conf = new SparkConf()
+  conf.set("orc.compression", "snappy")
+  conf.set("spark.sql.parquet.compression.codec", "snappy")
+
+  private val spark = SparkSession.builder()
+    .master("local[1]")
+    .appName("FilterPushdownBenchmark")
+    .config(conf)
+    .getOrCreate()
+
+  // Set default configs. Individual cases will change them if necessary.
+  spark.conf.set(SQLConf.ORC_FILTER_PUSHDOWN_ENABLED.key, "true")
+  spark.conf.set(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key, "true")
+
+  def withTempPath(f: File => Unit): Unit = {
+    val path = Utils.createTempDir()
+    path.delete()
+    try f(path) finally Utils.deleteRecursively(path)
+  }
+
+  def withTempTable(tableNames: String*)(f: => Unit): Unit = {
+    try f finally tableNames.foreach(spark.catalog.dropTempView)
+  }
+
+  def withSQLConf(pairs: (String, String)*)(f: => Unit): Unit = {
+    val (keys, values) = pairs.unzip
+    val currentValues = keys.map(key => Try(spark.conf.get(key)).toOption)
+    (keys, values).zipped.foreach(spark.conf.set)
+    try f finally {
+      keys.zip(currentValues).foreach {
+        case (key, Some(value)) => spark.conf.set(key, value)
+        case (key, None) => spark.conf.unset(key)
+      }
+    }
+  }
+
+  private def prepareTable(dir: File, df: DataFrame): Unit = {
+    val dirORC = dir.getCanonicalPath + "/orc"
+    val dirParquet = dir.getCanonicalPath + "/parquet"
+
+    df.write.mode("overwrite").orc(dirORC)
+    df.write.mode("overwrite").parquet(dirParquet)
+
+    spark.read.orc(dirORC).createOrReplaceTempView("orcTable")
+    spark.read.parquet(dirParquet).createOrReplaceTempView("parquetTable")
+  }
+
+  def filterPushDownBenchmark(values: Int, width: Int, expr: String): Unit = {
+    val benchmark = new Benchmark(s"Filter Pushdown ($expr)", values)
+
+    withTempPath { dir =>
+      withTempTable("t1", "orcTable", "patquetTable") {
+        import spark.implicits._
+        val selectExpr = (1 to width).map(i => s"CAST(value AS STRING) c$i")
+        val df = spark.range(values).map(_ => Random.nextLong).selectExpr(selectExpr: _*)
+          .withColumn("id", monotonically_increasing_id())
+
+        df.createOrReplaceTempView("t1")
+        prepareTable(dir, spark.sql("SELECT * FROM t1"))
+
+        Seq(false, true).foreach { value =>
+          benchmark.addCase(s"Parquet Vectorized ${if (value) s"(Pushdown)" else ""}") { _ =>
+            withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> s"$value") {
+              spark.sql(s"SELECT * FROM parquetTable WHERE $expr").collect()
+            }
+          }
+        }
+
+        Seq(false, true).foreach { value =>
+          benchmark.addCase(s"Native ORC Vectorized ${if (value) s"(Pushdown)" else ""}") { _ =>
+            withSQLConf(SQLConf.ORC_FILTER_PUSHDOWN_ENABLED.key -> s"$value") {
+              spark.sql(s"SELECT * FROM orcTable WHERE $expr").collect()
+            }
+          }
+        }
+
+        // Positive cases: Select one or no rows
+        /*
+        Java HotSpot(TM)
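The `withSQLConf` helper in the quoted benchmark implements a save/override/restore pattern. A self-contained sketch of that pattern, with a plain mutable `Map` standing in for `spark.conf` so it runs without a SparkSession:

```scala
import scala.collection.mutable

// A stand-in for spark.conf; the keys and values here are illustrative.
val conf = mutable.Map("spark.sql.orc.filterPushdown" -> "false")

def withConf(pairs: (String, String)*)(f: => Unit): Unit = {
  // Remember each key's current value (None if the key was unset).
  val saved = pairs.map { case (k, _) => k -> conf.get(k) }
  pairs.foreach { case (k, v) => conf(k) = v }
  try f finally saved.foreach {
    case (k, Some(v)) => conf(k) = v // restore the previous value
    case (k, None)    => conf -= k   // the key was unset before; unset it again
  }
}

withConf("spark.sql.orc.filterPushdown" -> "true") {
  assert(conf("spark.sql.orc.filterPushdown") == "true") // override in effect
}
assert(conf("spark.sql.orc.filterPushdown") == "false")  // restored afterwards
```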
[GitHub] spark issue #20164: [SPARK-22971][ML] OneVsRestModel should use temporary Ra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20164

Merged build finished. Test PASSed.
[GitHub] spark issue #20164: [SPARK-22971][ML] OneVsRestModel should use temporary Ra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20164

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86157/ Test PASSed.
[GitHub] spark issue #20164: [SPARK-22971][ML] OneVsRestModel should use temporary Ra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20164

**[Test build #86157 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86157/testReport)** for PR 20164 at commit [`e44d764`](https://github.com/apache/spark/commit/e44d7647fe6596c70a28527f893bdbdcb373c190).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark pull request #20265: [SPARK-21783][SQL] Turn on ORC filter push-down b...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20265#discussion_r161672411

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala ---
[...]
[GitHub] spark pull request #20265: [SPARK-21783][SQL] Turn on ORC filter push-down b...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20265#discussion_r161672316

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala ---
[...]
[GitHub] spark pull request #20265: [SPARK-21783][SQL] Turn on ORC filter push-down b...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20265#discussion_r161671868

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala ---
[...]
+        Seq(false, true).foreach { value =>
--- End diff --

Done.
[GitHub] spark pull request #20265: [SPARK-21783][SQL] Turn on ORC filter push-down b...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20265#discussion_r161671835

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/FilterPushdownBenchmark.scala ---
[...]
[GitHub] spark pull request #20223: [SPARK-23020][core] Fix races in launcher code, t...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20223
[GitHub] spark issue #20223: [SPARK-23020][core] Fix races in launcher code, test.
Github user sameeragarwal commented on the issue: https://github.com/apache/spark/pull/20223

merging to master/2.3. Thanks!
[GitHub] spark issue #20266: [SPARK-23072][SQL][TEST] Add a Unicode schema test for f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20266

**[Test build #86159 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86159/testReport)** for PR 20266 at commit [`5afaa28`](https://github.com/apache/spark/commit/5afaa2836133cfc18a52de38d666817991d62c5d).
[GitHub] spark issue #20216: [SPARK-23024][WEB-UI]Spark ui about the contents of the ...
Github user ajbozarth commented on the issue: https://github.com/apache/spark/pull/20216

LGTM now
[GitHub] spark pull request #20266: [SPARK-23072][SQL][TEST] Add a Unicode schema tes...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20266#discussion_r161668457

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala ---
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import org.apache.spark.sql.test.SharedSQLContext
+
+class FileBasedDataSourceSuite extends QueryTest with SharedSQLContext {
+  import testImplicits._
+
+  Seq("orc", "parquet", "csv", "json", "text").foreach { format =>
+    test(s"Writing empty datasets should not fail - $format") {
+      withTempDir { dir =>
+        Seq("str").toDS.limit(0).write.format(format).save(dir.getCanonicalPath + "/tmp")
--- End diff --

Yep. It's fixed by using `withTempPath`.
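`withTempPath`, unlike `withTempDir`, hands the body a path that does not exist yet, so `DataFrameWriter.save` can create it; everything is removed afterwards. A minimal sketch of the idiom under that assumption (this is not Spark's exact implementation):

```scala
import java.io.File
import java.nio.file.Files

// Recursively remove whatever the body wrote.
def deleteRecursively(f: File): Unit = {
  if (f.isDirectory) f.listFiles.foreach(deleteRecursively)
  f.delete()
}

def withTempPath(f: File => Unit): Unit = {
  val path = Files.createTempDirectory("spark-test").toFile
  path.delete()                       // the body gets a fresh, non-existent path
  try f(path) finally deleteRecursively(path)
}

withTempPath { path =>
  // e.g. df.write.format("orc").save(path.getCanonicalPath)
  println(path.exists())              // false until the body creates it
}
```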
[GitHub] spark pull request #20266: [SPARK-23072][SQL][TEST] Add a Unicode schema tes...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20266#discussion_r161668286

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala ---
[...]
+  Seq("orc", "parquet", "csv", "json").foreach { format =>
+    test(s"Write and read back unicode schema - $format") {
+      withTempPath { path =>
+        val dir = path.getCanonicalPath
+
+        // scalastyle:off nonascii
+        val df = Seq("a").toDF("한글")
+        // scalastyle:on nonascii
+
+        df.write.format(format).option("header", "true").save(dir)
+        val answerDf = spark.read.format(format).option("header", "true").load(dir)
+
+        assert(df.schema === answerDf.schema)
+        checkAnswer(df, answerDf)
+      }
+    }
+  }
+
+  // Only New OrcFileFormat supports this
+  Seq(classOf[org.apache.spark.sql.execution.datasources.orc.OrcFileFormat].getCanonicalName,
--- End diff --

Yep.
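A runnable sketch of the round trip this test exercises, shown for Parquet only; the session settings are illustrative, and "한글" is simply a Korean-script column name:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("unicode-schema")
  .getOrCreate()
import spark.implicits._

// A one-column DataFrame whose column name is non-ASCII.
val df = Seq("a").toDF("한글")

val dir = java.nio.file.Files.createTempDirectory("unicode-schema").toFile
dir.delete()  // the writer will create this path

df.write.format("parquet").save(dir.getCanonicalPath)
val back = spark.read.format("parquet").load(dir.getCanonicalPath)

assert(df.schema == back.schema)  // the Unicode column name round-trips
spark.stop()
```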
[GitHub] spark issue #20216: [SPARK-23024][WEB-UI]Spark ui about the contents of the ...
Github user guoxiaolongzte commented on the issue: https://github.com/apache/spark/pull/20216

I agree with your second suggestion. Before, I did not understand what you meant; now that I have run the test, I understand.
1. In order for collapsible tables to persist on reload, each table must be added to the function at the bottom of web.js. When I refresh the page, a hidden table stays hidden and a displayed table stays displayed.
2. To ensure user-interface consistency.
@ajbozarth @srowen
[GitHub] spark issue #20275: [SPARK-23085][ML] API parity for mllib.linalg.Vectors.sp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20275

**[Test build #86158 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86158/testReport)** for PR 20275 at commit [`1a3cd3a`](https://github.com/apache/spark/commit/1a3cd3aab355a00a73993979896624a8684a9aad).
[GitHub] spark issue #20168: [SPARK-22730][ML] Add ImageSchema support for non-intege...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20168

Btw, I think this isn't only about adding non-integer image formats, so the PR title may need to change too. Something like "Add ImageSchema support for all OpenCV image types"?
[GitHub] spark pull request #20275: [SPARK-23085][ML] API parity for mllib.linalg.Vec...
GitHub user zhengruifeng opened a pull request: https://github.com/apache/spark/pull/20275

[SPARK-23085][ML] API parity for mllib.linalg.Vectors.sparse

## What changes were proposed in this pull request?
Make `ML.Vectors#sparse(size: Int, elements: Seq[(Int, Double)])` support zero-length vectors.

## How was this patch tested?
existing tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zhengruifeng/spark SparseVector_size

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20275.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20275

commit 8b2876e5c5059a1fb258bff53ae6667df80d3205
Author: Zheng RuiFeng
Date: 2018-01-16T01:46:47Z

    nit

commit 1a3cd3aab355a00a73993979896624a8684a9aad
Author: Zheng RuiFeng
Date: 2018-01-16T05:49:40Z

    update pr
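A short sketch of the proposed behavior, assuming the change is applied: `ml.linalg.Vectors.sparse` should accept an empty element list with a size of 0, matching the `mllib.linalg` parity the title claims:

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}

// With the proposed change, a zero-length sparse vector is legal input,
// mirroring mllib.linalg (previously ml.linalg rejected size == 0).
val empty: Vector = Vectors.sparse(0, Seq.empty[(Int, Double)])

println(empty.size)            // 0
println(empty.toArray.length)  // 0
```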
[GitHub] spark pull request #20168: [SPARK-22730][ML] Add ImageSchema support for non...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20168#discussion_r161665015

--- Diff: python/pyspark/ml/image.py ---
@@ -128,11 +183,17 @@ def toNDArray(self, image):
         height = image.height
         width = image.width
         nChannels = image.nChannels
+        ocvType = self.ocvTypeByMode(image.mode)
+        if nChannels != ocvType.nChannels:
+            raise ValueError(
+                "Image has %d channels but OcvType '%s' expects %d channels." %
--- End diff --

`Image has %d channels but its OcvType ...`
[GitHub] spark pull request #20168: [SPARK-22730][ML] Add ImageSchema support for non...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20168#discussion_r161665832

--- Diff: python/pyspark/ml/tests.py ---
@@ -1843,6 +1844,28 @@ def tearDown(self):

 class ImageReaderTest(SparkSessionTestCase):

+    def test_ocv_types(self):
+        ocvList = ImageSchema.ocvTypes
+        self.assertEqual("Undefined", ocvList[0].name)
+        self.assertEqual(-1, ocvList[0].mode)
+        self.assertEqual("N/A", ocvList[0].dataType)
+        for x in ocvList:
+            self.assertEqual(x, ImageSchema.ocvTypeByName(x.name))
+            self.assertEqual(x, ImageSchema.ocvTypeByMode(x.mode))
+
+    def test_conversions(self):
+        s = np.random.RandomState(seed=987)
+        ary_src = s.rand(4, 10, 10)
--- End diff --

ary_src -> array_src?
[GitHub] spark pull request #20168: [SPARK-22730][ML] Add ImageSchema support for non...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20168#discussion_r161666778

--- Diff: python/pyspark/ml/tests.py ---
[...]
+    def test_conversions(self):
+        s = np.random.RandomState(seed=987)
+        ary_src = s.rand(4, 10, 10)
--- End diff --

s.rand(4, 10, 10) -> s.rand(10, 10, 4)?
[GitHub] spark pull request #20168: [SPARK-22730][ML] Add ImageSchema support for non...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20168#discussion_r161664859

--- Diff: python/pyspark/ml/image.py ---
@@ -55,25 +72,66 @@ def imageSchema(self):
         """
         if self._imageSchema is None:
-            ctx = SparkContext._active_spark_context
+            ctx = SparkContext.getOrCreate()
             jschema = ctx._jvm.org.apache.spark.ml.image.ImageSchema.imageSchema()
             self._imageSchema = _parse_datatype_json_string(jschema.json())
         return self._imageSchema

     @property
     def ocvTypes(self):
         """
-        Returns the OpenCV type mapping supported.
+        Return the supported OpenCV types.

-        :return: a dictionary containing the OpenCV type mapping supported.
+        :return: a list containing the supported OpenCV types.

         .. versionadded:: 2.3.0
         """
         if self._ocvTypes is None:
-            ctx = SparkContext._active_spark_context
-            self._ocvTypes = dict(ctx._jvm.org.apache.spark.ml.image.ImageSchema.javaOcvTypes())
-        return self._ocvTypes
+            ctx = SparkContext.getOrCreate()
+            ocvTypeList = ctx._jvm.org.apache.spark.ml.image.ImageSchema.javaOcvTypes()
+            self._ocvTypes = [self._OcvType(name=x.name(),
+                                            mode=x.mode(),
+                                            nChannels=x.nChannels(),
+                                            dataType=x.dataType(),
+                                            nptype=self._ocvToNumpyMap[x.dataType()])
+                              for x in ocvTypeList]
+        return self._ocvTypes[:]
+
+    def ocvTypeByName(self, name):
+        """
+        Return the supported OpenCvType with matching name or raise error if there is no matching type.
+
+        :param: str name: OpenCv type name; must be equal to name of one of the supported types.
+        :return: OpenCvType with matching name.
+
+        """
+
+        if self._ocvTypesByName is None:
+            self._ocvTypesByName = {x.name: x for x in self.ocvTypes}
+        if name not in self._ocvTypesByName:
+            raise ValueError(
+                "Can not find matching OpenCvFormat for type = '%s'; supported formats are = %s" %
+                (name, str(self._ocvTypesByName.keys())))
+        return self._ocvTypesByName[name]
+
+    def ocvTypeByMode(self, mode):
+        """
+        Return the supported OpenCvType with matching mode or raise error if there is no matching type.
+
+        :param: int mode: OpenCv type mode; must be equal to mode of one of the supported types.
+        :return: OpenCvType with matching mode.
--- End diff --

`OpenCvType` -> `OcvType`?
[GitHub] spark pull request #20168: [SPARK-22730][ML] Add ImageSchema support for non...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20168#discussion_r161661795

--- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala ---
@@ -37,20 +37,67 @@ import org.apache.spark.sql.types._
 @Since("2.3.0")
 object ImageSchema {

-  val undefinedImageType = "Undefined"
+  /**
+   * OpenCv type representation
+   *
+   * @param mode ordinal for the type
+   * @param dataType open cv data type
--- End diff --

+1
[GitHub] spark pull request #20168: [SPARK-22730][ML] Add ImageSchema support for non...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20168#discussion_r161664852

--- Diff: python/pyspark/ml/image.py ---
[...]
+    def ocvTypeByMode(self, mode):
+        """
+        Return the supported OpenCvType with matching mode or raise error if there is no matching type.
--- End diff --

`OpenCvType` -> `OcvType`?
[GitHub] spark pull request #20168: [SPARK-22730][ML] Add ImageSchema support for non...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20168#discussion_r161664806

--- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala ---
[...]
+  def ocvTypeByMode(mode: Int): OpenCvType = {
--- End diff --

`getOcvTypeByMode` or `findOcvTypeByMode`?
[GitHub] spark pull request #20168: [SPARK-22730][ML] Add ImageSchema support for non...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20168#discussion_r161664786

--- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala ---
[...]
+  def ocvTypeByName(name: String): OpenCvType = {
--- End diff --

`getOcvTypeByName` or `findOcvTypeByName`?
[GitHub] spark pull request #20168: [SPARK-22730][ML] Add ImageSchema support for non...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20168#discussion_r161663005

--- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala ---
@@ -37,20 +37,67 @@ import org.apache.spark.sql.types._
 @Since("2.3.0")
 object ImageSchema {

-  val undefinedImageType = "Undefined"
+  /**
+   * OpenCv type representation
+   *
+   * @param mode ordinal for the type
+   * @param dataType open cv data type
+   * @param nChannels number of color channels
+   */
+  case class OpenCvType(mode: Int, dataType: String, nChannels: Int) {
+    def name: String = if (mode == -1) { "Undefined" } else { s"CV_$dataType" + s"C$nChannels" }
+    override def toString: String = s"OpenCvType(mode = $mode, name = $name)"
+  }

   /**
-   * (Scala-specific) OpenCV type mapping supported
+   * Return the supported OpenCvType with matching name or raise error if there is no matching type.
+   *
+   * @param name: name of existing OpenCvType
+   * @return OpenCvType that matches the given name
    */
-  val ocvTypes: Map[String, Int] = Map(
-    undefinedImageType -> -1,
-    "CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC3" -> 16, "CV_8UC4" -> 24
-  )
+  def ocvTypeByName(name: String): OpenCvType = {
+    ocvTypes.find(x => x.name == name).getOrElse(
+      throw new IllegalArgumentException("Unknown open cv type " + name))
+  }
+
+  /**
+   * Return the supported OpenCvType with matching mode or raise error if there is no matching type.
+   *
+   * @param mode: mode of existing OpenCvType
+   * @return OpenCvType that matches the given mode
+   */
+  def ocvTypeByMode(mode: Int): OpenCvType = {
+    ocvTypes.find(x => x.mode == mode).getOrElse(
+      throw new IllegalArgumentException("Unknown open cv mode " + mode))
+  }
+
+  val undefinedImageType = OpenCvType(-1, "N/A", -1)
+
+  /**
+   * A Mapping of Type to Numbers in OpenCV
+   *
+   *            C1   C2   C3   C4
+   * CV_8U       0    8   16   24
+   * CV_8S       1    9   17   25
+   * CV_16U      2   10   18   26
+   * CV_16S      3   11   19   27
+   * CV_32S      4   12   20   28
+   * CV_32F      5   13   21   29
+   * CV_64F      6   14   22   30
+   */
+  val ocvTypes: IndexedSeq[OpenCvType] = {
+    val types =
+      for (nc <- Array(1, 2, 3, 4);
--- End diff --

Add a brief header for row/column.
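The mode table in the quoted diff follows a simple formula, which a few lines of Scala can regenerate; the `depths` mapping below is read directly off the table's rows:

```scala
// OpenCV encodes a type's mode as depth + (nChannels - 1) * 8,
// where depth is the row index (CV_8U = 0 through CV_64F = 6).
val depths = Seq("8U" -> 0, "8S" -> 1, "16U" -> 2, "16S" -> 3,
                 "32S" -> 4, "32F" -> 5, "64F" -> 6)

for ((dataType, depth) <- depths; nChannels <- 1 to 4) {
  val mode = depth + (nChannels - 1) * 8
  println(s"CV_${dataType}C$nChannels -> mode $mode")  // e.g. CV_8UC3 -> mode 16
}
```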
[GitHub] spark pull request #20168: [SPARK-22730][ML] Add ImageSchema support for non...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20168#discussion_r161664731

--- Diff: python/pyspark/ml/image.py ---
[...]
+    def ocvTypeByMode(self, mode):
--- End diff --

`getOcvTypeByMode` or `findOcvTypeByMode`?
[GitHub] spark pull request #20168: [SPARK-22730][ML] Add ImageSchema support for non...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20168#discussion_r161665337 --- Diff: python/pyspark/ml/image.py --- @@ -55,25 +72,66 @@ def imageSchema(self): """ if self._imageSchema is None: -ctx = SparkContext._active_spark_context +ctx = SparkContext.getOrCreate() jschema = ctx._jvm.org.apache.spark.ml.image.ImageSchema.imageSchema() self._imageSchema = _parse_datatype_json_string(jschema.json()) return self._imageSchema @property def ocvTypes(self): """ -Returns the OpenCV type mapping supported. +Return the supported OpenCV types. -:return: a dictionary containing the OpenCV type mapping supported. +:return: a list containing the supported OpenCV types. .. versionadded:: 2.3.0 """ if self._ocvTypes is None: -ctx = SparkContext._active_spark_context -self._ocvTypes = dict(ctx._jvm.org.apache.spark.ml.image.ImageSchema.javaOcvTypes()) -return self._ocvTypes +ctx = SparkContext.getOrCreate() +ocvTypeList = ctx._jvm.org.apache.spark.ml.image.ImageSchema.javaOcvTypes() +self._ocvTypes = [self._OcvType(name=x.name(), +mode=x.mode(), +nChannels=x.nChannels(), +dataType=x.dataType(), + nptype=self._ocvToNumpyMap[x.dataType()]) + for x in ocvTypeList] +return self._ocvTypes[:] + + +def ocvTypeByName(self, name): --- End diff -- `getOcvTypeByName` or `findOcvTypeByName`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20168: [SPARK-22730][ML] Add ImageSchema support for non...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20168#discussion_r161664566 --- Diff: python/pyspark/ml/image.py --- @@ -55,25 +72,66 @@ def imageSchema(self): """ if self._imageSchema is None: -ctx = SparkContext._active_spark_context +ctx = SparkContext.getOrCreate() jschema = ctx._jvm.org.apache.spark.ml.image.ImageSchema.imageSchema() self._imageSchema = _parse_datatype_json_string(jschema.json()) return self._imageSchema @property def ocvTypes(self): """ -Returns the OpenCV type mapping supported. +Return the supported OpenCV types. -:return: a dictionary containing the OpenCV type mapping supported. +:return: a list containing the supported OpenCV types. .. versionadded:: 2.3.0 """ if self._ocvTypes is None: -ctx = SparkContext._active_spark_context -self._ocvTypes = dict(ctx._jvm.org.apache.spark.ml.image.ImageSchema.javaOcvTypes()) -return self._ocvTypes +ctx = SparkContext.getOrCreate() +ocvTypeList = ctx._jvm.org.apache.spark.ml.image.ImageSchema.javaOcvTypes() +self._ocvTypes = [self._OcvType(name=x.name(), +mode=x.mode(), +nChannels=x.nChannels(), +dataType=x.dataType(), + nptype=self._ocvToNumpyMap[x.dataType()]) + for x in ocvTypeList] +return self._ocvTypes[:] + + +def ocvTypeByName(self, name): +""" +Return the supported OpenCvType with matching name or raise error if there is no matching type. --- End diff -- `OpenCvType` -> `OcvType`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20168: [SPARK-22730][ML] Add ImageSchema support for non...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20168#discussion_r161662177 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala --- @@ -37,20 +37,67 @@ import org.apache.spark.sql.types._ @Since("2.3.0") object ImageSchema { - val undefinedImageType = "Undefined" + /** + * OpenCv type representation + * + * @param mode ordinal for the type + * @param dataType open cv data type + * @param nChannels number of color channels + */ + case class OpenCvType(mode: Int, dataType: String, nChannels: Int) { +def name: String = if (mode == -1) { "Undefined" } else { s"CV_$dataType" + s"C$nChannels" } +override def toString: String = s"OpenCvType(mode = $mode, name = $name)" + } /** - * (Scala-specific) OpenCV type mapping supported + * Return the supported OpenCvType with matching name or raise error if there is no matching type. + * + * @param name: name of existing OpenCvType + * @return OpenCvType that matches the given name */ - val ocvTypes: Map[String, Int] = Map( -undefinedImageType -> -1, -"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC3" -> 16, "CV_8UC4" -> 24 - ) + def ocvTypeByName(name: String): OpenCvType = { +ocvTypes.find(x => x.name == name).getOrElse( + throw new IllegalArgumentException("Unknown open cv type " + name)) + } + + /** + * Return the supported OpenCvType with matching mode or raise error if there is no matching type. + * + * @param mode: mode of existing OpenCvType + * @return OpenCvType that matches the given mode + */ + def ocvTypeByMode(mode: Int): OpenCvType = { +ocvTypes.find(x => x.mode == mode).getOrElse( + throw new IllegalArgumentException("Unknown open cv mode " + mode)) + } + + val undefinedImageType = OpenCvType(-1, "N/A", -1) + + /** + * A Mapping of Type to Numbers in OpenCV + * + *C1 C2 C3 C4 + * CV_8U 0 8 16 24 + * CV_8S 1 9 17 25 + * CV_16U 2 10 18 26 + * CV_16S 3 11 19 27 + * CV_32S 4 12 20 28 + * CV_32F 5 13 21 29 + * CV_64F 6 14 22 30 + */ + val ocvTypes: IndexedSeq[OpenCvType] = { +val types = + for (nc <- Array(1, 2, 3, 4); --- End diff -- `numChannel` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20168: [SPARK-22730][ML] Add ImageSchema support for non...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20168#discussion_r161664060 --- Diff: mllib/src/test/scala/org/apache/spark/ml/image/ImageSchemaSuite.scala --- @@ -83,7 +83,8 @@ class ImageSchemaSuite extends SparkFunSuite with MLlibTestSparkContext { val bytes20 = getData(row).slice(0, 20) val (expectedMode, expectedBytes) = firstBytes20(filename) --- End diff -- Since you use `ocvTypeByName` below to look it up, shouldn't this be named `expectedType` or `expectedTypeName` rather than `expectedMode`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20168: [SPARK-22730][ML] Add ImageSchema support for non...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20168#discussion_r161663481 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala --- @@ -37,20 +37,67 @@ import org.apache.spark.sql.types._ @Since("2.3.0") object ImageSchema { - val undefinedImageType = "Undefined" + /** + * OpenCv type representation --- End diff -- Add a reference link for OpenCV data type? Like this one: https://docs.opencv.org/2.4/modules/core/doc/basic_structures.html --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20168: [SPARK-22730][ML] Add ImageSchema support for non...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20168#discussion_r161664597 --- Diff: python/pyspark/ml/image.py --- @@ -55,25 +72,66 @@ def imageSchema(self): """ if self._imageSchema is None: -ctx = SparkContext._active_spark_context +ctx = SparkContext.getOrCreate() jschema = ctx._jvm.org.apache.spark.ml.image.ImageSchema.imageSchema() self._imageSchema = _parse_datatype_json_string(jschema.json()) return self._imageSchema @property def ocvTypes(self): """ -Returns the OpenCV type mapping supported. +Return the supported OpenCV types. -:return: a dictionary containing the OpenCV type mapping supported. +:return: a list containing the supported OpenCV types. .. versionadded:: 2.3.0 """ if self._ocvTypes is None: -ctx = SparkContext._active_spark_context -self._ocvTypes = dict(ctx._jvm.org.apache.spark.ml.image.ImageSchema.javaOcvTypes()) -return self._ocvTypes +ctx = SparkContext.getOrCreate() +ocvTypeList = ctx._jvm.org.apache.spark.ml.image.ImageSchema.javaOcvTypes() +self._ocvTypes = [self._OcvType(name=x.name(), +mode=x.mode(), +nChannels=x.nChannels(), +dataType=x.dataType(), + nptype=self._ocvToNumpyMap[x.dataType()]) + for x in ocvTypeList] +return self._ocvTypes[:] + + +def ocvTypeByName(self, name): +""" +Return the supported OpenCvType with matching name or raise error if there is no matching type. + +:param: str name: OpenCv type name; must be equal to name of one of the supported types. +:return: OpenCvType with matching name. --- End diff -- `OpenCvType` -> `OcvType`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20249: [SPARK-23057][SPARK-19235][SQL] SET LOCATION should chan...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20249 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86152/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20249: [SPARK-23057][SPARK-19235][SQL] SET LOCATION should chan...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20249 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20249: [SPARK-23057][SPARK-19235][SQL] SET LOCATION should chan...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20249 **[Test build #86152 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86152/testReport)** for PR 20249 at commit [`90c4980`](https://github.com/apache/spark/commit/90c49809886e2f487dc4c4dc6ba45aa16bae8933). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20150: [SPARK-22956][SS] Bug fix for 2 streams union fai...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20150 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20150: [SPARK-22956][SS] Bug fix for 2 streams union failover s...
Github user xuanyuanking commented on the issue: https://github.com/apache/spark/pull/20150 Thanks for your review! Shixiong --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20150: [SPARK-22956][SS] Bug fix for 2 streams union failover s...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/20150 Thanks! Merging to master and 2.3. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20168: [SPARK-22730][ML] Add ImageSchema support for non-intege...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20168 **[Test build #86156 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86156/testReport)** for PR 20168 at commit [`896ccc2`](https://github.com/apache/spark/commit/896ccc21582f1610e38dc91a67eca90c8914e2e5). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20164: [SPARK-22971][ML] OneVsRestModel should use temporary Ra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20164 **[Test build #86157 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86157/testReport)** for PR 20164 at commit [`e44d764`](https://github.com/apache/spark/commit/e44d7647fe6596c70a28527f893bdbdcb373c190). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20164: [SPARK-22971][ML] OneVsRestModel should use temporary Ra...
Github user zhengruifeng commented on the issue: https://github.com/apache/spark/pull/20164 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20258: [SPARK-23060][Python] New feature - apply method to exte...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20258 Oh, I see! Yea, they look quite the same. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20258: [SPARK-23060][Python] New feature - apply method to exte...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/20258 Is this similar to `Dataset.transform()` in the Java/Scala API? But we don't have a similar API for RDDs. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
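For comparison, `Dataset.transform()` on the Scala side is just function application that keeps method chaining fluent; a hypothetical Python equivalent of what this PR is after might look like the sketch below (method name and placement are assumptions, not the PR's actual code):

```python
from pyspark.sql import DataFrame

def transform(self, func):
    """Apply a DataFrame -> DataFrame function, keeping calls chainable."""
    return func(self)

# A hypothetical way to try it out (monkey-patching for illustration only):
DataFrame.transform = transform
# df.transform(with_greeting).transform(lambda d: d.filter(d.id > 0))
```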
[GitHub] spark issue #20164: [SPARK-22971][ML] OneVsRestModel should use temporary Ra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20164 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20164: [SPARK-22971][ML] OneVsRestModel should use temporary Ra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20164 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86151/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20164: [SPARK-22971][ML] OneVsRestModel should use temporary Ra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20164 **[Test build #86151 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86151/testReport)** for PR 20164 at commit [`e44d764`](https://github.com/apache/spark/commit/e44d7647fe6596c70a28527f893bdbdcb373c190). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20232: [SPARK-23042][ML] Use OneHotEncoderModel to encode label...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20232 **[Test build #86155 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86155/testReport)** for PR 20232 at commit [`77af798`](https://github.com/apache/spark/commit/77af798ab356c23cd4a65c6c7c3f93c66c026982). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20138: [SPARK-20664][core] Delete stale application data...
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/20138#discussion_r161660926 --- Diff: core/src/test/scala/org/apache/spark/deploy/history/FsHistoryProviderSuite.scala --- @@ -663,6 +665,95 @@ class FsHistoryProviderSuite extends SparkFunSuite with BeforeAndAfter with Matc freshUI.get.ui.store.job(0) } + test("clean up stale app information") { +val storeDir = Utils.createTempDir() +val conf = createTestConf().set(LOCAL_STORE_DIR, storeDir.getAbsolutePath()) +val provider = spy(new FsHistoryProvider(conf)) +val appId = "new1" + +// Write logs for two app attempts. +doReturn(1L).when(provider).getNewLastScanTime() +val attempt1 = newLogFile(appId, Some("1"), inProgress = false) +writeFile(attempt1, true, None, + SparkListenerApplicationStart(appId, Some(appId), 1L, "test", Some("1")), + SparkListenerJobStart(0, 1L, Nil, null), + SparkListenerApplicationEnd(5L) + ) +val attempt2 = newLogFile(appId, Some("2"), inProgress = false) +writeFile(attempt2, true, None, + SparkListenerApplicationStart(appId, Some(appId), 1L, "test", Some("2")), + SparkListenerJobStart(0, 1L, Nil, null), + SparkListenerApplicationEnd(5L) + ) +updateAndCheck(provider) { list => + assert(list.size === 1) + assert(list(0).id === appId) + assert(list(0).attempts.size === 2) +} + +// Load the app's UI. +val ui = provider.getAppUI(appId, Some("1")) +assert(ui.isDefined) + +// Delete the underlying log file for attempt 1 and rescan. The UI should go away, but since +// attempt 2 still exists, listing data should be there. +doReturn(2L).when(provider).getNewLastScanTime() +attempt1.delete() +updateAndCheck(provider) { list => + assert(list.size === 1) + assert(list(0).id === appId) + assert(list(0).attempts.size === 1) +} +assert(!ui.get.valid) +assert(provider.getAppUI(appId, None) === None) + +// Delete the second attempt's log file. Now everything should go away. +doReturn(3L).when(provider).getNewLastScanTime() +attempt2.delete() +updateAndCheck(provider) { list => + assert(list.isEmpty) +} + } + + test("SPARK-21571: clean up removes invalid history files") { +val clock = new ManualClock(TimeUnit.DAYS.toMillis(120)) +val conf = createTestConf().set("spark.history.fs.cleaner.maxAge", s"2d") +val provider = new FsHistoryProvider(conf, clock) { + override def getNewLastScanTime(): Long = clock.getTimeMillis() +} + +// Create 0-byte size inprogress and complete files +val logfile1 = newLogFile("emptyInprogressLogFile", None, inProgress = true) +logfile1.createNewFile() +logfile1.setLastModified(clock.getTimeMillis()) + +val logfile2 = newLogFile("emptyFinishedLogFile", None, inProgress = false) +logfile2.createNewFile() +logfile2.setLastModified(clock.getTimeMillis()) + +// Create an incomplete log file, has an end record but no start record. +val logfile3 = newLogFile("nonEmptyCorruptLogFile", None, inProgress = false) +writeFile(logfile3, true, None, SparkListenerApplicationEnd(0)) +logfile3.setLastModified(clock.getTimeMillis()) + +provider.checkForLogs() +provider.cleanLogs() +assert(new File(testDir.toURI).listFiles().size === 3) + +// Move the clock forward 1 day and scan the files again. They should still be there. +clock.advance(TimeUnit.DAYS.toMillis(1)) +provider.checkForLogs() +provider.cleanLogs() +assert(new File(testDir.toURI).listFiles().size === 3) + +// Move the clock forward another 2 days and scan the files again. This time the cleaner should +// pick up the invalid files and get rid of them. 
+clock.advance(TimeUnit.DAYS.toMillis(2)) +provider.checkForLogs() +provider.cleanLogs() +assert(new File(testDir.toURI).listFiles().size === 0) --- End diff -- I think you should add a case where one file starts out empty, say even for one full day, but then becomes valid before the expiration time, and make sure it does *not* get cleaned up. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
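To make the suggested extra case concrete, the retention rule under test is roughly the following (a Python paraphrase under assumed names, not the Scala implementation):

```python
def should_clean(log, now_ms, max_age_ms):
    # A log with no application start event is "invalid"; it is cleaned only
    # once it has stayed invalid past maxAge. A log that becomes valid before
    # the cutoff -- the case suggested above -- must survive the cleaner.
    if log.has_app_start_event:
        return False  # valid logs follow the normal attempt-based policy
    return now_ms - log.last_modified_ms > max_age_ms
```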
[GitHub] spark pull request #20232: [SPARK-23042][ML] Use OneHotEncoderModel to encod...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20232#discussion_r161660735 --- Diff: R/pkg/tests/fulltests/test_mllib_classification.R --- @@ -382,10 +382,10 @@ test_that("spark.mlp", { trainidxs <- base::sample(nrow(data), nrow(data) * 0.7) traindf <- as.DataFrame(data[trainidxs, ]) testdf <- as.DataFrame(rbind(data[-trainidxs, ], c(0, "the other"))) - model <- spark.mlp(traindf, clicked ~ ., layers = c(1, 3)) + model <- spark.mlp(traindf, clicked ~ ., layers = c(1, 2)) --- End diff -- Added. Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20211: [SPARK-23011][PYTHON][SQL] Prepend missing groupi...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/20211#discussion_r161659513 --- Diff: python/pyspark/sql/group.py --- @@ -233,6 +233,27 @@ def apply(self, udf): | 2| 1.1094003924504583| +---+---+ +Notes on grouping column: --- End diff -- It's interesting to see the discussion in `SPARK-16258`. I think it's quite hard for the API to cover all cases... but `foo(key, pdf)` is the best option so far, I think. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20171: [SPARK-22978] [PySpark] Register Vectorized UDFs ...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/20171#discussion_r161659245 --- Diff: python/pyspark/sql/catalog.py --- @@ -256,27 +258,58 @@ def registerFunction(self, name, f, returnType=StringType()): >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +>>> from pyspark.sql.types import IntegerType +>>> from pyspark.sql.functions import udf +>>> slen = udf(lambda s: len(s), IntegerType()) +>>> _ = spark.udf.register("slen", slen) +>>> spark.sql("SELECT slen('test')").collect() +[Row(slen(test)=4)] + >>> import random >>> from pyspark.sql.functions import udf ->>> from pyspark.sql.types import IntegerType, StringType +>>> from pyspark.sql.types import IntegerType >>> random_udf = udf(lambda: random.randint(0, 100), IntegerType()).asNondeterministic() ->>> newRandom_udf = spark.catalog.registerFunction("random_udf", random_udf, StringType()) +>>> newRandom_udf = spark.udf.register("random_udf", random_udf) >>> spark.sql("SELECT random_udf()").collect() # doctest: +SKIP -[Row(random_udf()=u'82')] +[Row(random_udf()=82)] >>> spark.range(1).select(newRandom_udf()).collect() # doctest: +SKIP -[Row(random_udf()=u'62')] +[Row(()=26)] + +>>> from pyspark.sql.functions import pandas_udf, PandasUDFType +>>> @pandas_udf("integer", PandasUDFType.SCALAR) # doctest: +SKIP +... def add_one(x): +... return x + 1 +... +>>> _ = spark.udf.register("add_one", add_one) # doctest: +SKIP +>>> spark.sql("SELECT add_one(id) FROM range(3)").collect() # doctest: +SKIP +[Row(add_one(id)=1), Row(add_one(id)=2), Row(add_one(id)=3)] """ # This is to check whether the input function is a wrapped/native UserDefinedFunction if hasattr(f, 'asNondeterministic'): -udf = UserDefinedFunction(f.func, returnType=returnType, name=name, - evalType=PythonEvalType.SQL_BATCHED_UDF, - deterministic=f.deterministic) +if f.evalType not in [PythonEvalType.SQL_BATCHED_UDF, + PythonEvalType.SQL_PANDAS_SCALAR_UDF]: +raise ValueError( +"Invalid f: f must be either SQL_BATCHED_UDF or SQL_PANDAS_SCALAR_UDF") +if returnType is not None and not isinstance(returnType, DataType): +returnType = _parse_datatype_string(returnType) +if returnType is not None and returnType != f.returnType: --- End diff -- I see what you mean. Now I became neutral but slightly on your side. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20211: [SPARK-23011][PYTHON][SQL] Prepend missing groupi...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/20211#discussion_r161659200 --- Diff: python/pyspark/sql/group.py --- @@ -233,6 +233,27 @@ def apply(self, udf): | 2| 1.1094003924504583| +---+---+ +Notes on grouping column: --- End diff -- Sorry for the late reply. I agree with @HyukjinKwon; I think we can support both `foo(pdf)` and `foo(key, pdf)` through inspection, as sketched below. I will try to put up a PR soon. As to how to represent the key, I think a tuple might be enough, but a Row would also work. What do you guys think? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
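A minimal sketch of the inspection-based dispatch being proposed (the helper name is hypothetical):

```python
import inspect

def _wrap_group_map(func):
    # Count positional parameters to decide between foo(pdf) and foo(key, pdf).
    num_args = len(inspect.getfullargspec(func).args)
    if num_args == 1:
        return lambda key, pdf: func(pdf)
    if num_args == 2:
        return func
    raise ValueError("group map UDF must take either (pdf) or (key, pdf)")
```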
[GitHub] spark pull request #19872: [SPARK-22274][PYTHON][SQL] User-defined aggregati...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/19872#discussion_r161658799 --- Diff: python/pyspark/sql/functions.py --- @@ -2214,6 +2216,37 @@ def pandas_udf(f=None, returnType=None, functionType=None): .. seealso:: :meth:`pyspark.sql.GroupedData.apply` +3. GROUP_AGG + + A group aggregate UDF defines a transformation: One or more `pandas.Series` -> A scalar + The returnType should be a primitive data type, e.g, `DoubleType()`. + The returned scalar can be either a python primitive type, e.g., `int` or `float` + or a numpy data type, e.g., `numpy.int64` or `numpy.float64`. + + StructType and ArrayType are currently not supported. + + Group aggregate UDFs are used with :meth:`pyspark.sql.GroupedData.agg` + + >>> from pyspark.sql.functions import pandas_udf, PandasUDFType + >>> df = spark.createDataFrame( + ... [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], + ... ("id", "v")) + >>> @pandas_udf("double", PandasUDFType.GROUP_AGG) + ... def mean_udf(v): --- End diff -- Sorry @cloud-fan, I don't understand this comment, could you elaborate? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20056: [SPARK-22878] [CORE] Count totalDroppedEvents for LiveLi...
Github user squito commented on the issue: https://github.com/apache/spark/pull/20056 I see that `LiveListenerBus.droppedEventsCounter` and `lastReportTimestamp` are unused, so it certainly makes sense to clean them up one way or the other -- but that might mean we should delete them, not that we necessarily need to do something else with them. I could see an argument that there are already monitoring systems hooked up to the old metric, ["numEventsDropped"](https://github.com/apache/spark/blob/718bbc939037929ef5b8f4b4fe10aadfbab4408e/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L266), so maybe we should bring back the total with that metric. But do you really want even more logging of the total, beyond the logging from each queue? Seems like it would only be more confusing to me. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20171: [SPARK-22978] [PySpark] Register Vectorized UDFs ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20171#discussion_r161657719 --- Diff: python/pyspark/sql/catalog.py --- @@ -256,27 +258,58 @@ def registerFunction(self, name, f, returnType=StringType()): >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +>>> from pyspark.sql.types import IntegerType +>>> from pyspark.sql.functions import udf +>>> slen = udf(lambda s: len(s), IntegerType()) +>>> _ = spark.udf.register("slen", slen) +>>> spark.sql("SELECT slen('test')").collect() +[Row(slen(test)=4)] + >>> import random >>> from pyspark.sql.functions import udf ->>> from pyspark.sql.types import IntegerType, StringType +>>> from pyspark.sql.types import IntegerType >>> random_udf = udf(lambda: random.randint(0, 100), IntegerType()).asNondeterministic() ->>> newRandom_udf = spark.catalog.registerFunction("random_udf", random_udf, StringType()) +>>> newRandom_udf = spark.udf.register("random_udf", random_udf) >>> spark.sql("SELECT random_udf()").collect() # doctest: +SKIP -[Row(random_udf()=u'82')] +[Row(random_udf()=82)] >>> spark.range(1).select(newRandom_udf()).collect() # doctest: +SKIP -[Row(random_udf()=u'62')] +[Row(()=26)] + +>>> from pyspark.sql.functions import pandas_udf, PandasUDFType +>>> @pandas_udf("integer", PandasUDFType.SCALAR) # doctest: +SKIP +... def add_one(x): +... return x + 1 +... +>>> _ = spark.udf.register("add_one", add_one) # doctest: +SKIP +>>> spark.sql("SELECT add_one(id) FROM range(3)").collect() # doctest: +SKIP +[Row(add_one(id)=1), Row(add_one(id)=2), Row(add_one(id)=3)] """ # This is to check whether the input function is a wrapped/native UserDefinedFunction if hasattr(f, 'asNondeterministic'): -udf = UserDefinedFunction(f.func, returnType=returnType, name=name, - evalType=PythonEvalType.SQL_BATCHED_UDF, - deterministic=f.deterministic) +if f.evalType not in [PythonEvalType.SQL_BATCHED_UDF, + PythonEvalType.SQL_PANDAS_SCALAR_UDF]: +raise ValueError( +"Invalid f: f must be either SQL_BATCHED_UDF or SQL_PANDAS_SCALAR_UDF") +if returnType is not None and not isinstance(returnType, DataType): +returnType = _parse_datatype_string(returnType) +if returnType is not None and returnType != f.returnType: --- End diff -- An optional value is okay, but I mean it's better to throw an exception; I am not seeing the advantage of supporting this optionally. @ueshin do you think it's better to support this case? I am less sure of the point of supporting `returnType` when `f` is already a UDF and we aren't allowed to change it. It causes confusion: we appear to allow it, but then issue an exception if the type is different. Is it more important to allow this corner case than to keep the API clear, as if we had `def register(name, f)  # for UDF` alone? We can stay explicit about disallowing `returnType` at register time too. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
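Both positions sketched side by side (a hedged paraphrase of the diff and the stricter alternative, not the final code):

```python
# Option A (as in the diff): accept returnType only when it matches the UDF's own.
if returnType is not None and returnType != f.returnType:
    raise ValueError(
        "Invalid returnType: %s does not match the UDF's return type %s"
        % (returnType, f.returnType))

# Option B (the stricter reading argued above): reject it outright, keeping
# register(name, f) the only shape when f is already a UDF.
if returnType is not None:
    raise TypeError("returnType must not be specified when f is already a UDF")
```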
[GitHub] spark issue #20257: [SPARK-23048][ML] Add OneHotEncoderEstimator document an...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20257 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20257: [SPARK-23048][ML] Add OneHotEncoderEstimator document an...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20257 **[Test build #86153 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86153/testReport)** for PR 20257 at commit [`262c046`](https://github.com/apache/spark/commit/262c0461bd5226b2e99ca5b0c35cf2a372a4892c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20257: [SPARK-23048][ML] Add OneHotEncoderEstimator document an...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20257 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86153/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20168: [SPARK-22730][ML] Add ImageSchema support for non-intege...
Github user imatiach-msft commented on the issue: https://github.com/apache/spark/pull/20168 @MrBago @tomasatdatabricks the changes look good to me. I went through everything one more time; I'll sign off as soon as the Python tests are fixed (it looks like there were some style issues in the last commit) and all other dev comments are resolved. Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20023: [SPARK-22036][SQL] Decimal multiplication with hi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20023#discussion_r161656633 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala --- @@ -136,10 +137,52 @@ object DecimalType extends AbstractDataType { case DoubleType => DoubleDecimal } + private[sql] def forLiteral(literal: Literal): DecimalType = literal.value match { +case v: Short => fromBigDecimal(BigDecimal(v)) +case v: Int => fromBigDecimal(BigDecimal(v)) +case v: Long => fromBigDecimal(BigDecimal(v)) +case _ => forType(literal.dataType) + } + + private[sql] def fromBigDecimal(d: BigDecimal): DecimalType = { +DecimalType(Math.max(d.precision, d.scale), d.scale) + } + private[sql] def bounded(precision: Int, scale: Int): DecimalType = { DecimalType(min(precision, MAX_PRECISION), min(scale, MAX_SCALE)) } + /** + * Scale adjustment implementation is based on Hive's one, which is itself inspired by + * SQLServer's one. In particular, when a result precision is greater than + * {@link #MAX_PRECISION}, the corresponding scale is reduced to prevent the integral part of a + * result from being truncated. + * + * This method is used only when `spark.sql.decimalOperations.allowPrecisionLoss` is set to true. + * + * @param precision + * @param scale + * @return + */ + private[sql] def adjustPrecisionScale(precision: Int, scale: Int): DecimalType = { --- End diff -- So the rule in the document is

```
val resultPrecision = 38
if (intDigits < 32) {
  // This means scale > 6, as intDigits = precision - scale and precision > 38
  val maxScale = 38 - intDigits
  val resultScale = min(scale, maxScale)
} else {
  if (scale < 6) {
    // can't round as scale is already small
    val resultScale = scale
  } else {
    val resultScale = 6
  }
}
```

I think this is a little different from the current rule

```
val minScaleValue = Math.min(scale, 6)
val resultScale = max(38 - intDigits, minScaleValue)
```

Think about the case `intDigits < 32`: SQL Server gives `min(scale, 38 - intDigits)`, while we give `38 - intDigits`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
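A quick numeric check of the two formulations discussed above (38 and 6 stand for MAX_PRECISION and the minimum adjusted scale; a comparison sketch, not Spark code):

```python
def sql_server_rule(precision, scale):
    int_digits = precision - scale
    if int_digits < 32:
        return min(scale, 38 - int_digits)
    return scale if scale < 6 else 6

def current_rule(precision, scale):
    int_digits = precision - scale
    return max(38 - int_digits, min(scale, 6))

# e.g. precision=45, scale=20 (intDigits=25): both rules return 13.
# When precision > 38 and intDigits < 32, scale exceeds 38 - intDigits,
# so min(scale, 38 - intDigits) collapses to 38 - intDigits there as well.
```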
[GitHub] spark pull request #20168: [SPARK-22730][ML] Add ImageSchema support for non...
Github user imatiach-msft commented on a diff in the pull request: https://github.com/apache/spark/pull/20168#discussion_r161656541 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala --- @@ -37,20 +37,67 @@ import org.apache.spark.sql.types._ @Since("2.3.0") object ImageSchema { - val undefinedImageType = "Undefined" + /** + * OpenCv type representation + * + * @param mode ordinal for the type + * @param dataType open cv data type + * @param nChannels number of color channels + */ + case class OpenCvType(mode: Int, dataType: String, nChannels: Int) { +def name: String = if (mode == -1) { "Undefined" } else { s"CV_$dataType" + s"C$nChannels" } +override def toString: String = s"OpenCvType(mode = $mode, name = $name)" + } /** - * (Scala-specific) OpenCV type mapping supported + * Return the supported OpenCvType with matching name or raise error if there is no matching type. + * + * @param name: name of existing OpenCvType + * @return OpenCvType that matches the given name */ - val ocvTypes: Map[String, Int] = Map( -undefinedImageType -> -1, -"CV_8U" -> 0, "CV_8UC1" -> 0, "CV_8UC3" -> 16, "CV_8UC4" -> 24 - ) + def ocvTypeByName(name: String): OpenCvType = { +ocvTypes.find(x => x.name == name).getOrElse( + throw new IllegalArgumentException("Unknown open cv type " + name)) --- End diff -- same minor nitpick: "OpenCV" instead of "open cv", and in code below as well --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20232: [SPARK-23042][ML] Use OneHotEncoderModel to encode label...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20232 **[Test build #86154 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86154/testReport)** for PR 20232 at commit [`20bbf64`](https://github.com/apache/spark/commit/20bbf64e0ce99538f80e5b6f360a69de93f4d9fc). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20204: [SPARK-7721][PYTHON][TESTS] Adds PySpark coverage...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/20204#discussion_r161655584 --- Diff: python/run-tests-with-coverage --- @@ -0,0 +1,69 @@ +#!/usr/bin/env bash + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +set -o pipefail +set -e + +# This variable indicates which coverage executable to run to combine coverages +# and generate HTMLs, for example, 'coverage3' in Python 3. +COV_EXEC="${COV_EXEC:-coverage}" +FWDIR="$(cd "`dirname $0`"; pwd)" +pushd "$FWDIR" > /dev/null --- End diff -- Do we need to use `pushd` and its corresponding `popd` at the end of this file? I guess we can simply use `cd` here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20168: [SPARK-22730][ML] Add ImageSchema support for non...
Github user imatiach-msft commented on a diff in the pull request: https://github.com/apache/spark/pull/20168#discussion_r161656419 --- Diff: mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala --- @@ -37,20 +37,67 @@ import org.apache.spark.sql.types._ @Since("2.3.0") object ImageSchema { - val undefinedImageType = "Undefined" + /** + * OpenCv type representation + * + * @param mode ordinal for the type + * @param dataType open cv data type --- End diff -- small nitpick: I think we should always spell it as "OpenCV" to be consistent in the comments (unless you have any good objections) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20168: [SPARK-22730][ML] Add ImageSchema support for non-intege...
Github user imatiach-msft commented on the issue: https://github.com/apache/spark/pull/20168 @MrBago @tomasatdatabricks I think the breaking changes are fine: the code was marked experimental, and it is expected that the interfaces will change a lot initially based on early feedback. The PR looks good to me. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20257: [SPARK-23048][ML] Add OneHotEncoderEstimator document an...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20257 **[Test build #86153 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86153/testReport)** for PR 20257 at commit [`262c046`](https://github.com/apache/spark/commit/262c0461bd5226b2e99ca5b0c35cf2a372a4892c). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20272: [SPARK-23078] [CORE] allow Spark Thrift Server to run in...
Github user ozzieba commented on the issue: https://github.com/apache/spark/pull/20272 I'm getting stuck on https://github.com/apache-spark-on-k8s/spark-integration/blob/master/integration-test/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/KubernetesSuite.scala#L106; I'll look again tomorrow --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20257: [SPARK-23048][ML] Add OneHotEncoderEstimator document an...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20257 @MLnick Thanks for the review. I think I've addressed all the comments. Please take a look at the updates. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20023: [SPARK-22036][SQL] Decimal multiplication with hi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20023#discussion_r161655115 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala --- @@ -243,17 +279,43 @@ object DecimalPrecision extends TypeCoercionRule { // Promote integers inside a binary expression with fixed-precision decimals to decimals, // and fixed-precision decimals in an expression with floats / doubles to doubles case b @ BinaryOperator(left, right) if left.dataType != right.dataType => - (left.dataType, right.dataType) match { -case (t: IntegralType, DecimalType.Fixed(p, s)) => - b.makeCopy(Array(Cast(left, DecimalType.forType(t)), right)) -case (DecimalType.Fixed(p, s), t: IntegralType) => - b.makeCopy(Array(left, Cast(right, DecimalType.forType(t -case (t, DecimalType.Fixed(p, s)) if isFloat(t) => - b.makeCopy(Array(left, Cast(right, DoubleType))) -case (DecimalType.Fixed(p, s), t) if isFloat(t) => - b.makeCopy(Array(Cast(left, DoubleType), right)) -case _ => - b - } + nondecimalLiteralAndDecimal(b).lift((left, right)).getOrElse( +nondecimalNonliteralAndDecimal(b).applyOrElse((left.dataType, right.dataType), + (_: (DataType, DataType)) => b)) } + + /** + * Type coercion for BinaryOperator in which one side is a non-decimal literal numeric, and the + * other side is a decimal. + */ + private def nondecimalLiteralAndDecimal( + b: BinaryOperator): PartialFunction[(Expression, Expression), Expression] = { +// Promote literal integers inside a binary expression with fixed-precision decimals to +// decimals. The precision and scale are the ones needed by the integer value. +case (l: Literal, r) if r.dataType.isInstanceOf[DecimalType] + && l.dataType.isInstanceOf[IntegralType] => + b.makeCopy(Array(Cast(l, DecimalType.forLiteral(l)), r)) --- End diff -- What if we don't do this? Requiring more precision seems OK now that we allow precision loss. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
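For context, `forLiteral`/`fromBigDecimal` above give an integral literal the tightest DecimalType that holds its value, rather than the widest type for its Java type (e.g. DecimalType(20, 0) for any Long). A sketch of that sizing logic (illustrative, not the Spark implementation):

```python
from decimal import Decimal

def decimal_type_for_literal(v):
    # precision = number of significant digits; scale = 0 for integral values
    t = Decimal(v).as_tuple()
    precision = len(t.digits)
    scale = max(-t.exponent, 0)
    return (max(precision, scale), scale)

print(decimal_type_for_literal(5))     # (1, 0) rather than (20, 0) for a Long
print(decimal_type_for_literal(1000))  # (4, 0)
```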
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20208 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86144/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20208 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20208: [SPARK-23007][SQL][TEST] Add schema evolution test suite...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20208 **[Test build #86144 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86144/testReport)** for PR 20208 at commit [`499801e`](https://github.com/apache/spark/commit/499801e7fdd545ac5918dd5f7a9294db2d5373be). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `trait SchemaEvolutionTest extends QueryTest with SQLTestUtils with SharedSQLContext ` * `trait AddColumnEvolutionTest extends SchemaEvolutionTest ` * `trait RemoveColumnEvolutionTest extends SchemaEvolutionTest ` * `trait ChangePositionEvolutionTest extends SchemaEvolutionTest ` * `trait BooleanTypeEvolutionTest extends SchemaEvolutionTest ` * `trait IntegralTypeEvolutionTest extends SchemaEvolutionTest ` * `trait ToDoubleTypeEvolutionTest extends SchemaEvolutionTest ` * `trait ToDecimalTypeEvolutionTest extends SchemaEvolutionTest ` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20171: [SPARK-22978] [PySpark] Register Vectorized UDFs ...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/20171#discussion_r161654514 --- Diff: python/pyspark/sql/catalog.py --- @@ -256,27 +258,58 @@ def registerFunction(self, name, f, returnType=StringType()): >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +>>> from pyspark.sql.types import IntegerType +>>> from pyspark.sql.functions import udf +>>> slen = udf(lambda s: len(s), IntegerType()) +>>> _ = spark.udf.register("slen", slen) +>>> spark.sql("SELECT slen('test')").collect() +[Row(slen(test)=4)] + >>> import random >>> from pyspark.sql.functions import udf ->>> from pyspark.sql.types import IntegerType, StringType +>>> from pyspark.sql.types import IntegerType >>> random_udf = udf(lambda: random.randint(0, 100), IntegerType()).asNondeterministic() ->>> newRandom_udf = spark.catalog.registerFunction("random_udf", random_udf, StringType()) +>>> newRandom_udf = spark.udf.register("random_udf", random_udf) >>> spark.sql("SELECT random_udf()").collect() # doctest: +SKIP -[Row(random_udf()=u'82')] +[Row(random_udf()=82)] >>> spark.range(1).select(newRandom_udf()).collect() # doctest: +SKIP -[Row(random_udf()=u'62')] +[Row(()=26)] + +>>> from pyspark.sql.functions import pandas_udf, PandasUDFType +>>> @pandas_udf("integer", PandasUDFType.SCALAR) # doctest: +SKIP +... def add_one(x): +... return x + 1 +... +>>> _ = spark.udf.register("add_one", add_one) # doctest: +SKIP +>>> spark.sql("SELECT add_one(id) FROM range(3)").collect() # doctest: +SKIP +[Row(add_one(id)=1), Row(add_one(id)=2), Row(add_one(id)=3)] """ # This is to check whether the input function is a wrapped/native UserDefinedFunction if hasattr(f, 'asNondeterministic'): -udf = UserDefinedFunction(f.func, returnType=returnType, name=name, - evalType=PythonEvalType.SQL_BATCHED_UDF, - deterministic=f.deterministic) +if f.evalType not in [PythonEvalType.SQL_BATCHED_UDF, + PythonEvalType.SQL_PANDAS_SCALAR_UDF]: +raise ValueError( +"Invalid f: f must be either SQL_BATCHED_UDF or SQL_PANDAS_SCALAR_UDF") +if returnType is not None and not isinstance(returnType, DataType): +returnType = _parse_datatype_string(returnType) +if returnType is not None and returnType != f.returnType: --- End diff -- I might miss something but I think it's okay to take `returnType` parameter optionally if the value is the same as the udf's. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20265: [SPARK-21783][SQL] Turn on ORC filter push-down by defau...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20265 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86143/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20265: [SPARK-21783][SQL] Turn on ORC filter push-down by defau...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20265 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20265: [SPARK-21783][SQL] Turn on ORC filter push-down by defau...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20265 **[Test build #86143 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86143/testReport)** for PR 20265 at commit [`440f76b`](https://github.com/apache/spark/commit/440f76bdbf4d720a361e0afde3599027ff6e7be2). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20171: [SPARK-22978] [PySpark] Register Vectorized UDFs ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20171#discussion_r161654136 --- Diff: python/pyspark/sql/catalog.py --- @@ -256,27 +258,58 @@ def registerFunction(self, name, f, returnType=StringType()): >>> spark.sql("SELECT stringLengthInt('test')").collect() [Row(stringLengthInt(test)=4)] +>>> from pyspark.sql.types import IntegerType +>>> from pyspark.sql.functions import udf +>>> slen = udf(lambda s: len(s), IntegerType()) +>>> _ = spark.udf.register("slen", slen) +>>> spark.sql("SELECT slen('test')").collect() +[Row(slen(test)=4)] + >>> import random >>> from pyspark.sql.functions import udf ->>> from pyspark.sql.types import IntegerType, StringType +>>> from pyspark.sql.types import IntegerType >>> random_udf = udf(lambda: random.randint(0, 100), IntegerType()).asNondeterministic() ->>> newRandom_udf = spark.catalog.registerFunction("random_udf", random_udf, StringType()) +>>> newRandom_udf = spark.udf.register("random_udf", random_udf) >>> spark.sql("SELECT random_udf()").collect() # doctest: +SKIP -[Row(random_udf()=u'82')] +[Row(random_udf()=82)] >>> spark.range(1).select(newRandom_udf()).collect() # doctest: +SKIP -[Row(random_udf()=u'62')] +[Row(()=26)] + +>>> from pyspark.sql.functions import pandas_udf, PandasUDFType +>>> @pandas_udf("integer", PandasUDFType.SCALAR) # doctest: +SKIP +... def add_one(x): +... return x + 1 +... +>>> _ = spark.udf.register("add_one", add_one) # doctest: +SKIP +>>> spark.sql("SELECT add_one(id) FROM range(3)").collect() # doctest: +SKIP +[Row(add_one(id)=1), Row(add_one(id)=2), Row(add_one(id)=3)] """ # This is to check whether the input function is a wrapped/native UserDefinedFunction if hasattr(f, 'asNondeterministic'): -udf = UserDefinedFunction(f.func, returnType=returnType, name=name, - evalType=PythonEvalType.SQL_BATCHED_UDF, - deterministic=f.deterministic) +if f.evalType not in [PythonEvalType.SQL_BATCHED_UDF, + PythonEvalType.SQL_PANDAS_SCALAR_UDF]: +raise ValueError( +"Invalid f: f must be either SQL_BATCHED_UDF or SQL_PANDAS_SCALAR_UDF") +if returnType is not None and not isinstance(returnType, DataType): +returnType = _parse_datatype_string(returnType) +if returnType is not None and returnType != f.returnType: --- End diff -- Is it common in our current PySpark impl? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20273: [SPARK-23000] Use fully qualified table names in ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20273 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20266: [SPARK-23072][SQL][TEST] Add a Unicode schema test for f...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20266 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20266: [SPARK-23072][SQL][TEST] Add a Unicode schema tes...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20266#discussion_r161653628 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala --- @@ -0,0 +1,66 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql + +import org.apache.spark.sql.test.SharedSQLContext + +class FileBasedDataSourceSuite extends QueryTest with SharedSQLContext { + import testImplicits._ + + Seq("orc", "parquet", "csv", "json", "text").foreach { format => +test(s"Writing empty datasets should not fail - $format") { + withTempDir { dir => + Seq("str").toDS.limit(0).write.format(format).save(dir.getCanonicalPath + "/tmp") + } +} + } + + Seq("orc", "parquet", "csv", "json").foreach { format => +test(s"Write and read back unicode schema - $format") { + withTempPath { path => +val dir = path.getCanonicalPath + +// scalastyle:off nonascii +val df = Seq("a").toDF("íê¸") +// scalastyle:on nonascii + +df.write.format(format).option("header", "true").save(dir) +val answerDf = spark.read.format(format).option("header", "true").load(dir) + +assert(df.schema === answerDf.schema) +checkAnswer(df, answerDf) + } +} + } + + // Only New OrcFileFormat supports this + Seq(classOf[org.apache.spark.sql.execution.datasources.orc.OrcFileFormat].getCanonicalName, --- End diff -- `spark.sql.orc.impl` is native by default, can we just use "orc" here? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org