[GitHub] spark issue #21416: [SPARK-24371] [SQL] Added isInCollection in DataFrame AP...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/21416 LGTM
[GitHub] spark issue #21420: [SPARK-24377][Spark Submit] make --py-files work in non ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21420 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91207/ Test PASSed.
[GitHub] spark issue #21420: [SPARK-24377][Spark Submit] make --py-files work in non ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21420 Merged build finished. Test PASSed.
[GitHub] spark issue #21420: [SPARK-24377][Spark Submit] make --py-files work in non ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21420 **[Test build #91207 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91207/testReport)** for PR 21420 at commit [`e66ea49`](https://github.com/apache/spark/commit/e66ea49000860d593074296b2a86e8bbdf5f0261).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #21010: [SPARK-23900][SQL] format_number support user specified f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21010 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3631/ Test PASSed.
[GitHub] spark issue #21010: [SPARK-23900][SQL] format_number support user specified f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21010 Merged build finished. Test PASSed.
[GitHub] spark issue #21010: [SPARK-23900][SQL] format_number support user specified f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21010 **[Test build #91214 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91214/testReport)** for PR 21010 at commit [`9ccb648`](https://github.com/apache/spark/commit/9ccb6488f6f8309e0cfa71c4b332e6d680f24ffa).
[GitHub] spark pull request #20929: [SPARK-23772][SQL] Provide an option to ignore co...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/20929#discussion_r191119148

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2408,4 +2408,24 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
       spark.read.option("mode", "PERMISSIVE").option("encoding", "UTF-8").json(Seq(badJson).toDS()),
       Row(badJson))
   }
+
+  test("SPARK-23772 ignore column of all null values or empty array during schema inference") {
+    withTempPath { tempDir =>
+      val path = tempDir.getAbsolutePath
+      Seq(
+        """{"a":null, "b":[null, null], "c":null, "d":[[], [null]], "e":{}}""",
+        """{"a":null, "b":[null], "c":[], "d": [null, []], "e":{}}""",
+        """{"a":null, "b":[], "c":[], "d": null, "e":null}""")

--- End diff --

ok
[GitHub] spark pull request #20929: [SPARK-23772][SQL] Provide an option to ignore co...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/20929#discussion_r191118958

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -379,6 +379,8 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
    * that should be used for parsing.
    * `samplingRatio` (default is 1.0): defines fraction of input JSON objects used
    * for schema inferring.
+   * `dropFieldIfAllNull` (default `false`): whether to ignore column of all null values or

--- End diff --

ok
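For context, a minimal sketch of how the option documented above would be used, assuming the name stays `dropFieldIfAllNull` (the PR was still open at the time of this thread, so treat it as proposed, not released, API; the input path is a placeholder):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[1]").appName("drop-all-null-fields").getOrCreate()

val df = spark.read
  .option("dropFieldIfAllNull", "true")  // drop columns whose values are all null / all empty arrays
  .json("/tmp/input.json")               // hypothetical path

df.printSchema()  // inferred schema would omit the all-null columns
```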
[GitHub] spark pull request #20929: [SPARK-23772][SQL] Provide an option to ignore co...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/20929#discussion_r191118916

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala ---
@@ -2408,4 +2408,24 @@ class JsonSuite extends QueryTest with SharedSQLContext with TestJsonData {
       spark.read.option("mode", "PERMISSIVE").option("encoding", "UTF-8").json(Seq(badJson).toDS()),
       Row(badJson))
   }
+
+  test("SPARK-23772 ignore column of all null values or empty array during schema inference") {
+    withTempPath { tempDir =>
+      val path = tempDir.getAbsolutePath
+      Seq(
+        """{"a":null, "b":[null, null], "c":null, "d":[[], [null]], "e":{}}""",
+        """{"a":null, "b":[null], "c":[], "d": [null, []], "e":{}}""",
+        """{"a":null, "b":[], "c":[], "d": null, "e":null}""")
+        .toDS().write.mode("overwrite").text(path)

--- End diff --

ok
[GitHub] spark pull request #21424: [SPARK-24379] BroadcastExchangeExec should catch ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21424#discussion_r191118494

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala ---
@@ -106,11 +108,20 @@ private[execution] object HashedRelation {
         1),
       0)
     }
-
-    if (key.length == 1 && key.head.dataType == LongType) {
-      LongHashedRelation(input, key, sizeEstimate, mm)
-    } else {
-      UnsafeHashedRelation(input, key, sizeEstimate, mm)
+    try {
+      if (key.length == 1 && key.head.dataType == LongType) {
+        LongHashedRelation(input, key, sizeEstimate, mm)
+      } else {
+        UnsafeHashedRelation(input, key, sizeEstimate, mm)
+      }
+    } catch {
+      case oe: SparkOutOfMemoryError =>
+        throw new SparkOutOfMemoryError(s"If this SparkOutOfMemoryError happens in Spark driver," +

--- End diff --

it seems we don't need to change anything, maybe just add some comments to say where OOM can occur, i.e. `RDD#collect` and `BroadcastMode#transform`
[GitHub] spark issue #21416: [SPARK-24371] [SQL] Added isInCollection in DataFrame AP...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21416 LGTM
[GitHub] spark pull request #21379: [SPARK-24327][SQL] Add an option to quote a parti...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21379#discussion_r191117129

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala ---
@@ -78,7 +79,12 @@ private[sql] object JDBCRelation extends Logging {
     // Overflow and silliness can happen if you subtract then divide.
     // Here we get a little roundoff, but that's (hopefully) OK.
     val stride: Long = upperBound / numPartitions - lowerBound / numPartitions
-    val column = partitioning.column
+    val column = if (jdbcOptions.quotePartitionColumnName) {
+      val dialect = JdbcDialects.get(jdbcOptions.url)
+      dialect.quoteIdentifier(partitioning.column)

--- End diff --

[The latest fix](https://github.com/apache/spark/pull/21379/commits/8d5fa9a25dff75327e8ff8ff13b756b895e6e512) changes an existing behaviour (when quoting non-partition column names), so I'm not sure this fix is acceptable. Any suggestion?
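For readers following the quoting discussion: the `quotePartitionColumnName` option is only proposed in this PR, but the dialect call it relies on is existing public API. A small sketch (the JDBC URL below is a placeholder):

```scala
import org.apache.spark.sql.jdbc.JdbcDialects

// The dialect is resolved from the JDBC URL; quoteIdentifier wraps a column
// name in the dialect's quote character so reserved words survive in SQL.
val dialect = JdbcDialects.get("jdbc:postgresql://localhost/test")
val quoted = dialect.quoteIdentifier("order")  // a reserved word used as a column name
println(quoted)  // the name wrapped in double quotes for the Postgres dialect
```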
[GitHub] spark issue #21379: [SPARK-24327][SQL] Add an option to quote a partition co...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21379 Merged build finished. Test PASSed.
[GitHub] spark issue #21379: [SPARK-24327][SQL] Add an option to quote a partition co...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21379 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3630/ Test PASSed.
[GitHub] spark issue #21379: [SPARK-24327][SQL] Add an option to quote a partition co...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21379 **[Test build #91213 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91213/testReport)** for PR 21379 at commit [`8d5fa9a`](https://github.com/apache/spark/commit/8d5fa9a25dff75327e8ff8ff13b756b895e6e512).
[GitHub] spark issue #21416: [SPARK-24371] [SQL] Added isInCollection in DataFrame AP...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21416 Merged build finished. Test PASSed.
[GitHub] spark issue #21416: [SPARK-24371] [SQL] Added isInCollection in DataFrame AP...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21416 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3629/ Test PASSed.
[GitHub] spark issue #21416: [SPARK-24371] [SQL] Added isInCollection in DataFrame AP...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21416 **[Test build #91212 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91212/testReport)** for PR 21416 at commit [`1332406`](https://github.com/apache/spark/commit/1332406d7f4ca7a9a4a85338f758430ecc334ff8).
[GitHub] spark pull request #21383: [SPARK-23754][Python] Re-raising StopIteration in...
Github user e-dorigatti commented on a diff in the pull request: https://github.com/apache/spark/pull/21383#discussion_r191113977

--- Diff: python/pyspark/util.py ---
@@ -89,6 +93,33 @@ def majorMinorVersion(sparkVersion):
                          " version numbers.")
 
+def fail_on_stopiteration(f):
+    """
+    Wraps the input function to fail on 'StopIteration' by raising a 'RuntimeError'
+    prevents silent loss of data when 'f' is used in a for loop
+    """
+    def wrapper(*args, **kwargs):
+        try:
+            return f(*args, **kwargs)
+        except StopIteration as exc:
+            raise RuntimeError(
+                "Caught StopIteration thrown from user's code; failing the task",
+                exc
+            )
+
+    # prevent inspect to fail
+    # e.g. inspect.getargspec(sum) raises
+    # TypeError: <built-in function sum> is not a Python function
+    try:
+        argspec = _get_argspec(f)

--- End diff --

You said to do it in `udf.UserDefinedFunction._create_judf`, but sent the code of `udf._create_udf`. I assume you meant the former, since we cannot do that in `_create_udf` (`UserDefinedFunction._wrapped` needs the original function for its documentation and other stuff). I will also simplify the code as you suggested, yes.
[GitHub] spark issue #21416: [SPARK-24371] [SQL] Added isInCollection in DataFrame AP...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/21416

@cloud-fan unfortunately, a Scala vararg method cannot be overloaded with a collection overload here; Scala returns the following error:

```scala
Error:(410, 32) ambiguous reference to overloaded definition,
both method isin in class Column of type (values: Iterable[_])org.apache.spark.sql.Column
and  method isin in class Column of type (list: Any*)org.apache.spark.sql.Column
match argument types (Seq[Int])
    checkAnswer(df.filter($"a".isin(Seq(1, 2))),
```

I'm leaning toward using `isInCollection` now, and implementing the corresponding Python APIs.
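As a sketch of the two call styles under discussion, assuming the `isInCollection(values: Iterable[_])` signature proposed in this PR (not yet in a release at the time of the thread), with `df` a hypothetical DataFrame holding an integer column `a`:

```scala
import org.apache.spark.sql.functions.col

// Vararg form: fine for literal values, but ambiguous once an Iterable
// overload of `isin` is added, as the compiler error above shows.
val byVarargs = df.filter(col("a").isin(1, 2, 3))

// Proposed collection form: a distinct name avoids the vararg-overload ambiguity.
val byCollection = df.filter(col("a").isInCollection(Seq(1, 2, 3)))
```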
[GitHub] spark issue #21383: [SPARK-23754][Python] Re-raising StopIteration in client...
Github user e-dorigatti commented on the issue: https://github.com/apache/spark/pull/21383 Yes, the problem was that the signature is lost when the function is wrapped, and the worker needs the signature to know whether the function needs keys together with values or not. What I meant is that fixing the worker side might fix this specific JIRA, but the bug will still occur when a UDF is used not in a worker but somewhere else. If you fix it at the UDF side, however, the bug will never occur again, regardless of how and where the UDF is used. I don't know the codebase well enough to know whether this will be a real problem, though.
[GitHub] spark issue #16478: [SPARK-7768][SQL] Revise user defined types (UDT)
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16478 @metasim Thanks for the comment. This patch takes the approach of using Scala data types in UDTs and letting SparkSQL's encoder convert user data to the internal format. I'm not sure which part of the serialization/deserialization cost you'd like to avoid. Following the link to `InternalRowTile`, it seems to wrap an `InternalRow` and access some fields in the row. You still need to deserialize to `InternalRow` before accessing.
[GitHub] spark issue #21435: [SPARK-24392][PYTHON] Label pandas_udf as Experimental
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21435 Merged to master and branch-2.3. There were minor conflicts but I just resolved them myself.
[GitHub] spark pull request #21435: [SPARK-24392][PYTHON] Label pandas_udf as Experim...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21435
[GitHub] spark issue #21435: [SPARK-24392][PYTHON] Label pandas_udf as Experimental
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21435 Thanks @BryanCutler. still LGTM
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21288 Merged build finished. Test PASSed.
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21288 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3628/ Test PASSed.
[GitHub] spark issue #21435: [SPARK-24392][PYTHON] Label pandas_udf as Experimental
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21435 Merged build finished. Test PASSed.
[GitHub] spark issue #21435: [SPARK-24392][PYTHON] Label pandas_udf as Experimental
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21435 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91209/ Test PASSed.
[GitHub] spark issue #21435: [SPARK-24392][PYTHON] Label pandas_udf as Experimental
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21435 **[Test build #91209 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91209/testReport)** for PR 21435 at commit [`6b87330`](https://github.com/apache/spark/commit/6b873309b3db7022cfe47561c47d7ae320a46d65).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21288 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3627/ Test FAILed.
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21288 **[Test build #91211 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91211/testReport)** for PR 21288 at commit [`2c0d5cb`](https://github.com/apache/spark/commit/2c0d5cbf51268540653543b96de135a6923c6cef).
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21288 Merged build finished. Test FAILed.
[GitHub] spark issue #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmark bench...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21288 **[Test build #91210 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91210/testReport)** for PR 21288 at commit [`b7859ed`](https://github.com/apache/spark/commit/b7859ed0905ce3e0476e5d327f65798acc7aba8c).
[GitHub] spark pull request #21288: [SPARK-24206][SQL] Improve FilterPushdownBenchmar...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/21288#discussion_r191109472

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala ---
@@ -0,0 +1,437 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.benchmark
+
+import java.io.File
+
+import scala.util.{Random, Try}
+
+import org.apache.spark.SparkConf
+import org.apache.spark.sql.{DataFrame, SparkSession}
+import org.apache.spark.sql.functions.monotonically_increasing_id
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.util.{Benchmark, Utils}
+
+
+/**
+ * Benchmark to measure read performance with Filter pushdown.
+ * To run this:
+ *  spark-submit --class <this class> <spark sql test jar>
+ */
+object FilterPushdownBenchmark {
+  val conf = new SparkConf()
+    .setAppName("FilterPushdownBenchmark")
+    .setIfMissing("spark.master", "local[1]")
+    .setIfMissing("spark.driver.memory", "3g")
+    .setIfMissing("spark.executor.memory", "3g")
+    .setIfMissing("orc.compression", "snappy")
+    .setIfMissing("spark.sql.parquet.compression.codec", "snappy")
+
+  private val spark = SparkSession.builder().config(conf).getOrCreate()
+
+  def withTempPath(f: File => Unit): Unit = {
+    val path = Utils.createTempDir()
+    path.delete()
+    try f(path) finally Utils.deleteRecursively(path)
+  }
+
+  def withTempTable(tableNames: String*)(f: => Unit): Unit = {
+    try f finally tableNames.foreach(spark.catalog.dropTempView)
+  }
+
+  def withSQLConf(pairs: (String, String)*)(f: => Unit): Unit = {
+    val (keys, values) = pairs.unzip
+    val currentValues = keys.map(key => Try(spark.conf.get(key)).toOption)
+    (keys, values).zipped.foreach(spark.conf.set)
+    try f finally {
+      keys.zip(currentValues).foreach {
+        case (key, Some(value)) => spark.conf.set(key, value)
+        case (key, None) => spark.conf.unset(key)
+      }
+    }
+  }
+
+  private def prepareTable(
+      dir: File, numRows: Int, width: Int, useStringForValue: Boolean): Unit = {
+    import spark.implicits._
+    val selectExpr = (1 to width).map(i => s"CAST(value AS STRING) c$i")
+    val valueCol = if (useStringForValue) {
+      monotonically_increasing_id().cast("string")
+    } else {
+      monotonically_increasing_id()
+    }
+    val df = spark.range(numRows).map(_ => Random.nextLong).selectExpr(selectExpr: _*)
+      .withColumn("value", valueCol)
+      .sort("value")
+
+    saveAsOrcTable(df, dir.getCanonicalPath + "/orc")
+    saveAsParquetTable(df, dir.getCanonicalPath + "/parquet")
+  }
+
+  private def prepareStringDictTable(
+      dir: File, numRows: Int, numDistinctValues: Int, width: Int): Unit = {
+    val selectExpr = (0 to width).map {
+      case 0 => s"CAST(id % $numDistinctValues AS STRING) AS value"
+      case i => s"CAST(rand() AS STRING) c$i"
+    }
+    val df = spark.range(numRows).selectExpr(selectExpr: _*).sort("value")
+
+    saveAsOrcTable(df, dir.getCanonicalPath + "/orc")
+    saveAsParquetTable(df, dir.getCanonicalPath + "/parquet")
+  }
+
+  private def saveAsOrcTable(df: DataFrame, dir: String): Unit = {
+    df.write.mode("overwrite").orc(dir)
+    spark.read.orc(dir).createOrReplaceTempView("orcTable")
+  }
+
+  private def saveAsParquetTable(df: DataFrame, dir: String): Unit = {
+    df.write.mode("overwrite").parquet(dir)
+    spark.read.parquet(dir).createOrReplaceTempView("parquetTable")
+  }
+
+  def filterPushDownBenchmark(
+      values: Int,
+      title: String,
+      whereExpr: String,
+      selectExpr: String = "*"): Unit = {
+    val benchmark = new Benchmark(title, values, minNumIters = 5)
+
+    Seq(false, true).foreach { pushDownEnabled =>
+      val name
[GitHub] spark issue #21435: [SPARK-24392][PYTHON] Label pandas_udf as Experimental
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21435 Merged build finished. Test PASSed.
[GitHub] spark issue #21435: [SPARK-24392][PYTHON] Label pandas_udf as Experimental
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21435 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3626/ Test PASSed.
[GitHub] spark pull request #21432: [SPARK-24373][SQL] Add AnalysisBarrier to Relatio...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21432
[GitHub] spark issue #21435: [SPARK-24392][PYTHON] Label pandas_udf as Experimental
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21435 **[Test build #91209 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91209/testReport)** for PR 21435 at commit [`6b87330`](https://github.com/apache/spark/commit/6b873309b3db7022cfe47561c47d7ae320a46d65).
[GitHub] spark pull request #21435: [SPARK-24392][PYTHON] Label pandas_udf as Experim...
Github user BryanCutler commented on a diff in the pull request: https://github.com/apache/spark/pull/21435#discussion_r191107787

--- Diff: docs/sql-programming-guide.md ---
@@ -1827,6 +1827,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see
   - Since Spark 2.0, Spark converts Parquet Hive tables by default for better performance. Since Spark 2.4, Spark converts ORC Hive tables by default, too. It means Spark uses its own ORC support by default instead of Hive SerDe. As an example, `CREATE TABLE t(id int) STORED AS ORC` would be handled with Hive SerDe in Spark 2.3, and in Spark 2.4, it would be converted into Spark's ORC data source table and ORC vectorization would be applied. To set `false` to `spark.sql.hive.convertMetastoreOrc` restores the previous behavior.
   - In version 2.3 and earlier, CSV rows are considered as malformed if at least one column value in the row is malformed. CSV parser dropped such rows in the DROPMALFORMED mode or outputs an error in the FAILFAST mode. Since Spark 2.4, CSV row is considered as malformed only when it contains malformed column values requested from CSV datasource, other values can be ignored. As an example, CSV file contains the "id,name" header and one row "1234". In Spark 2.4, selection of the id column consists of a row with one column value 1234 but in Spark 2.3 and earlier it is empty in the DROPMALFORMED mode. To restore the previous behavior, set `spark.sql.csv.parser.columnPruning.enabled` to `false`.
+
+## Upgrading From Spark SQL 2.3.0 to 2.3.1 and Above

--- End diff --

sure
[GitHub] spark issue #21432: [SPARK-24373][SQL] Add AnalysisBarrier to RelationalGrou...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21432 thanks, merging to master/2.3!
[GitHub] spark pull request #21424: [SPARK-24379] BroadcastExchangeExec should catch ...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/21424#discussion_r191107018

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala ---
@@ -106,11 +108,20 @@ private[execution] object HashedRelation {
         1),
       0)
     }
-
-    if (key.length == 1 && key.head.dataType == LongType) {
-      LongHashedRelation(input, key, sizeEstimate, mm)
-    } else {
-      UnsafeHashedRelation(input, key, sizeEstimate, mm)
+    try {
+      if (key.length == 1 && key.head.dataType == LongType) {
+        LongHashedRelation(input, key, sizeEstimate, mm)
+      } else {
+        UnsafeHashedRelation(input, key, sizeEstimate, mm)
+      }
+    } catch {
+      case oe: SparkOutOfMemoryError =>
+        throw new SparkOutOfMemoryError(s"If this SparkOutOfMemoryError happens in Spark driver," +

--- End diff --

So, should I change it back?
[GitHub] spark issue #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/ParquetFi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21389 Merged build finished. Test PASSed.
[GitHub] spark issue #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/ParquetFi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21389 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3625/ Test PASSed.
[GitHub] spark issue #21389: [SPARK-24204][SQL] Verify a schema in Json/Orc/ParquetFi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21389 **[Test build #91208 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91208/testReport)** for PR 21389 at commit [`04f4028`](https://github.com/apache/spark/commit/04f40281e2a457ea27d425b5b1db0e07a0150aaf).
[GitHub] spark issue #21416: [SPARK-24371] [SQL] Added isInCollection in DataFrame AP...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/21416 Yes. The design of the PySpark API seems to have been a bit different from the Scala/Java API from the beginning. If we are going to make them consistent, we either break Scala queries like `col.isin(Array[Byte](1,2,3))` or PySpark queries like `col.isin([1, 2, 3])`.
[GitHub] spark issue #21319: [SPARK-24267][SQL] explicitly keep DataSourceReader in D...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21319 Anyway, I think moving statistics to the physical plan is the ultimate solution; all others are workarounds, and we should pick the simplest workaround. I'm glad to take your stats visitor approach if it's simpler.
[GitHub] spark issue #21319: [SPARK-24267][SQL] explicitly keep DataSourceReader in D...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21319

> but in the first pass I see that it breaks equality by relying on a user-supplied reader instance.

Can you say more about this? I explicitly mentioned it in the PR description that

> keep DataSourceReader as an optional parameter in the constructor of DataSourceV2Relation, exclude it in the equality definition but include it when copying.
[GitHub] spark pull request #19602: [SPARK-22384][SQL] Refine partition pruning when ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19602#discussion_r191104615

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala ---
@@ -657,18 +656,46 @@ private[client] class Shim_v0_13 extends Shim_v0_12 {
     val useAdvanced = SQLConf.get.advancedPartitionPredicatePushdownEnabled
 
+    object ExtractAttribute {
+      def unapply(expr: Expression): Option[Attribute] = {
+        expr match {
+          case attr: Attribute => Some(attr)
+          case cast @ Cast(child, dt: StringType, _) if child.dataType.isInstanceOf[NumericType] =>
+            unapply(child)
+          case cast @ Cast(child, dt: NumericType, _) if child.dataType == StringType =>

--- End diff --

I think we should only support safe upcast, `cast("1.234" as int)` should be excluded. Can we follow `Cast.mayTruncate` strictly?
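A sketch of the guard being suggested, assuming `Cast.mayTruncate(from, to)` keeps its Catalyst-internal meaning ("this cast may lose information"); illustrative only, since the exact predicate was still under review:

```scala
import org.apache.spark.sql.catalyst.expressions.Cast
import org.apache.spark.sql.types.{DataType, IntegerType, LongType, StringType}

// Only strip a cast for metastore pushdown when it cannot truncate the value:
// e.g. int -> long is a safe upcast, string -> int (cast("1.234" as int)) is not.
def safeToStrip(from: DataType, to: DataType): Boolean = !Cast.mayTruncate(from, to)

println(safeToStrip(IntegerType, LongType))   // expected: true
println(safeToStrip(StringType, IntegerType)) // expected: false
```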
[GitHub] spark pull request #21424: [SPARK-24379] BroadcastExchangeExec should catch ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21424#discussion_r191104413

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala ---
@@ -106,11 +108,20 @@ private[execution] object HashedRelation {
         1),
       0)
     }
-
-    if (key.length == 1 && key.head.dataType == LongType) {
-      LongHashedRelation(input, key, sizeEstimate, mm)
-    } else {
-      UnsafeHashedRelation(input, key, sizeEstimate, mm)
+    try {
+      if (key.length == 1 && key.head.dataType == LongType) {
+        LongHashedRelation(input, key, sizeEstimate, mm)
+      } else {
+        UnsafeHashedRelation(input, key, sizeEstimate, mm)
+      }
+    } catch {
+      case oe: SparkOutOfMemoryError =>
+        throw new SparkOutOfMemoryError(s"If this SparkOutOfMemoryError happens in Spark driver," +

--- End diff --

ah i see. So the `SparkOutOfMemoryError` is thrown by `BytesToBytesMap`; we need to catch and rethrow it to attach the error message anyway. I also found that we may throw OOM when calling `child.executeCollectIterator`, which calls `RDD#collect`, so it seems the previous code was correct.
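A compact sketch of the catch-and-rethrow pattern the thread converges on; the wrapper name and message text are illustrative, not the merged code:

```scala
import org.apache.spark.memory.SparkOutOfMemoryError

// `buildRelation` stands in for LongHashedRelation(...) / UnsafeHashedRelation(...).
def buildWithContext[T](buildRelation: () => T): T =
  try {
    buildRelation()
  } catch {
    case oe: SparkOutOfMemoryError =>
      // Rethrow with context so a driver-side OOM during broadcast is actionable.
      throw new SparkOutOfMemoryError(
        "Not enough memory to build the hashed relation for broadcast: " + oe.getMessage)
  }
```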
[GitHub] spark pull request #21397: [SPARK-24334] Fix race condition in ArrowPythonRu...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21397
[GitHub] spark issue #21416: [SPARK-24371] [SQL] Added isInCollection in DataFrame AP...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21416 @viirya good point. One thing I'm not sure about is, does `isin(collection: Iterable)` conflict with `isin(list: Any*)`? If they don't conflict, then we can follow PySpark.
[GitHub] spark issue #21397: [SPARK-24334] Fix race condition in ArrowPythonRunner ca...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21397 Merged to master and branch-2.3.
[GitHub] spark pull request #21416: [SPARK-24371] [SQL] Added isInCollection in DataF...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21416#discussion_r191101660

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala ---
@@ -219,7 +218,14 @@ object ReorderAssociativeOperator extends Rule[LogicalPlan] {
 object OptimizeIn extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
     case q: LogicalPlan => q transformExpressionsDown {
-      case In(v, list) if list.isEmpty && !v.nullable => FalseLiteral
+      case In(v, list) if list.isEmpty =>

--- End diff --

this improvement looks reasonable, but can we move it to a separate PR? it's not related to adding `isInCollection`.
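The diff elides the body of the new rule, so the following is only a hedged guess at the shape of the empty-`IN` rewrite being reviewed, with SQL three-valued logic kept intact:

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, If, IsNull, Literal}
import org.apache.spark.sql.types.BooleanType

// `v IN ()` is false for a non-null v; for a nullable v, the result must
// stay null when v is null, per SQL three-valued logic.
def rewriteEmptyIn(v: Expression): Expression =
  if (!v.nullable) Literal(false, BooleanType)
  else If(IsNull(v), Literal(null, BooleanType), Literal(false, BooleanType))
```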
[GitHub] spark issue #18390: [SPARK-21178][ML] Add support for label specific metrics...
Github user thesuperzapper commented on the issue: https://github.com/apache/spark/pull/18390 @MLnick @WeichenXu123 sorry to ping again, but what is currently stopping this from merging?
[GitHub] spark issue #21420: [SPARK-24377][Spark Submit] make --py-files work in non ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21420 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3624/ Test PASSed.
[GitHub] spark issue #21420: [SPARK-24377][Spark Submit] make --py-files work in non ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21420 Merged build finished. Test PASSed.
[GitHub] spark issue #21420: [SPARK-24377][Spark Submit] make --py-files work in non ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21420 **[Test build #91207 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91207/testReport)** for PR 21420 at commit [`e66ea49`](https://github.com/apache/spark/commit/e66ea49000860d593074296b2a86e8bbdf5f0261).
[GitHub] spark pull request #21317: [SPARK-24232][k8s] Add support for secret env var...
Github user skonto commented on a diff in the pull request: https://github.com/apache/spark/pull/21317#discussion_r191092872

--- Diff: resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/SecretEnvUtils.scala ---
@@ -0,0 +1,28 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.deploy.k8s
+
+import scala.collection.JavaConverters._
+
+import io.fabric8.kubernetes.api.model.Container
+
+private[spark] object SecretEnvUtils {
+
+  def containerHasEnvVar(container: Container, envVarName: String): Boolean = {

--- End diff --

Ok, will fix. Guys, I have one thing that worries me: YAML allows passing even reserved characters, so if the name of the secret had a character like `:`, we can't handle it right now. In YAML that would be escaped, or we could use double quotes. Here we don't have that logic, correct? Is it OK not to have the same level of expressiveness? At the end of the day YAML seems close to k8s, and I was wondering why we don't just specify all properties for the pods (driver & executor) in YAML format, read it, and set everything via that method.
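The diff truncates the helper's body; a plausible completion, using only the fabric8 model API, would be:

```scala
import scala.collection.JavaConverters._

import io.fabric8.kubernetes.api.model.Container

// Illustrative guess at the truncated test helper: true iff the container
// declares an env var with the given name.
def containerHasEnvVar(container: Container, envVarName: String): Boolean =
  container.getEnv.asScala.exists(envVar => envVar.getName == envVarName)
```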
[GitHub] spark pull request #21434: [SPARK-24331][SparkR][SQL] Adding arrays_overlap,...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/21434#discussion_r191090953

--- Diff: R/pkg/R/functions.R ---
@@ -3062,6 +3077,21 @@ setMethod("array_sort",
             column(jc)
           })
 
+#' @details
+#' \code{arrays_overlap}: Returns true if the input arrays have at least one non-null element in
+#' common. If not and both arrays are non-empty and any of them contains a null, it returns null.
+#' It returns false otherwise.
+#'
+#' @rdname column_collection_functions
+#' @aliases arrays_overlap arrays_overlap,Column-method
+#' @note arrays_overlap since 2.4.0
+setMethod("arrays_overlap",
+          signature(y = "Column", x = "Column"),

--- End diff --

right, I don't know why they were (y, x) either. For some (one?) it was to match existing parameter names (like `atan2`), and then it stuck. I think we should, first, name the first column `x`, and second, stay close to the parameter names in Scala for everything else.
[GitHub] spark pull request #21434: [SPARK-24331][SparkR][SQL] Adding arrays_overlap,...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/21434#discussion_r191090821

--- Diff: R/pkg/R/functions.R ---
@@ -207,7 +208,7 @@ NULL
 #' tmp <- mutate(df, v1 = create_array(df$mpg, df$cyl, df$hp))
 #' head(select(tmp, array_contains(tmp$v1, 21), size(tmp$v1)))
 #' head(select(tmp, array_max(tmp$v1), array_min(tmp$v1)))
-#' head(select(tmp, array_position(tmp$v1, 21), array_sort(tmp$v1)))
+#' head(select(tmp, array_position(tmp$v1, 21), array_repeat(21, 5L), array_sort(tmp$v1)))

--- End diff --

also for `5L`, `5` should be ok and more clear as well
[GitHub] spark pull request #21434: [SPARK-24331][SparkR][SQL] Adding arrays_overlap,...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/21434#discussion_r191090863

--- Diff: R/pkg/R/functions.R ---
@@ -3048,6 +3048,26 @@ setMethod("array_position",
             column(jc)
           })
 
+#' @details
+#' \code{array_repeat}: Creates an array containing the left argument repeated the number of times
+#' given by the right argument.
+#'
+#' @param count Column or constant determining the number of repetitions.
+#' @rdname column_collection_functions
+#' @aliases array_repeat array_repeat,Column,numericOrColumn-method
+#' @note array_repeat since 2.4.0
+setMethod("array_repeat",
+          signature(x = "Column", count = "numericOrColumn"),
+          function(x, count) {
+            if (class(count) == "Column") {
+              count <- count@jc
+            } else {
+              count <- as.integer(count)

--- End diff --

indent is 2 spaces actually, could you update this and line L3063?
[GitHub] spark pull request #21434: [SPARK-24331][SparkR][SQL] Adding arrays_overlap,...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/21434#discussion_r191090619

--- Diff: R/pkg/R/functions.R ---
@@ -207,7 +208,7 @@ NULL
 #' tmp <- mutate(df, v1 = create_array(df$mpg, df$cyl, df$hp))
 #' head(select(tmp, array_contains(tmp$v1, 21), size(tmp$v1)))
 #' head(select(tmp, array_max(tmp$v1), array_min(tmp$v1)))
-#' head(select(tmp, array_position(tmp$v1, 21), array_sort(tmp$v1)))
+#' head(select(tmp, array_position(tmp$v1, 21), array_repeat(21, 5L), array_sort(tmp$v1)))

--- End diff --

this example is a bit unusual? do you intend for the first param to be `21` the constant? (also, does that work?)
[GitHub] spark pull request #21434: [SPARK-24331][SparkR][SQL] Adding arrays_overlap,...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/21434#discussion_r191090696

--- Diff: R/pkg/R/functions.R ---
@@ -3048,6 +3048,26 @@ setMethod("array_position",
             column(jc)
           })
 
+#' @details
+#' \code{array_repeat}: Creates an array containing the left argument repeated the number of times
+#' given by the right argument.
+#'
+#' @param count Column or constant determining the number of repetitions.

--- End diff --

change to `@param count a Column or constant determining the number of repetitions.`
[GitHub] spark pull request #21434: [SPARK-24331][SparkR][SQL] Adding arrays_overlap,...
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/21434#discussion_r191090725

--- Diff: R/pkg/R/functions.R ---
@@ -3048,6 +3048,26 @@ setMethod("array_position",
             column(jc)
           })
 
+#' @details
+#' \code{array_repeat}: Creates an array containing the left argument repeated the number of times
+#' given by the right argument.

--- End diff --

let's change this to

```
#' \code{array_repeat}: Creates an array containing \code{x} repeated the number of times
#' given by \code{count}.
```
[GitHub] spark issue #21416: [SPARK-24371] [SQL] Added isInCollection in DataFrame AP...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/21416 @cloud-fan Let me know if the new API looks good to you. Thanks.
[GitHub] spark issue #21370: [SPARK-24215][PySpark] Implement _repr_html_ for datafra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21370 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91206/ Test PASSed.
[GitHub] spark issue #21370: [SPARK-24215][PySpark] Implement _repr_html_ for datafra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21370 Merged build finished. Test PASSed.
[GitHub] spark issue #21370: [SPARK-24215][PySpark] Implement _repr_html_ for datafra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21370 **[Test build #91206 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91206/testReport)** for PR 21370 at commit [`425bee1`](https://github.com/apache/spark/commit/425bee1628917859b58dc87faccb7bc6146b7f1f).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #21424: [SPARK-24379] BroadcastExchangeExec should catch SparkOu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21424 Merged build finished. Test PASSed.
[GitHub] spark issue #21424: [SPARK-24379] BroadcastExchangeExec should catch SparkOu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21424 Merged build finished. Test PASSed.
[GitHub] spark issue #21424: [SPARK-24379] BroadcastExchangeExec should catch SparkOu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21424 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91203/ Test PASSed.
[GitHub] spark issue #21424: [SPARK-24379] BroadcastExchangeExec should catch SparkOu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21424 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91204/ Test PASSed.
[GitHub] spark issue #21424: [SPARK-24379] BroadcastExchangeExec should catch SparkOu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21424 **[Test build #91203 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91203/testReport)** for PR 21424 at commit [`8f224fb`](https://github.com/apache/spark/commit/8f224fbedda0d2e126810918c99291eb395d6bab).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #21424: [SPARK-24379] BroadcastExchangeExec should catch SparkOu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21424 **[Test build #91204 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91204/testReport)** for PR 21424 at commit [`3a9669c`](https://github.com/apache/spark/commit/3a9669cff3486aa98cf0a49e69cc0e08d927affd).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #16478: [SPARK-7768][SQL] Revise user defined types (UDT)
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16478 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91202/ Test PASSed.
[GitHub] spark issue #16478: [SPARK-7768][SQL] Revise user defined types (UDT)
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16478 Merged build finished. Test PASSed.
[GitHub] spark issue #16478: [SPARK-7768][SQL] Revise user defined types (UDT)
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16478 **[Test build #91202 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91202/testReport)** for PR 16478 at commit [`ae00de1`](https://github.com/apache/spark/commit/ae00de13dd779a2a09b142c54a2fcc144d7f8c23).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #21397: [SPARK-24334] Fix race condition in ArrowPythonRunner ca...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21397 Merged build finished. Test PASSed.
[GitHub] spark issue #21397: [SPARK-24334] Fix race condition in ArrowPythonRunner ca...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21397 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91201/ Test PASSed.
[GitHub] spark issue #21397: [SPARK-24334] Fix race condition in ArrowPythonRunner ca...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21397 **[Test build #91201 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91201/testReport)** for PR 21397 at commit [`756a73a`](https://github.com/apache/spark/commit/756a73aea843e8d5d90994d127c0d9d4c357c67b).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #21397: [SPARK-24334] Fix race condition in ArrowPythonRu...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/21397#discussion_r191082371

--- Diff: python/pyspark/sql/tests.py ---
@@ -4931,6 +4931,30 @@ def foo3(key, pdf):
         expected4 = udf3.func((), pdf)
         self.assertPandasEqual(expected4, result4)
 
+    # Regression test for SPARK-24334
+    def test_memory_leak(self):

--- End diff --

SGTM! Moved to PR description.
[GitHub] spark issue #21397: [SPARK-24334] Fix race condition in ArrowPythonRunner ca...
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/21397 Sure! Added.
[GitHub] spark issue #21370: [SPARK-24215][PySpark] Implement _repr_html_ for datafra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21370 Merged build finished. Test PASSed.
[GitHub] spark issue #21370: [SPARK-24215][PySpark] Implement _repr_html_ for datafra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21370 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3623/ Test PASSed.
[GitHub] spark issue #21370: [SPARK-24215][PySpark] Implement _repr_html_ for datafra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21370 **[Test build #91206 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91206/testReport)** for PR 21370 at commit [`425bee1`](https://github.com/apache/spark/commit/425bee1628917859b58dc87faccb7bc6146b7f1f).
[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/21370#discussion_r191080316

--- Diff: docs/configuration.md ---
@@ -456,6 +456,29 @@ Apart from these, the following properties are also available, and may be useful
     from JVM to Python worker for every task.
   </td>
 </tr>
+<tr>
+  <td><code>spark.sql.repl.eagerEval.enabled</code></td>
+  <td>false</td>
+  <td>
+    Enable eager evaluation or not. If true and repl you're using supports eager evaluation,
+    dataframe will be ran automatically and html table will feedback the queries user have defined
+    (see <a href="https://issues.apache.org/jira/browse/SPARK-24215">SPARK-24215</a> for more details).
+  </td>
+</tr>
+<tr>
+  <td><code>spark.sql.repl.eagerEval.showRows</code></td>
+  <td>20</td>
+  <td>
+    Default number of rows in HTML table.
+  </td>
+</tr>
+<tr>
+  <td><code>spark.sql.repl.eagerEval.truncate</code></td>

--- End diff --

Yep, I just want to keep the same behavior as `dataframe.show`.

```
That's useful for console output, but not so much for notebooks.
```

Notebooks aren't afraid of too many characters within a cell, so should I just delete this?
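For reference, a sketch of setting the flags quoted above; the key names come from the docs diff under review (they may not match what merged), and the truncate value is an assumption:

```scala
// Assumes an existing SparkSession `spark`.
spark.conf.set("spark.sql.repl.eagerEval.enabled", "true")
spark.conf.set("spark.sql.repl.eagerEval.showRows", "20")
spark.conf.set("spark.sql.repl.eagerEval.truncate", "20")  // value is illustrative; default not shown in the diff
```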
[GitHub] spark issue #21370: [SPARK-24215][PySpark] Implement _repr_html_ for datafra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21370 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3622/
[GitHub] spark issue #21370: [SPARK-24215][PySpark] Implement _repr_html_ for datafra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21370 Merged build finished. Test PASSed.
[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/21370#discussion_r191080194
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -237,9 +238,13 @@ class Dataset[T] private[sql](
  * @param truncate If set to more than 0, truncates strings to `truncate` characters and
  *                 all cells will be aligned right.
  * @param vertical If set to true, prints output rows vertically (one line per column value).
+ * @param html If set to true, return output as an HTML table.
--- End diff --
@viirya @gatorsmile @rdblue Sorry for the late commit; the refactor is done in 94f3414. I spent some time testing and implementing the transformation of rows between Python and Scala.
[GitHub] spark issue #21370: [SPARK-24215][PySpark] Implement _repr_html_ for datafra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21370 **[Test build #91205 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91205/testReport)** for PR 21370 at commit [`94f3414`](https://github.com/apache/spark/commit/94f3414ebb689f4435018eab2e888e7d2974dc98). * This patch **fails Python style tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #21370: [SPARK-24215][PySpark] Implement _repr_html_ for datafra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21370 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91205/
[GitHub] spark issue #21370: [SPARK-24215][PySpark] Implement _repr_html_ for datafra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21370 Merged build finished. Test FAILed.
[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/21370#discussion_r191080082
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -358,6 +357,43 @@ class Dataset[T] private[sql](
     sb.toString()
   }
+  /**
+   * Transform the current row's strings and append them to the builder.
+   *
+   * @param row Current row of strings
+   * @param truncate If set to more than 0, truncates strings to `truncate` characters and
+   *                 all cells will be aligned right.
+   * @param colWidths The width of each column
+   * @param html If set to true, return output as an HTML table.
+   * @param head Set to true while the current row is the table head.
+   * @param sb StringBuilder for the current row.
+   */
+  private[sql] def appendRowString(
+      row: Seq[String],
+      truncate: Int,
+      colWidths: Array[Int],
+      html: Boolean,
+      head: Boolean,
+      sb: StringBuilder): Unit = {
+    val data = row.zipWithIndex.map { case (cell, i) =>
+      if (truncate > 0) {
+        StringUtils.leftPad(cell, colWidths(i))
+      } else {
+        StringUtils.rightPad(cell, colWidths(i))
+      }
+    }
+    (html, head) match {
+      case (true, true) =>
+        data.map(StringEscapeUtils.escapeHtml).addString(
+          sb, "<tr><th>", "</th>\n<th>", "</th></tr>\n")
--- End diff --
I changed the format in the Python \_repr\_html\_ in 94f3414.
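Since the thread settles on generating the HTML on the Python side instead, a standalone sketch of that approach may help. The function name and table markup below are illustrative, not the PR's final code:

```python
import html

def rows_to_html_table(header, rows):
    """Render a header row plus data rows as a small HTML table,
    escaping cell text the way escapeHtml does on the JVM side."""
    parts = ["<table border='1'>"]
    parts.append("<tr>" + "".join("<th>%s</th>" % html.escape(c) for c in header) + "</tr>")
    for row in rows:
        parts.append("<tr>" + "".join("<td>%s</td>" % html.escape(str(c)) for c in row) + "</tr>")
    parts.append("</table>")
    return "\n".join(parts)

# Cells containing markup are escaped rather than interpreted:
print(rows_to_html_table(["id", "name"], [(1, "a<b"), (2, "c&d")]))
```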
[GitHub] spark issue #21370: [SPARK-24215][PySpark] Implement _repr_html_ for datafra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21370 **[Test build #91205 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91205/testReport)** for PR 21370 at commit [`94f3414`](https://github.com/apache/spark/commit/94f3414ebb689f4435018eab2e888e7d2974dc98).
[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/21370#discussion_r191080049
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -347,13 +347,30 @@ def show(self, n=20, truncate=True, vertical=False):
     name | Bob
     """
     if isinstance(truncate, bool) and truncate:
-        print(self._jdf.showString(n, 20, vertical))
+        print(self._jdf.showString(n, 20, vertical, False))
     else:
-        print(self._jdf.showString(n, int(truncate), vertical))
+        print(self._jdf.showString(n, int(truncate), vertical, False))
--- End diff --
Fixed in 94f3414.
[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/21370#discussion_r191080066
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -347,13 +347,30 @@ def show(self, n=20, truncate=True, vertical=False):
     name | Bob
     """
     if isinstance(truncate, bool) and truncate:
-        print(self._jdf.showString(n, 20, vertical))
+        print(self._jdf.showString(n, 20, vertical, False))
     else:
-        print(self._jdf.showString(n, int(truncate), vertical))
+        print(self._jdf.showString(n, int(truncate), vertical, False))

     def __repr__(self):
         return "DataFrame[%s]" % (", ".join("%s: %s" % c for c in self.dtypes))

+    def _repr_html_(self):
+        """Returns a DataFrame with HTML code when eager evaluation is enabled
+        via 'spark.sql.repl.eagerEval.enabled'; this is only called by the REPL you're
--- End diff --
Thanks, changed to REPL in 94f3414.
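For context on the hook being named here: Jupyter and IPython look for a `_repr_html_` method on the value a cell returns, and prefer it over `__repr__` when it exists. A minimal illustration of the protocol (not Spark code):

```python
class Demo:
    def __repr__(self):
        return "Demo()"  # plain-console fallback

    def _repr_html_(self):
        # Jupyter calls this, when present, to render the cell result as HTML.
        return "<b>Demo rendered as rich HTML</b>"

# In a notebook cell, `Demo()` renders the bold HTML;
# in a plain Python shell, repr() is printed instead.
```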
[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/21370#discussion_r191080057
--- Diff: python/pyspark/sql/tests.py ---
@@ -3040,6 +3040,50 @@ def test_csv_sampling_ratio(self):
         .csv(rdd, samplingRatio=0.5).schema
     self.assertEquals(schema, StructType([StructField("_c0", IntegerType(), True)]))
+
+    def _get_content(self, content):
+        """
+        Strips leading spaces from content up to the first '|' in each line.
+        """
+        import re
+        pattern = re.compile(r'^ *\|', re.MULTILINE)
--- End diff --
Thanks! Fixed in 94f3414.
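The margin-stripping pattern in this helper behaves like Scala's `stripMargin`; a small standalone demonstration (the sample string is made up):

```python
import re

# Remove the indentation plus the leading '|' margin marker on each line,
# mirroring what the _get_content helper above does to expected-output strings.
pattern = re.compile(r'^ *\|', re.MULTILINE)

expected = """\
    |+---+
    || id|
    |+---+
    ||  1|
    |+---+
"""
print(pattern.sub('', expected))
```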
[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...
Github user xuanyuanking commented on a diff in the pull request: https://github.com/apache/spark/pull/21370#discussion_r191080044
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -347,13 +347,30 @@ def show(self, n=20, truncate=True, vertical=False):
     name | Bob
     """
     if isinstance(truncate, bool) and truncate:
-        print(self._jdf.showString(n, 20, vertical))
+        print(self._jdf.showString(n, 20, vertical, False))
--- End diff --
Thanks, fixed in 94f3414.