[GitHub] spark issue #22318: [SPARK-25150][SQL] Fix attribute deduplication in join
Github user peter-toth commented on the issue: https://github.com/apache/spark/pull/22318 @mgaido91 , 2.2 also suffered from this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22314: [SPARK-25307][SQL] ArraySort function may return an erro...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22314 @ueshin Just verified in 2.3. This problem does not exist in 2.3 because the implementation of `nullSafeCodeGen` in 2.3 differs from the one in master. However, 2.3 is missing the test cases we added in these PRs. Should we check the test cases into the branch? I am afraid that if we ever backport the PR that changed `nullSafeCodeGen`, we may introduce this bug. Please advise.
[GitHub] spark issue #22324: [SPARK-25237][SQL] Remove updateBytesReadWithFileSize in...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22324 Merged build finished. Test PASSed.
[GitHub] spark issue #22324: [SPARK-25237][SQL] Remove updateBytesReadWithFileSize in...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22324 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95645/ Test PASSed.
[GitHub] spark issue #22324: [SPARK-25237][SQL] Remove updateBytesReadWithFileSize in...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22324 **[Test build #95645 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95645/testReport)** for PR 22324 at commit [`510d729`](https://github.com/apache/spark/commit/510d729b0ed6f83b05a3b0f06c2631163d62ef1a). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class FileSourceSuite extends SharedSQLContext `
[GitHub] spark pull request #22318: [SPARK-25150][SQL] Fix attribute deduplication in...
Github user peter-toth commented on a diff in the pull request: https://github.com/apache/spark/pull/22318#discussion_r214793247 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala --- @@ -295,4 +295,14 @@ class DataFrameJoinSuite extends QueryTest with SharedSQLContext { df.join(df, df("id") <=> df("id")).queryExecution.optimizedPlan } } + + test("SPARK-25150: Attribute deduplication handles attributes in join condition properly") { +val a = spark.range(1, 5) +val b = spark.range(10) +val c = b.filter($"id" % 2 === 0) + +val r = a.join(b, a("id") === b("id"), "inner").join(c, a("id") === c("id"), "inner") --- End diff -- That simpler join doesn't hit the issue. It is handled by a different rule `ResolveNaturalAndUsingJoin`.
[GitHub] spark pull request #22313: [SPARK-25306][SQL] Avoid skewed filter trees to s...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/22313#discussion_r214787227 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala --- @@ -398,6 +398,24 @@ class FilterPushdownBenchmark extends SparkFunSuite with BenchmarkBeforeAndAfter } } } + + test(s"Pushdown benchmark with many filters") { +val numRows = 1 +val width = 500 + +withTempPath { dir => + val columns = (1 to width).map(i => s"id c$i") + val df = spark.range(1).selectExpr(columns: _*) + withTempTable("orcTable", "patquetTable") { --- End diff -- nit: a typo, `patquetTable`.
[GitHub] spark pull request #22317: [SPARK-25310][SQL] ArraysOverlap may throw a Comp...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22317
[GitHub] spark issue #22317: [SPARK-25310][SQL] ArraysOverlap may throw a Compilation...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/22317 Thanks! merging to master.
[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22313 **[Test build #95651 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95651/testReport)** for PR 22313 at commit [`5c46693`](https://github.com/apache/spark/commit/5c46693e58e0f71fe8e67dce16f4b8c783c80aa6).
[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22313 Merged build finished. Test PASSed.
[GitHub] spark issue #22317: [SPARK-25310][SQL] ArraysOverlap may throw a Compilation...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/22317 LGTM.
[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22313 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2817/ Test PASSed.
[GitHub] spark pull request #22320: [SPARK-25313][SQL]Fix regression in FileFormatWri...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/22320#discussion_r214786494 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala --- @@ -754,6 +754,47 @@ class HiveDDLSuite } } + test("Insert overwrite Hive table should output correct schema") { +withTable("tbl", "tbl2") { + withView("view1") { +spark.sql("CREATE TABLE tbl(id long)") +spark.sql("INSERT OVERWRITE TABLE tbl SELECT 4") +spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl") +spark.sql("CREATE TABLE tbl2(ID long)") +spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1") +checkAnswer(spark.table("tbl2"), Seq(Row(4))) --- End diff -- Please add a schema assertion. We can read data since [SPARK-25132](https://issues.apache.org/jira/browse/SPARK-25132).
[GitHub] spark issue #22314: [SPARK-25307][SQL] ArraySort function may return an erro...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22314 @ueshin Sure.
[GitHub] spark issue #22314: [SPARK-25307][SQL] ArraySort function may return an erro...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/22314 @dilipbiswal Do we need to backport this to 2.3? If so, could you submit a backport PR to branch-2.3 please? Thanks!
[GitHub] spark issue #22315: [SPARK-25308][SQL] ArrayContains function may return a e...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/22315 @dilipbiswal Do we need to backport this to 2.3? If so, could you submit a backport PR to branch-2.3 please? Thanks!
[GitHub] spark pull request #22219: [SPARK-25224][SQL] Improvement of Spark SQL Thrif...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/22219#discussion_r214785788 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -3237,6 +3238,28 @@ class Dataset[T] private[sql]( files.toSet.toArray } + /** + * Returns the tuple of the row count and an SeqView that contains all rows in this Dataset. + * + * The SeqView will consume as much memory as the total size of serialized results which can be + * limited with the config 'spark.driver.maxResultSize'. Rows are deserialized when iterating rows + * with iterator of returned SeqView. Whether to collect all deserialized rows or to iterate them + * incrementally can be decided with considering total rows count and driver memory. + */ + private[sql] def collectCountAndSeqView(): (Long, SeqView[T, Array[T]]) = +withAction("collectCountAndSeqView", queryExecution) { plan => + // This projection writes output to a `InternalRow`, which means applying this projection is + // not thread-safe. Here we create the projection inside this method to make `Dataset` + // thread-safe. + val objProj = GenerateSafeProjection.generate(deserializer :: Nil) + val (totalRowCount, internalRowsView) = plan.executeCollectSeqView() + (totalRowCount, internalRowsView.map { row => +// The row returned by SafeProjection is `SpecificInternalRow`, which ignore the data type +// parameter of its `get` method, so it's safe to use null here. +objProj(row).get(0, null).asInstanceOf[T] + }.asInstanceOf[SeqView[T, Array[T]]]) +} --- End diff -- If this is a thriftserver-specific issue, can we do the same thing by fixing code only in the thriftserver package? IMHO we should avoid modifying code in the sql package as much as possible.
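The doc comment in the diff above says rows are deserialized only while the returned view is iterated. A minimal sketch of that lazy behaviour with plain Scala collections follows. It does not use Spark: `LazyViewSketch`, `deserialize`, and `countAndView` are hypothetical names, and parsing a `String` into an `Int` merely stands in for row deserialization.

```scala
object LazyViewSketch {
  // Counts how many times the stand-in deserializer has run.
  var deserialized = 0

  // Hypothetical stand-in for row deserialization: parse a serialized Int.
  def deserialize(raw: String): Int = {
    deserialized += 1
    raw.toInt
  }

  // Returns the row count together with a lazy view over the rows.
  // Building the view does not call deserialize; iterating it does.
  def countAndView(rawRows: Vector[String]): (Long, Iterable[Int]) =
    (rawRows.length.toLong, rawRows.view.map(deserialize))
}
```

The caller can inspect the count first and then decide whether to force the whole view or iterate it incrementally, which is the trade-off the doc comment describes.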
[GitHub] spark issue #22319: [SPARK-25044][SQL][followup] add back UserDefinedFunctio...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22319 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95644/ Test PASSed.
[GitHub] spark issue #22319: [SPARK-25044][SQL][followup] add back UserDefinedFunctio...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22319 Merged build finished. Test PASSed.
[GitHub] spark pull request #22219: [SPARK-25224][SQL] Improvement of Spark SQL Thrif...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/22219#discussion_r214785499 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -641,6 +641,16 @@ object SQLConf { .intConf .createWithDefault(200) + val THRIFTSERVER_BATCH_DESERIALIZE_LIMIT = +buildConf("spark.sql.thriftServer.batchDeserializeLimit") + .doc("The maximum number of result rows that can be deserialized at one time. " + +"If the number of result rows exceeds this value, the Thrift Server will only use " + +"'memory of serialized rows' + 'memory of the deserialized rows being fetched to the " + +"client'. Only valid if spark.sql.thriftServer.incrementalCollect is false. " + --- End diff -- nit: `s"client'. Only valid if ${THRIFTSERVER_INCREMENTAL_COLLECT.key} is false. " +`
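The nit above asks for the related config key to be interpolated into the doc string rather than hard-coded. A minimal sketch of that pattern follows, assuming a simplified, hypothetical `ConfEntry` case class instead of Spark's `SQLConf` builder; only the two key strings are taken from the diff, and the doc text is abbreviated.

```scala
object ConfDocSketch {
  // Hypothetical, simplified stand-in for a config entry; not Spark's ConfigEntry.
  final case class ConfEntry(key: String, doc: String)

  val INCREMENTAL_COLLECT: ConfEntry = ConfEntry(
    "spark.sql.thriftServer.incrementalCollect",
    "Placeholder doc for this sketch.")

  // Interpolating the key keeps this doc string in sync if the other key is renamed.
  val BATCH_DESERIALIZE_LIMIT: ConfEntry = ConfEntry(
    "spark.sql.thriftServer.batchDeserializeLimit",
    "The maximum number of result rows that can be deserialized at one time. " +
      s"Only valid if ${INCREMENTAL_COLLECT.key} is false.")
}
```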
[GitHub] spark issue #22319: [SPARK-25044][SQL][followup] add back UserDefinedFunctio...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22319 **[Test build #95644 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95644/testReport)** for PR 22319 at commit [`4791240`](https://github.com/apache/spark/commit/4791240d08c75d5df23332d0059a4b15197d289f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22314: [SPARK-25307][SQL] ArraySort function may return an erro...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22314 @ueshin @kiszk @maropu Thanks a lot.
[GitHub] spark pull request #22314: [SPARK-25307][SQL] ArraySort function may return ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22314
[GitHub] spark issue #22315: [SPARK-25308][SQL] ArrayContains function may return a e...
Github user dilipbiswal commented on the issue: https://github.com/apache/spark/pull/22315 @gatorsmile Sure, I will check and add them.
[GitHub] spark issue #22314: [SPARK-25307][SQL] ArraySort function may return an erro...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/22314 Thanks! merging to master.
[GitHub] spark issue #22315: [SPARK-25308][SQL] ArrayContains function may return a e...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22315 @dilipbiswal Could we also add the test cases for the other higher-order functions, if missing?
[GitHub] spark pull request #22315: [SPARK-25308][SQL] ArrayContains function may ret...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22315
[GitHub] spark pull request #22319: [SPARK-25044][SQL][followup] add back UserDefined...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/22319#discussion_r214784141 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala --- @@ -41,12 +41,16 @@ import org.apache.spark.sql.types.DataType case class UserDefinedFunction protected[sql] ( f: AnyRef, dataType: DataType, -inputTypes: Option[Seq[ScalaReflection.Schema]]) { +inputTypes: Option[Seq[DataType]]) { --- End diff -- +1. This is why we added _nameOption, _nullable and _deterministic in the 2.3 release. Please also remove the changes to MimaExcludes.scala made in https://github.com/apache/spark/pull/22259
[GitHub] spark issue #22315: [SPARK-25308][SQL] ArrayContains function may return a e...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/22315 Thanks! merging to master.
[GitHub] spark issue #22315: [SPARK-25308][SQL] ArrayContains function may return a e...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/22315 LGTM.
[GitHub] spark issue #22324: [SPARK-25237][SQL] Remove updateBytesReadWithFileSize in...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22324 **[Test build #95650 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95650/testReport)** for PR 22324 at commit [`bc05a35`](https://github.com/apache/spark/commit/bc05a354e375dfb1df6a70a46f28b792f8567fc5).
[GitHub] spark issue #22324: [SPARK-25237][SQL] Remove updateBytesReadWithFileSize in...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22324 Merged build finished. Test PASSed.
[GitHub] spark issue #22324: [SPARK-25237][SQL] Remove updateBytesReadWithFileSize in...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22324 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2816/ Test PASSed.
[GitHub] spark pull request #22324: [SPARK-25237][SQL] Remove updateBytesReadWithFile...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/22324#discussion_r214783002 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceSuite.scala --- @@ -0,0 +1,48 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import scala.collection.mutable.ArrayBuffer + +import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd} +import org.apache.spark.sql.test.SharedSQLContext + + +class FileSourceSuite extends SharedSQLContext { + + test("SPARK-25237 compute correct input metrics in FileScanRDD") { --- End diff -- ok
[GitHub] spark issue #22324: [SPARK-25237][SQL] Remove updateBytesReadWithFileSize in...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/22324 oh, I see.
[GitHub] spark issue #22306: [SPARK-25300][CORE]Unified the configuration parameter `...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22306 LGTM
[GitHub] spark issue #22321: [DOC] Update some outdated links
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/22321 LGTM
[GitHub] spark issue #22320: [SPARK-25313][SQL]Fix regression in FileFormatWriter out...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22320 **[Test build #95649 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95649/testReport)** for PR 22320 at commit [`538fea9`](https://github.com/apache/spark/commit/538fea99ed2158316d89f64ce397c4791fbed1f3).
[GitHub] spark issue #22320: [SPARK-25313][SQL]Fix regression in FileFormatWriter out...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22320 Merged build finished. Test PASSed.
[GitHub] spark issue #22320: [SPARK-25313][SQL]Fix regression in FileFormatWriter out...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22320 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2815/ Test PASSed.
[GitHub] spark issue #22179: [SPARK-23131][BUILD] Upgrade Kryo to 4.0.2
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22179 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95643/ Test PASSed.
[GitHub] spark issue #22179: [SPARK-23131][BUILD] Upgrade Kryo to 4.0.2
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22179 Merged build finished. Test PASSed.
[GitHub] spark issue #22179: [SPARK-23131][BUILD] Upgrade Kryo to 4.0.2
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22179 **[Test build #95643 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95643/testReport)** for PR 22179 at commit [`f2fb28d`](https://github.com/apache/spark/commit/f2fb28da3eb272651530b77dbd4ea33511f0727d). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class MapHolder `
[GitHub] spark pull request #22313: [SPARK-25306][SQL] Avoid skewed filter trees to s...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/22313#discussion_r214778954 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala --- @@ -71,12 +71,24 @@ private[orc] object OrcFilters { for { // Combines all convertible filters using `And` to produce a single conjunction - conjunction <- convertibleFilters.reduceOption(org.apache.spark.sql.sources.And) + conjunction <- buildTree(convertibleFilters) --- End diff -- BTW, Parquet has another issue here due to `.reduceOption(FilterApi.and)`. When I ran a benchmark, Parquet seemed unable to handle 1000 filters, @cloud-fan .
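The diff above swaps a `reduceOption(And)` chain for a `buildTree` call. The sketch below illustrates why that matters: reducing with `And` yields a left-deep tree whose depth grows linearly with the number of filters, while splitting the sequence in half yields a tree of logarithmic depth. It assumes a hypothetical `Filter` ADT (`Leaf`, `And`) rather than Spark's `org.apache.spark.sql.sources` classes, and the `buildTree` in the PR may be implemented differently.

```scala
object FilterTreeSketch {
  // Hypothetical stand-ins for data source filters; not Spark's classes.
  sealed trait Filter
  final case class Leaf(name: String) extends Filter
  final case class And(left: Filter, right: Filter) extends Filter

  // Left-deep chain, as reduceOption(And) produces: depth grows linearly.
  def skewed(filters: Seq[Filter]): Option[Filter] =
    filters.reduceOption[Filter]((l, r) => And(l, r))

  // Balanced tree: split in half and recurse, so depth grows logarithmically.
  def balanced(filters: IndexedSeq[Filter]): Option[Filter] =
    filters.length match {
      case 0 => None
      case 1 => Some(filters.head)
      case n =>
        val (l, r) = filters.splitAt(n / 2)
        // Both halves are non-empty when n >= 2, so .get is safe here.
        Some(And(balanced(l).get, balanced(r).get))
    }

  // Height of the tree, counting a single leaf as depth 1.
  def depth(f: Filter): Int = f match {
    case Leaf(_)   => 1
    case And(l, r) => 1 + math.max(depth(l), depth(r))
  }
}
```

With 8 leaves the skewed tree has depth 8 and the balanced one depth 4; any consumer that recurses over the tree benefits from the shallower shape.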
[GitHub] spark pull request #22320: [SPARK-25313][SQL]Fix regression in FileFormatWri...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/22320#discussion_r214778690 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala --- @@ -805,6 +805,80 @@ class DataFrameReaderWriterSuite extends QueryTest with SharedSQLContext with Be } } + test("Insert overwrite table command should output correct schema: basic") { +withTable("tbl", "tbl2") { + withView("view1") { +val df = spark.range(10).toDF("id") +df.write.format("parquet").saveAsTable("tbl") +spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl") +spark.sql("CREATE TABLE tbl2(ID long) USING parquet") +spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1") +val identifier = TableIdentifier("tbl2", Some("default")) +val location = spark.sessionState.catalog.getTableMetadata(identifier).location.toString +val expectedSchema = StructType(Seq(StructField("ID", LongType, true))) +assert(spark.read.parquet(location).schema == expectedSchema) +checkAnswer(spark.table("tbl2"), df) + } +} + } + + test("Insert overwrite table command should output correct schema: complex") { +withTable("tbl", "tbl2") { + withView("view1") { +val df = spark.range(10).map(x => (x, x.toInt, x.toInt)).toDF("col1", "col2", "col3") +df.write.format("parquet").saveAsTable("tbl") +spark.sql("CREATE VIEW view1 AS SELECT * FROM tbl") +spark.sql("CREATE TABLE tbl2(COL1 long, COL2 int, COL3 int) USING parquet PARTITIONED " + + "BY (COL2) CLUSTERED BY (COL3) INTO 3 BUCKETS") +spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT COL1, COL2, COL3 FROM view1") +val identifier = TableIdentifier("tbl2", Some("default")) +val location = spark.sessionState.catalog.getTableMetadata(identifier).location.toString +val expectedSchema = StructType(Seq( + StructField("COL1", LongType, true), --- End diff -- Keeping it should be OK.
[GitHub] spark pull request #22320: [SPARK-25313][SQL]Fix regression in FileFormatWri...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/22320#discussion_r214778523 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala --- @@ -805,6 +805,80 @@ class DataFrameReaderWriterSuite extends QueryTest with SharedSQLContext with Be } } + test("Insert overwrite table command should output correct schema: basic") { +withTable("tbl", "tbl2") { + withView("view1") { +val df = spark.range(10).toDF("id") --- End diff -- This is trivial... As the column name `id` is case-sensitive and used below, I would like to show it explicitly.
[GitHub] spark pull request #22313: [SPARK-25306][SQL] Avoid skewed filter trees to s...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/22313#discussion_r214778262 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala --- @@ -383,4 +386,13 @@ class OrcFilterSuite extends OrcTest with SharedSQLContext { )).get.toString } } + + test("SPARK-25306 createFilter should not hang") { +import org.apache.spark.sql.sources._ +val schema = new StructType(Array(StructField("a", IntegerType, nullable = true))) +val filters = (1 to 2000).map(LessThan("a", _)).toArray[Filter] +failAfter(2 seconds) { + OrcFilters.createFilter(schema, filters) --- End diff -- I'll choose (2), @cloud-fan .
[GitHub] spark pull request #22048: [SPARK-25108][SQL] Fix the show method to display...
Github user xuejianbest commented on a diff in the pull request: https://github.com/apache/spark/pull/22048#discussion_r214778257 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -2794,6 +2794,30 @@ private[spark] object Utils extends Logging { } } } + + /** + * Regular expression matching full width characters + */ + private val fullWidthRegex = ("""[""" + +// scalastyle:off nonascii +"""\u1100-\u115F""" + +"""\u2E80-\uA4CF""" + +"""\uAC00-\uD7A3""" + +"""\uF900-\uFAFF""" + +"""\uFE10-\uFE19""" + +"""\uFE30-\uFE6F""" + +"""\uFF00-\uFF60""" + +"""\uFFE0-\uFFE6""" + --- End diff --
> Can you describe them there and put a reference to a public Unicode document?
This is a regular expression match using Unicode, regardless of the specific encoding. For example, the following string is built from GBK-encoded bytes instead of UTF-8, and the match still works:
`
val bytes = Array[Byte](0xd6.toByte, 0xd0.toByte, 0xB9.toByte, 0xFA.toByte)
val s1 = new String(bytes, "gbk")
println(s1) // 中国
val fullWidthRegex = ("""[""" +
  // scalastyle:off nonascii
  """\u1100-\u115F""" +
  """\u2E80-\uA4CF""" +
  """\uAC00-\uD7A3""" +
  """\uF900-\uFAFF""" +
  """\uFE10-\uFE19""" +
  """\uFE30-\uFE6F""" +
  """\uFF00-\uFF60""" +
  """\uFFE0-\uFFE6""" +
  // scalastyle:on nonascii
  """]""").r
println(fullWidthRegex.findAllIn(s1).size) // 2
`
This regular expression was obtained experimentally under a specific font. I don't understand what you are going to do.
> How about some additional overheads when calling showString as compared to showString w/o this patch?
I tested a Dataset consisting of 100 rows. Each row has two columns: one is the index (0-99) and the other is a random string of 100 characters. I then called showString on each version separately. The original showString method (w/o this patch) took about 42ms and the patched version took about 46ms, so performance was about 10% worse.
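The character ranges quoted in the comment above can be exercised without Spark. The sketch below counts full-width characters and derives a display width from that count. `FullWidthSketch` and `displayWidth` are hypothetical names, and the rule that a full-width character occupies two columns is an assumption made for illustration, not something stated in the thread.

```scala
object FullWidthSketch {
  // Ranges copied from the comment above. The double backslashes pass the
  // \uXXXX escapes through to the regex engine unchanged.
  private val fullWidthRegex = ("[" +
    "\\u1100-\\u115F" +
    "\\u2E80-\\uA4CF" +
    "\\uAC00-\\uD7A3" +
    "\\uF900-\\uFAFF" +
    "\\uFE10-\\uFE19" +
    "\\uFE30-\\uFE6F" +
    "\\uFF00-\\uFF60" +
    "\\uFFE0-\\uFFE6" +
    "]").r

  // Number of characters in s that fall inside the full-width ranges.
  def fullWidthCount(s: String): Int = fullWidthRegex.findAllIn(s).size

  // Hypothetical helper: one column per UTF-16 code unit, plus one extra
  // column for every full-width character (assumed to be two columns wide).
  def displayWidth(s: String): Int = s.length + fullWidthCount(s)
}
```

A table renderer could pad each cell by `displayWidth` instead of `length` to keep columns aligned when cells mix half-width and full-width text.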
[GitHub] spark issue #22325: [SPARK-25318]. Add exception handling when wrapping the ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22325 Can one of the admins verify this patch?
[GitHub] spark pull request #22325: [SPARK-25318]. Add exception handling when wrappi...
GitHub user rezasafi opened a pull request: https://github.com/apache/spark/pull/22325 [SPARK-25318]. Add exception handling when wrapping the input stream during the fetch or stage retry in response to a corrupted block SPARK-4105 provided a solution to the block corruption issue by retrying the fetch or the stage. In that solution there is a step that wraps the input stream with compression and/or encryption. This step is prone to exceptions, but the current code has no exception handling for it, which has caused confusion for users. This change adds exception handling for the wrapping step and also adds a fetch retry if we experience a corruption during the wrapping step. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rezasafi/spark localcorruption Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22325.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22325 commit cc1c4cdf2bd3b77326f831212c64ede338c807b1 Author: Reza Safi Date: 2018-09-04T03:06:33Z [SPARK-25318]. Add exception handling when wrapping the input stream during the fetch or stage retry in response to a corrupted block
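The description above notes that the wrapping step itself can throw. A minimal sketch of that failure mode and a retry around it follows. It uses `java.util.zip.GZIPInputStream` as a stand-in for Spark's compression and encryption wrappers; `WrapRetrySketch`, `Fetch`, `wrapWithRetry`, and `maxRetries` are hypothetical names, and this is not the PR's code.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, IOException, InputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

object WrapRetrySketch {
  // Hypothetical fetch function: attempt number => raw block bytes.
  type Fetch = Int => Array[Byte]

  // Wraps the fetched bytes in a decompressing stream. The GZIPInputStream
  // constructor reads the header, so a corrupted block can throw right here,
  // before any data is consumed. On IOException the block is refetched, up to
  // maxRetries extra attempts; the last failure is returned as a Left.
  def wrapWithRetry(fetch: Fetch, maxRetries: Int): Either[IOException, InputStream] = {
    @annotation.tailrec
    def loop(attempt: Int): Either[IOException, InputStream] = {
      val result: Either[IOException, InputStream] =
        try Right(new GZIPInputStream(new ByteArrayInputStream(fetch(attempt))))
        catch { case e: IOException => Left(e) }
      result match {
        case Left(_) if attempt < maxRetries => loop(attempt + 1)
        case other                           => other
      }
    }
    loop(0)
  }

  // Helper for the sketch: gzip-compress a byte array.
  def gzip(data: Array[Byte]): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    val gz = new GZIPOutputStream(out)
    gz.write(data)
    gz.close()
    out.toByteArray
  }
}
```

Returning an `Either` keeps the failure explicit for the caller, which can then decide between failing the task and escalating to a stage retry.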
[GitHub] spark issue #22321: [DOC] Update the 'Specifying the Hadoop Version' link in...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22321 Mind fixing the PR title as well since we fix other broken links too. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22240: [SPARK-25248] [CORE] Audit barrier Scala APIs for 2.4
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22240 **[Test build #95648 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95648/testReport)** for PR 22240 at commit [`9b6a47b`](https://github.com/apache/spark/commit/9b6a47bf718309eb0b5a22a0282a5a7c4226e991). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22321: [DOC] Update the 'Specifying the Hadoop Version' link in...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22321 **[Test build #95647 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95647/testReport)** for PR 22321 at commit [`d9bbf3c`](https://github.com/apache/spark/commit/d9bbf3c4a7be82d66eb643a42c2724cd30ea1ad5). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22240: [SPARK-25248] [CORE] Audit barrier Scala APIs for 2.4
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22240 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22240: [SPARK-25248] [CORE] Audit barrier Scala APIs for 2.4
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22240 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2814/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22324: [SPARK-25237][SQL] Remove updateBytesReadWithFile...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22324#discussion_r214776872 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceSuite.scala --- @@ -0,0 +1,48 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.datasources + +import scala.collection.mutable.ArrayBuffer + +import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd} +import org.apache.spark.sql.test.SharedSQLContext + + +class FileSourceSuite extends SharedSQLContext { + + test("SPARK-25237 compute correct input metrics in FileScanRDD") { --- End diff -- Shall we move this suite into `FileBasedDataSourceSuite`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22324: [SPARK-25237][SQL] Remove updateBytesReadWithFileSize in...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22324 we can credit to multiple people now though :-) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22313: [SPARK-25306][SQL] Avoid skewed filter trees to s...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/22313#discussion_r214775155 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala --- @@ -71,12 +71,24 @@ private[orc] object OrcFilters { for { // Combines all convertible filters using `And` to produce a single conjunction - conjunction <- convertibleFilters.reduceOption(org.apache.spark.sql.sources.And) + conjunction <- buildTree(convertibleFilters) --- End diff --
For the first question, I don't think Parquet has the same issue because Parquet uses `canMakeFilterOn` while ORC is trying to build a full result (with a fresh builder) to check if it's okay or not.
For the second question:
1. In ORC, we already do the first half (`flatMap`) to compute `convertibleFilters`, though it could be changed to use `filters.filter`.
```scala
val convertibleFilters = for {
  filter <- filters
  _ <- buildSearchArgument(dataTypeMap, filter, SearchArgumentFactory.newBuilder())
} yield filter
```
2. The second half, `reduceOption(FilterApi.and)`, was the original ORC code, which generated a skewed tree having exponential time complexity. We need to use `buildTree`.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
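To illustrate the difference between a skewed and a balanced conjunction discussed in the comment above, here is a minimal, self-contained sketch. It is not the `buildTree` from the PR: `Filter`, `LessThan`, and `And` below are simplified stand-ins for the classes in `org.apache.spark.sql.sources`, and `FilterTrees`, `leftDeep`, and `depth` are hypothetical helper names.

```scala
// Simplified stand-ins for org.apache.spark.sql.sources.{Filter, LessThan, And}.
sealed trait Filter
final case class LessThan(attribute: String, value: Int) extends Filter
final case class And(left: Filter, right: Filter) extends Filter

object FilterTrees {
  // Skewed (left-deep) conjunction: each And wraps everything built so far,
  // so the depth grows linearly with the number of filters.
  def leftDeep(filters: Seq[Filter]): Option[Filter] =
    filters.reduceOption[Filter](And(_, _))

  // Balanced conjunction: split the filters in half and recurse, so the
  // depth grows logarithmically with the number of filters.
  def buildTree(filters: Seq[Filter]): Option[Filter] = filters match {
    case Seq() => None
    case Seq(single) => Some(single)
    case _ =>
      val (left, right) = filters.splitAt(filters.length / 2)
      for (l <- buildTree(left); r <- buildTree(right)) yield And(l, r)
  }

  // Number of And nodes on the longest path from the root to a leaf.
  def depth(filter: Filter): Int = filter match {
    case And(l, r) => 1 + math.max(depth(l), depth(r))
    case _ => 0
  }
}
```

With the 2000 `LessThan` filters used in the PR's test, `leftDeep` yields a chain of nested `And` nodes, while the balanced version stays shallow. Any code that re-traverses the tree at each level benefits from the smaller depth.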
[GitHub] spark issue #22240: [SPARK-25248] [CORE] Audit barrier Scala APIs for 2.4
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22240 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95642/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22240: [SPARK-25248] [CORE] Audit barrier Scala APIs for 2.4
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22240 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22240: [SPARK-25248] [CORE] Audit barrier Scala APIs for 2.4
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22240 **[Test build #95642 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95642/testReport)** for PR 22240 at commit [`c61eec3`](https://github.com/apache/spark/commit/c61eec363f78d586070c673e44e9120eb10b83b5). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22179: [SPARK-23131][BUILD] Upgrade Kryo to 4.0.2
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22179 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22179: [SPARK-23131][BUILD] Upgrade Kryo to 4.0.2
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22179 **[Test build #95646 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95646/testReport)** for PR 22179 at commit [`0d78113`](https://github.com/apache/spark/commit/0d7811348e5746e1a7e1ce887d47ae4ba413c014). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22179: [SPARK-23131][BUILD] Upgrade Kryo to 4.0.2
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22179 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2813/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22324: [SPARK-25237][SQL] Remove updateBytesReadWithFileSize in...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22324 **[Test build #95645 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95645/testReport)** for PR 22324 at commit [`510d729`](https://github.com/apache/spark/commit/510d729b0ed6f83b05a3b0f06c2631163d62ef1a). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22324: [SPARK-25237][SQL] Remove updateBytesReadWithFileSize in...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22324 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22324: [SPARK-25237][SQL] Remove updateBytesReadWithFileSize in...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/22324 @srowen I reworked this because the original author is inactive; can you check? (BTW, it's OK for the credit of this commit to go to the original author.) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22324: [SPARK-25237][SQL] Remove updateBytesReadWithFileSize in...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22324 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2812/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22324: [SPARK-25237][SQL] Remove updateBytesReadWithFile...
GitHub user maropu opened a pull request: https://github.com/apache/spark/pull/22324 [SPARK-25237][SQL] Remove updateBytesReadWithFileSize in FileScanRDD ## What changes were proposed in this pull request? This PR removes the method `updateBytesReadWithFileSize` from `FileScanRDD` because it computes input metrics from the file size, an approach meant for Hadoop 2.5 and earlier. Current Spark does not support those versions, so the method causes wrong input metric numbers. This is a rework of #22232. Closes #22232 ## How was this patch tested? Added `FileSourceSuite` to test this case. You can merge this pull request into a Git repository by running: $ git pull https://github.com/maropu/spark pr22232-2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22324.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22324 commit 0f75257b50a611e069d406da8d72225bb4e73b51 Author: dujunling Date: 2018-08-25T06:20:35Z remove updateBytesReadWithFileSize because we use Hadoop FileSystem statistics to update the inputMetrics commit 53dd42c1facebf97044afb22b1f0894ec209f3bb Author: dujunling Date: 2018-08-27T03:26:30Z add ut commit 1c326466fbd24c432184be6e53afec93369970c1 Author: dujunling Date: 2018-08-27T03:33:46Z ut commit 510d729b0ed6f83b05a3b0f06c2631163d62ef1a Author: Takeshi Yamamuro Date: 2018-09-04T01:47:59Z fix --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22313: [SPARK-25306][SQL] Avoid skewed filter trees to s...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/22313#discussion_r214769029 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala --- @@ -383,4 +386,13 @@ class OrcFilterSuite extends OrcTest with SharedSQLContext { )).get.toString } } + + test("SPARK-25306 createFilter should not hang") { +import org.apache.spark.sql.sources._ +val schema = new StructType(Array(StructField("a", IntegerType, nullable = true))) +val filters = (1 to 2000).map(LessThan("a", _)).toArray[Filter] +failAfter(2 seconds) { + OrcFilters.createFilter(schema, filters) --- End diff -- Sure. Something like the test code in the PR description? And marked as `ignore(...)` instead of `test(...)`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22315: [SPARK-25308][SQL] ArrayContains function may return a e...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22315 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95638/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22315: [SPARK-25308][SQL] ArrayContains function may return a e...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22315 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22315: [SPARK-25308][SQL] ArrayContains function may return a e...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22315 **[Test build #95638 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95638/testReport)** for PR 22315 at commit [`59ddb99`](https://github.com/apache/spark/commit/59ddb993790f4bb0ec920a2b0d897d8052c9f108). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22319: [SPARK-25044][SQL][followup] add back UserDefinedFunctio...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22319 **[Test build #95644 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95644/testReport)** for PR 22319 at commit [`4791240`](https://github.com/apache/spark/commit/4791240d08c75d5df23332d0059a4b15197d289f). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22319: [SPARK-25044][SQL][followup] add back UserDefinedFunctio...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22319 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22319: [SPARK-25044][SQL][followup] add back UserDefinedFunctio...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22319 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2811/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21860: [SPARK-24901][SQL]Merge the codegen of RegularHashMap an...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/21860 cc: @cloud-fan @hvanhovell --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22315: [SPARK-25308][SQL] ArrayContains function may return a e...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22315 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22313: [SPARK-25306][SQL] Avoid skewed filter trees to s...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22313#discussion_r214765115 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala --- @@ -71,12 +71,24 @@ private[orc] object OrcFilters { for { // Combines all convertible filters using `And` to produce a single conjunction - conjunction <- convertibleFilters.reduceOption(org.apache.spark.sql.sources.And) + conjunction <- buildTree(convertibleFilters) --- End diff --
In Parquet, this is done as
```
filters
  .flatMap(ParquetFilters.createFilter(requiredSchema, _))
  .reduceOption(FilterApi.and)
```
Can we follow it?
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22313: [SPARK-25306][SQL] Avoid skewed filter trees to s...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22313#discussion_r214765026 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala --- @@ -71,12 +71,24 @@ private[orc] object OrcFilters { for { // Combines all convertible filters using `And` to produce a single conjunction - conjunction <- convertibleFilters.reduceOption(org.apache.spark.sql.sources.And) + conjunction <- buildTree(convertibleFilters) --- End diff -- Does Parquet have the same problem? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22313: [SPARK-25306][SQL] Avoid skewed filter trees to s...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22313#discussion_r214764993 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala --- @@ -383,4 +386,13 @@ class OrcFilterSuite extends OrcTest with SharedSQLContext { )).get.toString } } + + test("SPARK-25306 createFilter should not hang") { +import org.apache.spark.sql.sources._ +val schema = new StructType(Array(StructField("a", IntegerType, nullable = true))) +val filters = (1 to 2000).map(LessThan("a", _)).toArray[Filter] +failAfter(2 seconds) { + OrcFilters.createFilter(schema, filters) --- End diff -- This test looks tricky... It's a bad practice to assume some code will return in a certain time. Can we just add a microbenchmark for it? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
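As a rough illustration of the microbenchmark alternative suggested above, the sketch below reports a measured time instead of asserting a deadline the way `failAfter(2 seconds)` does. It is not Spark's benchmark utility: `MicroBench` and `bestMillis` are hypothetical names, and wall-clock timing via `System.nanoTime` is an assumption chosen for simplicity.

```scala
object MicroBench {
  /**
   * Runs `body` `warmup` times untimed, then `iters` times timed, and returns
   * the best (smallest) wall-clock time in milliseconds. Hypothetical helper,
   * for illustration only.
   */
  def bestMillis(warmup: Int, iters: Int)(body: => Unit): Double = {
    require(warmup >= 0 && iters >= 1, "need warmup >= 0 and iters >= 1")
    // Warm-up runs give the JIT a chance to compile the hot path first.
    (1 to warmup).foreach(_ => body)
    (1 to iters).map { _ =>
      val start = System.nanoTime()
      body
      (System.nanoTime() - start) / 1e6
    }.min
  }
}
```

A call such as `MicroBench.bestMillis(warmup = 3, iters = 5) { OrcFilters.createFilter(schema, filters) }` would print or record a number that reviewers can compare across commits, without making the test suite fail on a slow CI machine.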
[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22313 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95637/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22313 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22313 **[Test build #95637 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95637/testReport)** for PR 22313 at commit [`4acbaf8`](https://github.com/apache/spark/commit/4acbaf8be9e572c5cdbc61c49b488e8aef9e646b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22321: [DOC] Update the 'Specifying the Hadoop Version' link in...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22321 Thank you for your first contribution, @kisimple . As @kiszk mentioned, could you fix those files, too? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22204: [SPARK-25196][SQL] Analyze column statistics in cached q...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/22204 ok, I'll do that. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22204: [SPARK-25196][SQL] Analyze column statistics in cached q...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22204 Thank you, @maropu . BTW, if this PR aims to provide `ANALYZE` command interface to users, could you update the PR content and test cases for that? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22218: [SPARK-25228][CORE]Add executor CPU time metric.
Github user maropu commented on the issue: https://github.com/apache/spark/pull/22218 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22179: [SPARK-23131][BUILD] Upgrade Kryo to 4.0.2
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/22179#discussion_r214762021 --- Diff: core/src/test/scala/org/apache/spark/serializer/KryoSerializerSuite.scala --- @@ -412,6 +412,26 @@ class KryoSerializerSuite extends SparkFunSuite with SharedSparkContext { assert(!ser2.getAutoReset) } + test("ClassCastException when writing a Map after previously " + --- End diff -- Since this is a bug fix test case, could you add `SPARK-25176` like `SPARK-25176 ClassCastException ...`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22320: [SPARK-25313][SQL]Fix regression in FileFormatWri...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/22320#discussion_r214761843 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala --- @@ -69,7 +69,7 @@ case class InsertIntoHiveTable( query: LogicalPlan, overwrite: Boolean, ifPartitionNotExists: Boolean, -outputColumns: Seq[Attribute]) extends SaveAsHiveFile { +outputColumnNames: Seq[String]) extends SaveAsHiveFile { --- End diff -- thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22316: [SPARK-25048][SQL] Pivoting by multiple columns i...
Github user maropu commented on a diff in the pull request: https://github.com/apache/spark/pull/22316#discussion_r214761811 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFramePivotSuite.scala --- @@ -308,4 +308,27 @@ class DataFramePivotSuite extends QueryTest with SharedSQLContext { assert(exception.getMessage.contains("aggregate functions are not allowed")) } + + test("pivoting column list with values") { +val expected = Row(2012, 1.0, null) :: Row(2013, 48000.0, 3.0) :: Nil +val df = trainingSales + .groupBy($"sales.year") + .pivot(struct(lower($"sales.course"), $"training"), Seq( +struct(lit("dotnet"), lit("Experts")), +struct(lit("java"), lit("Dummies"))) + ).agg(sum($"sales.earnings")) + +checkAnswer(df, expected) + } + + test("pivoting column list") { +val exception = intercept[RuntimeException] { + trainingSales +.groupBy($"sales.year") +.pivot(struct(lower($"sales.course"), $"training")) +.agg(sum($"sales.earnings")) +.collect() --- End diff -- I tried in your branch; ``` scala> df.show +++ |training| sales| +++ | Experts|[dotNET, 2012, 10...| | Experts|[JAVA, 2012, 2000...| | Dummies|[dotNet, 2012, 50...| | Experts|[dotNET, 2013, 48...| | Dummies|[Java, 2013, 3000...| +++ scala> df.groupBy($"sales.year").pivot(struct(lower($"sales.course"), $"training")).agg(sum($"sales.earnings")) java.lang.RuntimeException: Unsupported literal type class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema [dotnet,Dummies] at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78) at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$create$2.apply(literals.scala:164) at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$create$2.apply(literals.scala:164) at scala.util.Try.getOrElse(Try.scala:79) at org.apache.spark.sql.catalyst.expressions.Literal$.create(literals.scala:163) at org.apache.spark.sql.functions$.typedLit(functions.scala:127) ``` I miss something? 
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22179: [SPARK-23131][BUILD] Upgrade Kryo to 4.0.2
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22179 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2810/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22179: [SPARK-23131][BUILD] Upgrade Kryo to 4.0.2
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22179 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22179: [SPARK-23131][BUILD] Upgrade Kryo to 4.0.2
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22179 **[Test build #95643 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95643/testReport)** for PR 22179 at commit [`f2fb28d`](https://github.com/apache/spark/commit/f2fb28da3eb272651530b77dbd4ea33511f0727d). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22179: [SPARK-23131][BUILD] Upgrade Kryo to 4.0.2
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22179 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22240: [SPARK-25248] [CORE] Audit barrier Scala APIs for 2.4
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22240 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95641/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22240: [SPARK-25248] [CORE] Audit barrier Scala APIs for 2.4
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22240 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org