[GitHub] spark pull request #20560: [SPARK-23375][SQL] Eliminate unneeded Sort in Opt...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20560#discussion_r180664561 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala --- @@ -274,3 +279,7 @@ abstract class BinaryNode extends LogicalPlan { override final def children: Seq[LogicalPlan] = Seq(left, right) } + +abstract class KeepOrderUnaryNode extends UnaryNode { --- End diff -- thanks for the suggestion. I'd love to hear also @cloud-fan's and @wzhfy's opinion on this in order to choose all together the best name for it. What do you think? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20560: [SPARK-23375][SQL] Eliminate unneeded Sort in Opt...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20560#discussion_r180664594 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala --- @@ -22,10 +22,11 @@ import org.apache.spark.sql.{execution, Row} import org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.catalyst.expressions._ import org.apache.spark.sql.catalyst.plans.{Cross, FullOuter, Inner, LeftOuter, RightOuter} -import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Repartition} +import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Repartition, Sort} import org.apache.spark.sql.catalyst.plans.physical._ import org.apache.spark.sql.execution.columnar.InMemoryRelation -import org.apache.spark.sql.execution.exchange.{EnsureRequirements, ReusedExchangeExec, ReuseExchange, ShuffleExchangeExec} +import org.apache.spark.sql.execution.exchange.{EnsureRequirements, ReusedExchangeExec, ReuseExchange, + ShuffleExchangeExec} --- End diff -- why? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21019: [SPARK-23948] Trigger mapstage's job listener in submitM...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/21019 @squito @cloud-fan How do you think this change ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21034: [SPARK-23926][SQL] Extending reverse function to support...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21034 **[Test build #89183 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89183/testReport)** for PR 21034 at commit [`3a76d87`](https://github.com/apache/spark/commit/3a76d87492a8d03a29aeee877f5564071b0a68a8). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21039: [SPARK-23960][SQL][MINOR] Mark HashAggregateExec.bufVars...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21039 **[Test build #89185 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89185/testReport)** for PR 21039 at commit [`0b5a075`](https://github.com/apache/spark/commit/0b5a075b9de40d36379551720d7e0e07ad6deb40). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21039: [SPARK-23960][SQL][MINOR] Mark HashAggregateExec.bufVars...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/21039 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20981: [SPARK-23873][SQL] Use accessors in interpreted LambdaVa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20981 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20981: [SPARK-23873][SQL] Use accessors in interpreted LambdaVa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20981 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2202/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21007: [SPARK-23942][PYTHON][SQL] Makes collect in PySpa...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21007#discussion_r180710305 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/TestQueryExecutionListener.scala --- @@ -0,0 +1,45 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql + +import java.util.concurrent.atomic.AtomicBoolean + +import org.apache.spark.internal.Logging +import org.apache.spark.sql.execution.QueryExecution +import org.apache.spark.sql.util.QueryExecutionListener + + +class TestQueryExecutionListener extends QueryExecutionListener with Logging { --- End diff -- oops true. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20931: [SPARK-23815][Core]Spark writer dynamic partition overwr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20931 **[Test build #89187 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89187/testReport)** for PR 20931 at commit [`08b9601`](https://github.com/apache/spark/commit/08b9601ee9a0850f8fea0af7c9027bdef2016857). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/19868#discussion_r180713699 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala --- @@ -176,12 +176,13 @@ class HadoopTableReader( val matches = fs.globStatus(pathPattern) matches.foreach(fileStatus => existPathSet += fileStatus.getPath.toString) } -// convert /demo/data/year/month/day to /demo/data/*/*/*/ +// convert /demo/data/year/month/day to /demo/data/year/month/*/ --- End diff -- This is a pretty old logic. Can you explain what's going on here and why your change works? It can help other people to understand your change quickly, --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21039: [SPARK-23960][SQL][MINOR] Mark HashAggregateExec....
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/21039#discussion_r180719456 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala --- @@ -236,6 +236,8 @@ case class HashAggregateExec( | } """.stripMargin) +bufVars = null // explicitly null this field out to allow the referent to be GC'd sooner --- End diff -- I am curious what happens when `bufVars ` is accessed in `doConsumeWithoutKeys`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20345: [SPARK-23172][SQL] Expand the ReorderJoin rule to handle...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20345 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2198/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21014: [SPARK-23941][Mesos] Mesos task failed on specific spark...
Github user tiboun commented on the issue: https://github.com/apache/spark/pull/21014 Ah yes, correct Vanzin, I didn't think about that use case. So I will apply shellEscape for both in this PR. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21021: [SPARK-23921][SQL] Add array_sort function
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21021 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2199/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21021: [SPARK-23921][SQL] Add array_sort function
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21021 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20345: [SPARK-23172][SQL] Expand the ReorderJoin rule to handle...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20345 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21007: [SPARK-23942][PYTHON][SQL] Makes collect in PySpa...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21007#discussion_r180705768 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/TestQueryExecutionListener.scala --- @@ -0,0 +1,45 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql + +import java.util.concurrent.atomic.AtomicBoolean + +import org.apache.spark.internal.Logging +import org.apache.spark.sql.execution.QueryExecution +import org.apache.spark.sql.util.QueryExecutionListener + + +class TestQueryExecutionListener extends QueryExecutionListener with Logging { --- End diff -- No need to with Logging now? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20981: [SPARK-23873][SQL] Use accessors in interpreted LambdaVa...
Github user hvanhovell commented on the issue: https://github.com/apache/spark/pull/20981 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20560: [SPARK-23375][SQL] Eliminate unneeded Sort in Opt...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20560#discussion_r180716400 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala --- @@ -169,4 +169,6 @@ case class InMemoryRelation( override protected def otherCopyArgs: Seq[AnyRef] = Seq(_cachedColumnBuffers, sizeInBytesStats, statsOfPlanToCache) + + override def outputOrdering: Seq[SortOrder] = child.outputOrdering --- End diff -- We should carry the logical ordering from the cached logical plan when building the `InMemoryRelation` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21037: [SPARK-23919][SQL] Add array_position function
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21037 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2203/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20871: [SPARK-23762][SQL] UTF8StringBuffer uses MemoryBlock
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20871 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20874: [SPARK-23763][SQL] OffHeapColumnVector uses MemoryBlock
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20874 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2205/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20874: [SPARK-23763][SQL] OffHeapColumnVector uses MemoryBlock
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20874 **[Test build #89190 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89190/testReport)** for PR 20874 at commit [`9ef19df`](https://github.com/apache/spark/commit/9ef19dfcde9dc84f494bff5f03a56db840741496). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21031: [SPARK-23923][SQL] Add cardinality function
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21031 **[Test build #89189 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89189/testReport)** for PR 21031 at commit [`d405d8a`](https://github.com/apache/spark/commit/d405d8a37c6e46340f49ff449b8a5aeca80de751). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21037: [SPARK-23919][SQL] Add array_position function
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21037 **[Test build #89188 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89188/testReport)** for PR 21037 at commit [`535e511`](https://github.com/apache/spark/commit/535e511a4d33d815a621b57c47901886b2ce8fc7). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20871: [SPARK-23762][SQL] UTF8StringBuffer uses MemoryBlock
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20871 **[Test build #89191 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89191/testReport)** for PR 20871 at commit [`a7665fd`](https://github.com/apache/spark/commit/a7665fd64875e1343796c8923126d715b5376808). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20874: [SPARK-23763][SQL] OffHeapColumnVector uses MemoryBlock
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20874 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21031: [SPARK-23923][SQL] Add cardinality function
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21031 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21031: [SPARK-23923][SQL] Add cardinality function
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21031 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2204/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21037: [SPARK-23919][SQL] Add array_position function
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21037 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20871: [SPARK-23762][SQL] UTF8StringBuffer uses MemoryBlock
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20871 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2206/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21035: [SPARK-23847][FOLLOWUP][PYTHON][SQL] Actually test [desc...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21035 Merged to master. Thanks @felixcheung. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20345: [SPARK-23172][SQL] Expand the ReorderJoin rule to handle...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20345 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20345: [SPARK-23172][SQL] Expand the ReorderJoin rule to handle...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20345 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2207/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21038: [SPARK-22968][DStream] Fix Kafka connector partition rev...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21038 **[Test build #89177 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89177/testReport)** for PR 21038 at commit [`f317dec`](https://github.com/apache/spark/commit/f317dec0d863a717dc424707571453b11c43e700). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21038: [SPARK-22968][DStream] Fix Kafka connector partition rev...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21038 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2197/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21038: [SPARK-22968][DStream] Fix Kafka connector partition rev...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21038 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15770: [SPARK-15784][ML]:Add Power Iteration Clustering to spar...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/15770 @wangmiao1981 If you're busy I can help take over this. -:) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20345: [SPARK-23172][SQL] Expand the ReorderJoin rule to handle...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20345 **[Test build #89178 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89178/testReport)** for PR 20345 at commit [`94d9171`](https://github.com/apache/spark/commit/94d9171b8ec26c21724dd393cf4fc83ff52623e7). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20931: [SPARK-23815][Core]Spark writer dynamic partition overwr...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20931 LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21034: [SPARK-23926][SQL] Extending reverse function to support...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21034 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89183/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20535: [SPARK-23341][SQL] define some standard options f...
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/20535#discussion_r180714763 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/DataSourceOptions.java --- @@ -17,16 +17,61 @@ package org.apache.spark.sql.sources.v2; +import java.io.IOException; import java.util.HashMap; import java.util.Locale; import java.util.Map; import java.util.Optional; +import java.util.stream.Stream; + +import com.fasterxml.jackson.databind.ObjectMapper; import org.apache.spark.annotation.InterfaceStability; /** * An immutable string-to-string map in which keys are case-insensitive. This is used to represent * data source options. + * + * Each data source implementation can define its own options and teach its users how to set them. + * Spark doesn't have any restrictions about what options a data source should or should not have. + * Instead Spark defines some standard options that data sources can optionally adopt. It's possible + * that some options are very common and many data sources use them. However different data + * sources may define the common options(key and meaning) differently, which is quite confusing to + * end users. + * + * The standard options defined by Spark: + * + * + * Option key + * Option value + * + * + * path + * A path string of the data files/directories, like + * path1, /absolute/file2, path3/*. The path can + * either be relative or absolute, points to either file or directory, and can contain + * wildcards. This option is commonly used by file-based data sources. + * + * + * paths + * A JSON array style paths string of the data files/directories, like + * ["path1", "/absolute/file2"]. The format of each path is same as the + * path option, plus it should follow JSON string literal format, e.g. quotes + * should be escaped, pa\"th means pa"th. --- End diff -- pa\"th? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21034: [SPARK-23926][SQL] Extending reverse function to support...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21034 **[Test build #89183 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89183/testReport)** for PR 21034 at commit [`3a76d87`](https://github.com/apache/spark/commit/3a76d87492a8d03a29aeee877f5564071b0a68a8). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class Reverse(child: Expression) extends UnaryExpression with ImplicitCastInputTypes ` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20560: [SPARK-23375][SQL] Eliminate unneeded Sort in Opt...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20560#discussion_r180714835 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala --- @@ -274,3 +279,7 @@ abstract class BinaryNode extends LogicalPlan { override final def children: Seq[LogicalPlan] = Seq(left, right) } + +abstract class KeepOrderUnaryNode extends UnaryNode { --- End diff -- OrderPreservingUnaryNode sounds better. It only makes sense for unary node, so I don't think mixin trait is a good idea. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21034: [SPARK-23926][SQL] Extending reverse function to support...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21034 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21009: [SPARK-23905][SQL] Add UDF weekday
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21009 **[Test build #89181 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89181/testReport)** for PR 21009 at commit [`79beb00`](https://github.com/apache/spark/commit/79beb00026fb2dd2df295bccd797aaf21746ebdb). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21014: [SPARK-23941][Mesos] Mesos task failed on specific spark...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21014 **[Test build #89180 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89180/testReport)** for PR 21014 at commit [`4732c4b`](https://github.com/apache/spark/commit/4732c4ba0f47d35d9f3b8a67a3fc272a78f82d7c). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21021: [SPARK-23921][SQL] Add array_sort function
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21021 **[Test build #89179 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89179/testReport)** for PR 21021 at commit [`0203920`](https://github.com/apache/spark/commit/020392024f19cbc1ce172051643e15dade7e7b19). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20560: [SPARK-23375][SQL] Eliminate unneeded Sort in Opt...
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/20560#discussion_r180664857 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -733,6 +735,17 @@ object EliminateSorts extends Rule[LogicalPlan] { } } +/** + * Removes Sort operation if the child is already sorted + */ +object RemoveRedundantSorts extends Rule[LogicalPlan] { + def apply(plan: LogicalPlan): LogicalPlan = plan transform { +case Sort(orders, true, child) if child.outputOrdering.nonEmpty +&& SortOrder.orderingSatisfies(child.outputOrdering, orders) => + child + } --- End diff -- Yes, you're right. Probably we can do this in other PR. May you open a JIRA for this? Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21014: [SPARK-23941][Mesos] Mesos task failed on specific spark...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21014 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21014: [SPARK-23941][Mesos] Mesos task failed on specific spark...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21014 **[Test build #89180 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89180/testReport)** for PR 21014 at commit [`4732c4b`](https://github.com/apache/spark/commit/4732c4ba0f47d35d9f3b8a67a3fc272a78f82d7c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21014: [SPARK-23941][Mesos] Mesos task failed on specific spark...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21014 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89180/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20983: [SPARK-23747][Structured Streaming] Add EpochCoor...
Github user efimpoberezkin commented on a diff in the pull request: https://github.com/apache/spark/pull/20983#discussion_r180679855 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/streaming/continuous/EpochCoordinatorSuite.scala --- @@ -0,0 +1,225 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.streaming.continuous + +import org.mockito.InOrder +import org.mockito.Matchers.{any, eq => eqTo} +import org.mockito.Mockito._ +import org.scalatest.BeforeAndAfterEach +import org.scalatest.mockito.MockitoSugar + +import org.apache.spark._ +import org.apache.spark.rpc.RpcEndpointRef +import org.apache.spark.sql.execution.streaming.continuous._ +import org.apache.spark.sql.sources.v2.reader.streaming.{ContinuousReader, PartitionOffset} +import org.apache.spark.sql.sources.v2.writer.WriterCommitMessage +import org.apache.spark.sql.sources.v2.writer.streaming.StreamWriter +import org.apache.spark.sql.test.SharedSparkSession + +class EpochCoordinatorSuite + extends SparkFunSuite +with SharedSparkSession +with MockitoSugar +with BeforeAndAfterEach { + + private var epochCoordinator: RpcEndpointRef = _ + + private var writer: StreamWriter = _ + private var query: ContinuousExecution = _ + private var orderVerifier: InOrder = _ + + override def beforeEach(): Unit = { +val reader = mock[ContinuousReader] +writer = mock[StreamWriter] +query = mock[ContinuousExecution] +orderVerifier = inOrder(writer, query) + +epochCoordinator + = EpochCoordinatorRef.create(writer, reader, query, "test", 1, spark, SparkEnv.get) + } + + override def afterEach(): Unit = { +SparkEnv.get.rpcEnv.stop(epochCoordinator) + } + + test("single epoch") { +setWriterPartitions(3) +setReaderPartitions(2) + +commitPartitionEpoch(0, 1) +commitPartitionEpoch(1, 1) +commitPartitionEpoch(2, 1) +reportPartitionOffset(0, 1) +reportPartitionOffset(1, 1) + +// Here and in subsequent tests this is called to make a synchronous call to EpochCoordinator +// so that mocks would have been acted upon by the time verification happens +makeSynchronousCall() + +verifyCommit(1) + } + + test("single epoch, all but one writer partition has committed") { +setWriterPartitions(3) +setReaderPartitions(2) + +commitPartitionEpoch(0, 1) +commitPartitionEpoch(1, 1) +reportPartitionOffset(0, 1) +reportPartitionOffset(1, 1) + +makeSynchronousCall() + +verifyCommitHasntHappened(1) + } + + test("single epoch, all but one reader partition has reported an offset") { +setWriterPartitions(3) +setReaderPartitions(2) + +commitPartitionEpoch(0, 1) +commitPartitionEpoch(1, 1) +commitPartitionEpoch(2, 1) +reportPartitionOffset(0, 1) + +makeSynchronousCall() + +verifyCommitHasntHappened(1) --- End diff -- @tdas @jose-torres done --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20981: [SPARK-23873][SQL] Use accessors in interpreted L...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/20981#discussion_r180685351 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/InternalRow.scala --- @@ -119,4 +119,28 @@ object InternalRow { case v: MapData => v.copy() case _ => value } + + /** + * Returns an accessor for an `InternalRow` with given data type. The returned accessor + * actually takes a `SpecializedGetters` input because it can be generalized to other classes + * that implements `SpecializedGetters` (e.g., `ArrayData`) too. + */ + def getAccessor(dataType: DataType): (SpecializedGetters, Int) => Any = dataType match { --- End diff -- Hehe - good point. Let's leave it here for now. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20931: [SPARK-23815][Core]Spark writer dynamic partition overwr...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20931 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20560: [SPARK-23375][SQL] Eliminate unneeded Sort in Opt...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20560#discussion_r180715118 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala --- @@ -733,6 +735,17 @@ object EliminateSorts extends Rule[LogicalPlan] { } } +/** + * Removes Sort operation if the child is already sorted + */ +object RemoveRedundantSorts extends Rule[LogicalPlan] { + def apply(plan: LogicalPlan): LogicalPlan = plan transform { +case Sort(orders, true, child) if child.outputOrdering.nonEmpty +&& SortOrder.orderingSatisfies(child.outputOrdering, orders) => + child + } --- End diff -- This is a good follow-up --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21038: [SPARK-22968][DStream] Fix Kafka partition revoked issue
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/21038 @koeninger would you please help to review, thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20345: [SPARK-23172][SQL] Expand the ReorderJoin rule to handle...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20345 **[Test build #89192 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89192/testReport)** for PR 20345 at commit [`94d9171`](https://github.com/apache/spark/commit/94d9171b8ec26c21724dd393cf4fc83ff52623e7). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21036: [SPARK-23958][CORE] HadoopRdd filters empty files to avo...
Github user guoxiaolongzte commented on the issue: https://github.com/apache/spark/pull/21036 Thanks, I will try to add test cases. @felixcheung --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21038: [SPARK-22968][DStream] Fix Kafka connector partit...
GitHub user jerryshao opened a pull request: https://github.com/apache/spark/pull/21038 [SPARK-22968][DStream] Fix Kafka connector partition revoked issue ## What changes were proposed in this pull request? Kafka partitions can be revoked when new consumers joined in the consumer group to rebalance the partitions. But current Spark Kafka connector code doesn't consider partition revoking scenarios, trying to get latest offset from revoked partitions, which will throw exceptions as JIRA mentioned. To reproduce this issue, user could start two `DirectKafkaWordCount` example subsequently. When the later one is started, it will trigger rebalance and reallocate the partitions across all the consumers, then the former one will throw an exception and stop the app. To fix it, Spark Kafka connector should consider partition revoking scenarios and don't maintain offsets for revoked partitions. Besides, this PR also fixes bugs in `DirectKafkaWordCount`, this example simply cannot be worked without the fix. ``` 8/01/05 09:48:27 INFO internals.ConsumerCoordinator: Revoking previously assigned partitions [kssh-7, kssh-4, kssh-3, kssh-6, kssh-5, kssh-0, kssh-2, kssh-1] for group use_a_separate_group_id_for_each_stream 18/01/05 09:48:27 INFO internals.AbstractCoordinator: (Re-)joining group use_a_separate_group_id_for_each_stream 18/01/05 09:48:27 INFO internals.AbstractCoordinator: Successfully joined group use_a_separate_group_id_for_each_stream with generation 4 18/01/05 09:48:27 INFO internals.ConsumerCoordinator: Setting newly assigned partitions [kssh-7, kssh-4, kssh-6, kssh-5] for group use_a_separate_group_id_for_each_stream ``` ## How was this patch tested? This is manually verified in local cluster, unfortunately I'm not sure how to simulate it in UT, so propose the PR without UT added. You can merge this pull request into a Git repository by running: $ git pull https://github.com/jerryshao/apache-spark SPARK-22968 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21038.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21038 commit f317dec0d863a717dc424707571453b11c43e700 Author: jerryshaoDate: 2018-04-11T06:59:15Z Fix Kafka connector partition revoked issue --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21007: [SPARK-23942][PYTHON][SQL] Makes collect in PySpark as a...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21007 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20345: [SPARK-23172][SQL] Expand the ReorderJoin rule to handle...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20345 **[Test build #89178 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89178/testReport)** for PR 20345 at commit [`94d9171`](https://github.com/apache/spark/commit/94d9171b8ec26c21724dd393cf4fc83ff52623e7). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21004: [SPARK-23896][SQL]Improve PartitioningAwareFileIn...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21004#discussion_r180720924 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionProviderCompatibilitySuite.scala --- @@ -81,7 +81,7 @@ class PartitionProviderCompatibilitySuite HiveCatalogMetrics.reset() assert(spark.sql("select * from test where partCol < 2").count() == 2) assert(HiveCatalogMetrics.METRIC_PARTITIONS_FETCHED.getCount() == 2) - assert(HiveCatalogMetrics.METRIC_FILES_DISCOVERED.getCount() == 2) + assert(HiveCatalogMetrics.METRIC_FILES_DISCOVERED.getCount() == 7) --- End diff -- what happened here? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20345: [SPARK-23172][SQL] Expand the ReorderJoin rule to handle...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/20345 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21038: [SPARK-22968][DStream] Fix Kafka partition revoked issue
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21038 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21038: [SPARK-22968][DStream] Fix Kafka partition revoked issue
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21038 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89177/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21038: [SPARK-22968][DStream] Fix Kafka partition revoked issue
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21038 **[Test build #89177 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89177/testReport)** for PR 21038 at commit [`f317dec`](https://github.com/apache/spark/commit/f317dec0d863a717dc424707571453b11c43e700). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21021: [SPARK-23921][SQL] Add array_sort function
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/21021 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21034: [SPARK-23926][SQL] Extending reverse function to ...
Github user mn-mikke commented on a diff in the pull request: https://github.com/apache/spark/pull/21034#discussion_r180686345 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala --- @@ -212,6 +213,96 @@ case class SortArray(base: Expression, ascendingOrder: Expression) override def prettyName: String = "sort_array" } +/** + * Returns a reversed string or an array with reverse order of elements. + */ +@ExpressionDescription( + usage = "_FUNC_(array) - Returns a reversed string or an array with reverse order of elements.", + examples = """ +Examples: + > SELECT _FUNC_('Spark SQL'); + LQS krapS + > SELECT _FUNC_(array(2, 1, 4, 3)); + [3, 4, 1, 2] + """, + since = "2.4.0") +case class Reverse(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { + + // Input types are utilized by type coercion in ImplicitTypeCasts. + override def inputTypes: Seq[AbstractDataType] = Seq(StringType) + + val allowedTypes = Seq(StringType, ArrayType) + + override def dataType: DataType = child.dataType + + lazy val elementType: DataType = dataType.asInstanceOf[ArrayType].elementType + + override def checkInputDataTypes(): TypeCheckResult = { --- End diff -- Great idea! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21034: [SPARK-23926][SQL] Extending reverse function to support...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/21034 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wait for ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20888 **[Test build #89184 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89184/testReport)** for PR 20888 at commit [`4e7be70`](https://github.com/apache/spark/commit/4e7be70e6d35f290a8ddd6278a9feccbc06791aa). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20888: [SPARK-23775][TEST] Make DataFrameRangeSuite not flaky
Github user gaborgsomogyi commented on the issue: https://github.com/apache/spark/pull/20888 Updated the title and the description to reflect the changes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20981: [SPARK-23873][SQL] Use accessors in interpreted LambdaVa...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20981 **[Test build #89186 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89186/testReport)** for PR 20981 at commit [`5711ea1`](https://github.com/apache/spark/commit/5711ea12574a67ee4e309b81c6d616e8841c4dd6). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21007: [SPARK-23942][PYTHON][SQL] Makes collect in PySpark as a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21007 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2200/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21007: [SPARK-23942][PYTHON][SQL] Makes collect in PySpark as a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21007 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20981: [SPARK-23873][SQL] Use accessors in interpreted L...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20981#discussion_r180684792 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/InternalRow.scala --- @@ -119,4 +119,28 @@ object InternalRow { case v: MapData => v.copy() case _ => value } + + /** + * Returns an accessor for an `InternalRow` with given data type. The returned accessor + * actually takes a `SpecializedGetters` input because it can be generalized to other classes + * that implements `SpecializedGetters` (e.g., `ArrayData`) too. + */ + def getAccessor(dataType: DataType): (SpecializedGetters, Int) => Any = dataType match { --- End diff -- As `SpecializedGetters` is a java interface, seems we can't create a companion object for it? Or I miss something? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21039: [SPARK-23960][SQL][MINOR] Mark HashAggregateExec....
GitHub user rednaxelafx opened a pull request: https://github.com/apache/spark/pull/21039 [SPARK-23960][SQL][MINOR] Mark HashAggregateExec.bufVars as transient ## What changes were proposed in this pull request? Mark `HashAggregateExec.bufVars` as transient to avoid it from being serialized. Also manually null out this field at the end of `doProduceWithoutKeys()` to shorten its lifecycle, because it'll no longer be used after that. ## How was this patch tested? Existing tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rednaxelafx/apache-spark codegen-improve Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21039.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21039 commit 0b5a075b9de40d36379551720d7e0e07ad6deb40 Author: Kris MokDate: 2018-04-11T09:35:31Z SPARK-23960: Mark HashAggregateExec.bufVars as transient --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21029: [WIP][SPARK-23952] remove type parameter in DataR...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21029#discussion_r180705366 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/memory.scala --- @@ -143,7 +144,7 @@ case class MemoryStream[A : Encoder](id: Int, sqlContext: SQLContext) logDebug(generateDebugString(newBlocks.flatten, startOrdinal, endOrdinal)) newBlocks.map { block => -new MemoryStreamDataReaderFactory(block).asInstanceOf[DataReaderFactory[UnsafeRow]] +new MemoryStreamDataReaderFactory(block).asInstanceOf[DataReaderFactory] --- End diff -- Array is a java-friendly type. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21029: [WIP][SPARK-23952] remove type parameter in DataR...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21029#discussion_r180705219 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExec.scala --- @@ -95,21 +77,29 @@ case class DataSourceV2ScanExec( sparkContext.getLocalProperty(ContinuousExecution.EPOCH_COORDINATOR_ID_KEY), sparkContext.env) .askSync[Unit](SetReaderPartitions(readerFactories.size)) - new ContinuousDataSourceRDD(sparkContext, sqlContext, readerFactories) -.asInstanceOf[RDD[InternalRow]] - -case r: SupportsScanColumnarBatch if r.enableBatchRead() => - new DataSourceRDD(sparkContext, batchReaderFactories).asInstanceOf[RDD[InternalRow]] - + if (readerFactories.exists(_.dataFormat() == DataFormat.COLUMNAR_BATCH)) { +throw new IllegalArgumentException( + "continuous stream reader does not support columnar read yet.") --- End diff -- We use a type erase hack, and lie to the Scala compiler that we are outputting InternalRow. At runtime, we cast the data to ColumnarBatch in codegen. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20874: [SPARK-23763][SQL] OffHeapColumnVector uses MemoryBlock
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/20874 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21031: [SPARK-23923][SQL] Add cardinality function
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/21031 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20871: [SPARK-23762][SQL] UTF8StringBuffer uses MemoryBlock
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/20871 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21037: [SPARK-23919][SQL] Add array_position function
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/21037 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21004: [SPARK-23896][SQL]Improve PartitioningAwareFileIn...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21004#discussion_r180720733 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala --- @@ -126,35 +126,35 @@ abstract class PartitioningAwareFileIndex( val caseInsensitiveOptions = CaseInsensitiveMap(parameters) val timeZoneId = caseInsensitiveOptions.get(DateTimeUtils.TIMEZONE_OPTION) .getOrElse(sparkSession.sessionState.conf.sessionLocalTimeZone) - -userPartitionSchema match { +val inferredPartitionSpec = PartitioningUtils.parsePartitions( + leafDirs, + typeInference = sparkSession.sessionState.conf.partitionColumnTypeInferenceEnabled, + basePaths = basePaths, + timeZoneId = timeZoneId) +userSpecifiedSchema match { case Some(userProvidedSchema) if userProvidedSchema.nonEmpty => -val spec = PartitioningUtils.parsePartitions( - leafDirs, - typeInference = false, - basePaths = basePaths, - timeZoneId = timeZoneId) +val userPartitionSchema = + combineInferredAndUserSpecifiedPartitionSchema(inferredPartitionSpec) -// Without auto inference, all of value in the `row` should be null or in StringType, // we need to cast into the data type that user specified. def castPartitionValuesToUserSchema(row: InternalRow) = { InternalRow((0 until row.numFields).map { i => +val expr = inferredPartitionSpec.partitionColumns.fields(i).dataType match { + case StringType => Literal.create(row.getUTF8String(i), StringType) --- End diff -- why special case string type? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21036: [SPARK-23958][CORE] HadoopRdd filters empty files...
Github user guoxiaolongzte commented on a diff in the pull request: https://github.com/apache/spark/pull/21036#discussion_r180655799 --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala --- @@ -55,7 +56,8 @@ private[spark] class HadoopPartition(rddId: Int, override val index: Int, s: Inp /** * Get any environment variables that should be added to the users environment when running pipes - * @return a Map with the environment variables and corresponding values, it could be empty +* +* @return a Map with the environment variables and corresponding values, it could be empty --- End diff -- Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20345: [SPARK-23172][SQL] Expand the ReorderJoin rule to handle...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20345 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21007: [SPARK-23942][PYTHON][SQL] Makes collect in PySpark as a...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21007 **[Test build #89182 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89182/testReport)** for PR 21007 at commit [`deacb17`](https://github.com/apache/spark/commit/deacb17816c21811efd630e91cee4c30d421eb36). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21038: [SPARK-22968][DStream] Fix Kafka partition revoked issue
Github user Hackeruncle commented on the issue: https://github.com/apache/spark/pull/21038 @jerryshao Thank you very much for this issue. I go to compile and test. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20345: [SPARK-23172][SQL] Expand the ReorderJoin rule to handle...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20345 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89178/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wa...
Github user gaborgsomogyi commented on a diff in the pull request: https://github.com/apache/spark/pull/20888#discussion_r180691534 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameRangeSuite.scala --- @@ -164,10 +164,13 @@ class DataFrameRangeSuite extends QueryTest with SharedSQLContext with Eventuall sparkContext.addSparkListener(listener) for (codegen <- Seq(true, false)) { withSQLConf(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key -> codegen.toString()) { -DataFrameRangeSuite.stageToKill = -1 +DataFrameRangeSuite.stageToKill = DataFrameRangeSuite.INVALID_STAGE_ID val ex = intercept[SparkException] { spark.range(0, 1000L, 1, 1).map { x => -DataFrameRangeSuite.stageToKill = TaskContext.get().stageId() +val taskContext = TaskContext.get() +if (!taskContext.isInterrupted()) { --- End diff -- To answer your question the whole point of this change is to block `DataFrameRangeSuite.stageToKill` overwrite while the first iteration's thread is running after `DataFrameRangeSuite.stageToKill = DataFrameRangeSuite.INVALID_STAGE_ID` happened. This would work also but would be less trivial. I agree if `DataFrameRangeSuite.stageToKill` object member removed and switch to `onTaskStart` then the whole mumbo-jumbo is not required. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21039: [SPARK-23960][SQL][MINOR] Mark HashAggregateExec.bufVars...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21039 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2201/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21039: [SPARK-23960][SQL][MINOR] Mark HashAggregateExec.bufVars...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21039 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21019: [SPARK-23948] Trigger mapstage's job listener in submitM...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21019 cc @jiangxb1987 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20560: [SPARK-23375][SQL] Eliminate unneeded Sort in Opt...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20560#discussion_r180716173 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala --- @@ -22,10 +22,11 @@ import org.apache.spark.sql.{execution, Row} import org.apache.spark.sql.catalyst.InternalRow import org.apache.spark.sql.catalyst.expressions._ import org.apache.spark.sql.catalyst.plans.{Cross, FullOuter, Inner, LeftOuter, RightOuter} -import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Repartition} +import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Repartition, Sort} import org.apache.spark.sql.catalyst.plans.physical._ import org.apache.spark.sql.execution.columnar.InMemoryRelation -import org.apache.spark.sql.execution.exchange.{EnsureRequirements, ReusedExchangeExec, ReuseExchange, ShuffleExchangeExec} +import org.apache.spark.sql.execution.exchange.{EnsureRequirements, ReusedExchangeExec, ReuseExchange, + ShuffleExchangeExec} --- End diff -- it's a unnecessary change. We don't have length limit for imports --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20560: [SPARK-23375][SQL] Eliminate unneeded Sort in Opt...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20560#discussion_r180716049 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala --- @@ -169,4 +169,6 @@ case class InMemoryRelation( override protected def otherCopyArgs: Seq[AnyRef] = Seq(_cachedColumnBuffers, sizeInBytesStats, statsOfPlanToCache) + + override def outputOrdering: Seq[SortOrder] = child.outputOrdering --- End diff -- in SparkPlan ``` /** Specifies how data is ordered in each partition. */ def outputOrdering: Seq[SortOrder] = Nil ``` So we can't do this --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21018: [SPARK-23880][SQL] Do not trigger any jobs for ca...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21018#discussion_r180717823 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala --- @@ -119,26 +119,60 @@ class CacheManager extends Logging { while (it.hasNext) { val cd = it.next() if (cd.plan.find(_.sameResult(plan)).isDefined) { -cd.cachedRepresentation.cachedColumnBuffers.unpersist(blocking) +cd.cachedRepresentation.clearCache(blocking) it.remove() } } } + /** + * Materialize the cache that refers to the given physical plan. --- End diff -- why do we do this? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20451: [SPARK-23146][WIP] Support client mode for Kubernetes cl...
Github user inetfuture commented on the issue: https://github.com/apache/spark/pull/20451 May I ask, what's the release plan for this? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21039: [SPARK-23960][SQL][MINOR] Mark HashAggregateExec....
Github user rednaxelafx commented on a diff in the pull request: https://github.com/apache/spark/pull/21039#discussion_r180721843 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala --- @@ -236,6 +236,8 @@ case class HashAggregateExec( | } """.stripMargin) +bufVars = null // explicitly null this field out to allow the referent to be GC'd sooner --- End diff -- The workflow of whole-stage codegen ensures that `doConsumeWithoutKeys` can only be on the call stack when `doProduceWithoutKeys` is also on the call stack; the liveness of the former is strictly a subset of the latter. That's because for a plan tree that looks like: ``` +- A +- B +- C ``` The whole-stage codegen system (mostly) works like: ``` A.produce |--> B.produce | |--> C.produce | | |--> B.consume | | | |--> A.consume | | | || | | | |<---o | | |
[GitHub] spark issue #20345: [SPARK-23172][SQL] Expand the ReorderJoin rule to handle...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/20345 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org