[GitHub] spark pull request #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark t...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/22661#discussion_r224676911 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/JoinBenchmark.scala --- @@ -19,229 +19,163 @@ package org.apache.spark.sql.execution.benchmark import org.apache.spark.sql.execution.joins._ import org.apache.spark.sql.functions._ +import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.types.IntegerType /** - * Benchmark to measure performance for aggregate primitives. - * To run this: - * build/sbt "sql/test-only *benchmark.JoinBenchmark" - * - * Benchmarks in this file are skipped in normal builds. + * Benchmark to measure performance for joins. + * To run this benchmark: + * {{{ + * 1. without sbt: + * bin/spark-submit --class --jars + * 2. build/sbt "sql/test:runMain " + * 3. generate result: + * SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain " + * Results will be written to "benchmarks/JoinBenchmark-results.txt". 
+ * }}} */ -class JoinBenchmark extends BenchmarkWithCodegen { +object JoinBenchmark extends SqlBasedBenchmark { - ignore("broadcast hash join, long key") { + def broadcastHashJoinLongKey(): Unit = { val N = 20 << 20 val M = 1 << 16 -val dim = broadcast(sparkSession.range(M).selectExpr("id as k", "cast(id as string) as v")) -runBenchmark("Join w long", N) { - val df = sparkSession.range(N).join(dim, (col("id") % M) === col("k")) +val dim = broadcast(spark.range(M).selectExpr("id as k", "cast(id as string) as v")) +codegenBenchmark("Join w long", N) { + val df = spark.range(N).join(dim, (col("id") % M) === col("k")) assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[BroadcastHashJoinExec]).isDefined) df.count() } - -/* -Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-b19 on Mac OS X 10.9.5 -Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz -Join w long:Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative - --- -Join w long codegen=false3002 / 3262 7.0 143.2 1.0X -Join w long codegen=true 321 / 371 65.3 15.3 9.3X -*/ } - ignore("broadcast hash join, long key with duplicates") { + def broadcastHashJoinLongKeyWithDuplicates(): Unit = { val N = 20 << 20 val M = 1 << 16 - -val dim = broadcast(sparkSession.range(M).selectExpr("id as k", "cast(id as string) as v")) -runBenchmark("Join w long duplicated", N) { - val dim = broadcast(sparkSession.range(M).selectExpr("cast(id/10 as long) as k")) - val df = sparkSession.range(N).join(dim, (col("id") % M) === col("k")) +val dim = broadcast(spark.range(M).selectExpr("cast(id/10 as long) as k")) +codegenBenchmark("Join w long duplicated", N) { + val df = spark.range(N).join(dim, (col("id") % M) === col("k")) assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[BroadcastHashJoinExec]).isDefined) df.count() } - -/* - *Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-b19 on Mac OS X 10.9.5 - *Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz - *Join w long duplicated: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative - *--- - *Join w long duplicated codegen=false 
3446 / 3478 6.1 164.3 1.0X - *Join w long duplicated codegen=true 322 / 351 65.2 15.3 10.7X - */ } - ignore("broadcast hash join, two int key") { + def broadcastHashJoinTwoIntKey(): Unit = { val N = 20 << 20 val M = 1 << 16 -val dim2 = broadcast(sparkSession.range(M) +val dim2 = broadcast(spark.range(M) .selectExpr("cast(id as int) as k1", "cast(id as int) as k2", "cast(id as string) as v")) -runBenchmark("Join w 2 ints", N) { - val df = sparkSession.range(N).join(dim2, +codegenBenchmark("Join w 2 ints", N) { + val df = spark.range(N).join(dim2, (col("id") % M).cast(IntegerType) === col("k1") && (col("id") % M).cast(IntegerType) === col("k2")) assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[BroadcastHashJoinExec]).isDefined) df.count() } - -/* - *Java HotSpot(TM) 64-Bit Server VM 1.7.0_60-b19 on Mac OS X 10.9.5 - *Intel(R) Core(TM)
[GitHub] spark issue #22668: [SPARK-25675] [Spark Job History] Job UI page does not s...
Github user shivusondur commented on the issue: https://github.com/apache/spark/pull/22668 @gengliangwang @felixcheung If everything is okay, could you please merge the PR? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22696: [SPARK-25708][SQL] HAVING without GROUP BY means global ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22696 Merged build finished. Test PASSed.
[GitHub] spark issue #22696: [SPARK-25708][SQL] HAVING without GROUP BY means global ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22696 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97289/ Test PASSed.
[GitHub] spark issue #22696: [SPARK-25708][SQL] HAVING without GROUP BY means global ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22696 **[Test build #97289 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97289/testReport)** for PR 22696 at commit [`b0dc140`](https://github.com/apache/spark/commit/b0dc140cd125498070143f67abf51204373fa14c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #22646: [SPARK-25654][SQL] Support for nested JavaBean ar...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/22646#discussion_r224671775 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala --- @@ -1115,9 +1126,38 @@ object SQLContext { }) } } -def createConverter(cls: Class[_], dataType: DataType): Any => Any = dataType match { - case struct: StructType => createStructConverter(cls, struct.map(_.dataType)) - case _ => CatalystTypeConverters.createToCatalystConverter(dataType) +def createConverter(t: Type, dataType: DataType): Any => Any = (t, dataType) match { + case (cls: Class[_], struct: StructType) => +// bean type +createStructConverter(cls, struct.map(_.dataType)) + case (arrayType: Class[_], array: ArrayType) if arrayType.isArray => +// array type +val converter = createConverter(arrayType.getComponentType, array.elementType) +value => new GenericArrayData( + (0 until JavaArray.getLength(value)).map(i => +converter(JavaArray.get(value, i))).toArray) + case (_, array: ArrayType) => +// java.util.List type +val cls = classOf[java.util.List[_]] --- End diff -- It seems that `JavaTypeInference.inferDataType()` supports `java.lang.Iterable`, not only `List`, but the serializer/deserializer don't. Should we change `inferDataType()`?
[GitHub] spark issue #22429: [SPARK-25440][SQL] Dumping query execution info to a fil...
Github user boy-uber commented on the issue: https://github.com/apache/spark/pull/22429 > @boy-uber the thing you are suggesting is a pretty big undertaking and beyond the scope of this PR. > > If you are going to add structured plans to the explain output, you probably also want some guarantees about stability over multiple spark versions and you probably also want to be able to reconstruct the plan. Neither is easy. If you want to have this in Spark, then I suggest sending a proposal to the dev list. Yeah, that is a larger change and may need more discussion. Your points about adding structured plans like that are great! Let me send an email to the dev list then! Thanks for the suggestion :)
[GitHub] spark pull request #22466: [SPARK-25464][SQL] Create Database to the locatio...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/22466#discussion_r224667371 --- Diff: python/pyspark/sql/tests.py --- @@ -2993,6 +2990,7 @@ def test_current_database(self): AnalysisException, "does_not_exist", lambda: spark.catalog.setCurrentDatabase("does_not_exist")) +spark.sql("DROP DATABASE some_db") --- End diff -- Should we surround this with try-finally?
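The try-finally shape suggested in this comment can be sketched without a running SparkSession. The block below is only an illustration under stated assumptions: `FakeCatalog` and `run_current_database_test` are hypothetical stand-ins invented here, not PySpark APIs and not the code in `tests.py`.

```python
class FakeCatalog:
    """Hypothetical in-memory stand-in for a catalog; not a PySpark API."""

    def __init__(self):
        self.databases = {"default"}
        self.current = "default"

    def create_database(self, name):
        self.databases.add(name)

    def drop_database(self, name):
        self.databases.discard(name)

    def set_current_database(self, name):
        if name not in self.databases:
            raise ValueError("Database '%s' does not exist" % name)
        self.current = name


def run_current_database_test(catalog):
    catalog.create_database("some_db")
    try:
        catalog.set_current_database("some_db")
        assert catalog.current == "some_db"
        # This call fails on purpose, like setCurrentDatabase("does_not_exist").
        catalog.set_current_database("does_not_exist")
    finally:
        # Runs whether or not the body raised, so some_db never leaks
        # into later tests.
        catalog.set_current_database("default")
        catalog.drop_database("some_db")


catalog = FakeCatalog()
try:
    run_current_database_test(catalog)
except ValueError:
    pass  # the failure still propagates; only the cleanup is guaranteed
```

With a plain trailing `DROP DATABASE` statement, any earlier failure in the test body would skip the cleanup; the `finally` clause removes that dependency.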
[GitHub] spark pull request #22466: [SPARK-25464][SQL] Create Database to the locatio...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/22466#discussion_r224666263 --- Diff: python/pyspark/sql/tests.py --- @@ -350,9 +350,6 @@ def test_sqlcontext_reuses_sparksession(self): def tearDown(self): --- End diff -- Can we remove this method now?
[GitHub] spark issue #22703: [SPARK-25705][BUILD][STREAMING] Remove Kafka 0.8 integra...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22703 **[Test build #97294 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97294/testReport)** for PR 22703 at commit [`6e34ce7`](https://github.com/apache/spark/commit/6e34ce7ab7961531d97655e0733ed92f701fbbfd).
[GitHub] spark issue #22703: [SPARK-25705][BUILD][STREAMING] Remove Kafka 0.8 integra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22703 Merged build finished. Test PASSed.
[GitHub] spark issue #22703: [SPARK-25705][BUILD][STREAMING] Remove Kafka 0.8 integra...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22703 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3913/ Test PASSed.
[GitHub] spark issue #22702: [SPARK-25714] Fix Null Handling in the Optimizer rule Bo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22702 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97288/ Test PASSed.
[GitHub] spark issue #22702: [SPARK-25714] Fix Null Handling in the Optimizer rule Bo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22702 Merged build finished. Test PASSed.
[GitHub] spark issue #22702: [SPARK-25714] Fix Null Handling in the Optimizer rule Bo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22702 **[Test build #97288 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97288/testReport)** for PR 22702 at commit [`a9359ab`](https://github.com/apache/spark/commit/a9359abff62017f46f33ef18d7f56f97c885af3d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #22701: [SPARK-25690][SQL] Analyzer rule HandleNullInputs...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22701
[GitHub] spark issue #22677: [SPARK-25683][Core] Make AsyncEventQueue.lastReportTimes...
Github user shivusondur commented on the issue: https://github.com/apache/spark/pull/22677 @jiangxb1987 Thanks for your comment. I think printing "since Wed Dec 31 16:00:00 PST 1969" still looks strange. Instead, we can print "**since start of the application**" the first time events are dropped, which looks more appropriate. So the first log message should look like this: **18/10/08 17:51:40 WARN AsyncEventQueue: Dropped 1 events since start of the application** instead of **18/10/08 17:51:40 WARN AsyncEventQueue: Dropped 1 events from eventLog since Wed Dec 31 16:00:00 PST 1969.** Please correct me if I am wrong.
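For context on where the odd date comes from: a last-report timestamp of 0 milliseconds is the Unix epoch, which renders as the last afternoon of 1969 in a UTC-8 zone. The sketch below reproduces that and shows one possible shape for the proposed wording. It is an illustration only; `dropped_events_message` is a hypothetical helper, the fixed `PST` offset is an assumption chosen to match the log line, and none of this is Spark's actual code.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-8 offset labelled "PST"; an assumption chosen to match the log above.
PST = timezone(timedelta(hours=-8), "PST")
DATE_FORMAT = "%a %b %d %H:%M:%S %Z %Y"


def format_report_time(millis):
    """Render a millisecond timestamp the way the log line above shows it."""
    return datetime.fromtimestamp(millis / 1000, PST).strftime(DATE_FORMAT)


def dropped_events_message(dropped, last_report_millis):
    """Hypothetical helper: wording follows the proposal, not Spark's code."""
    if last_report_millis == 0:
        # No earlier report exists, so avoid printing the epoch as a date.
        return "Dropped %d events since start of the application" % dropped
    return "Dropped %d events from eventLog since %s." % (
        dropped, format_report_time(last_report_millis))


epoch_text = format_report_time(0)
first_message = dropped_events_message(1, 0)
```

Here `epoch_text` is the 1969 date from the current log output, while `first_message` uses the proposed "since start of the application" wording.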
[GitHub] spark issue #22701: [SPARK-25690][SQL] Analyzer rule HandleNullInputsForUDF ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22701 LGTM Thanks! Merged to master/2.4
[GitHub] spark issue #22575: [SPARK-24630][SS] Support SQLStreaming in Spark
Github user WangTaoTheTonic commented on the issue: https://github.com/apache/spark/pull/22575 What should we do if we want to join two Kafka streams and sink the result to another stream?
[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22661 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97287/ Test PASSed.
[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22661 Merged build finished. Test PASSed.
[GitHub] spark pull request #22375: [SPARK-25388][Test][SQL] Detect incorrect nullabl...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22375
[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22661 **[Test build #97287 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97287/testReport)** for PR 22661 at commit [`3be13b1`](https://github.com/apache/spark/commit/3be13b16f1a59ffbd158265f54ad4f8d511d2018). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22375: [SPARK-25388][Test][SQL] Detect incorrect nullable of Da...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22375 thanks, merging to master!
[GitHub] spark issue #22706: [SPARK-25716][SQL][MINOR] remove unnecessary collection ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22706 Can one of the admins verify this patch?
[GitHub] spark issue #22698: [SPARK-25710][SQL] range should report metrics correctly
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22698 **[Test build #97293 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97293/testReport)** for PR 22698 at commit [`4058a21`](https://github.com/apache/spark/commit/4058a21bcffbf73a3d01edd76fb67ead434fb91c).
[GitHub] spark pull request #22698: [SPARK-25710][SQL] range should report metrics co...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/22698#discussion_r224659990 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala --- @@ -506,18 +513,18 @@ case class RangeExec(range: org.apache.spark.sql.catalyst.plans.logical.Range) | $numElementsTodo = 0; | if ($nextBatchTodo == 0) break; | } - | $numOutput.add($nextBatchTodo); - | $inputMetrics.incRecordsRead($nextBatchTodo); | $batchEnd += $nextBatchTodo * ${step}L; | } | | int $localEnd = (int)(($batchEnd - $nextIndex) / ${step}L); | for (int $localIdx = 0; $localIdx < $localEnd; $localIdx++) { | long $value = ((long)$localIdx * ${step}L) + $nextIndex; | ${consume(ctx, Seq(ev))} - | $shouldStop + | $stopCheck | } | $nextIndex = $batchEnd; + | $numOutput.add($localEnd); --- End diff -- If so, then there is no problem. I was thinking that the number-of-output-rows metric of the range operator should be 100 if it is followed by a limit(100) operator.
[GitHub] spark issue #22698: [SPARK-25710][SQL] range should report metrics correctly
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22698 Merged build finished. Test PASSed.
[GitHub] spark issue #22699: [SPARK-25711][Core] Improve start-history-server.sh: sho...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22699 **[Test build #97292 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97292/testReport)** for PR 22699 at commit [`5e05c60`](https://github.com/apache/spark/commit/5e05c604fdc9913a1424a569deb16ec3301bd4e4).
[GitHub] spark issue #22704: [SPARK-25681][K8S][WIP] Leverage a config to tune renewa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22704 Merged build finished. Test PASSed.
[GitHub] spark issue #22704: [SPARK-25681][K8S][WIP] Leverage a config to tune renewa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22704 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97286/ Test PASSed.
[GitHub] spark pull request #22706: [SPARK-25716][SQL][MINOR] remove unnecessary coll...
GitHub user SongYadong opened a pull request: https://github.com/apache/spark/pull/22706 [SPARK-25716][SQL][MINOR] remove unnecessary collection operation in valid constraints generation
## What changes were proposed in this pull request?
The Project logical operator generates valid constraints using two opposite operations: it subtracts the child constraints from all constraints, then unions the child constraints back in. I think this may not be necessary. The Aggregate operator has the same problem as Project. This PR tries to remove these two opposite collection operations.
## How was this patch tested?
Related unit tests: ProjectEstimationSuite CollapseProjectSuite PushProjectThroughUnionSuite UnsafeProjectionBenchmark GeneratedProjectionSuite CodeGeneratorWithInterpretedFallbackSuite TakeOrderedAndProjectSuite GenerateUnsafeProjectionSuite BucketedRandomProjectionLSHSuite RemoveRedundantAliasAndProjectSuite AggregateBenchmark AggregateOptimizeSuite AggregateEstimationSuite DecimalAggregatesSuite DataFrameAggregateSuite ObjectHashAggregateSuite TwoLevelAggregateHashMapSuite ObjectHashAggregateExecBenchmark SingleLevelAggregateHashMapSuite TypedImperativeAggregateSuite RewriteDistinctAggregatesSuite HashAggregationQuerySuite HashAggregationQueryWithControlledFallbackSuite TypedImperativeAggregateSuite TwoLevelAggregateHashMapWithVectorizedMapSuite
You can merge this pull request into a Git repository by running: $ git pull https://github.com/SongYadong/spark generate_constraints
Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22706.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22706
commit fab5faaa838295affdb9a1bfeae1d613eddfb7a1 Author: SongYadong Date: 2018-10-11T14:12:05Z remove unnecessary collection operation in valid constraints generation
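The redundancy described in this pull request can be checked with plain sets. The sketch below is a toy model, not Spark's `validConstraints` code: the function names and the sample constraint strings are made up for illustration, and it assumes the child constraints are a subset of all constraints, which is what the description implies.

```python
def valid_constraints_old(all_constraints, child_constraints):
    # Subtract the child constraints, then union them back in.
    return (all_constraints - child_constraints) | child_constraints


def valid_constraints_new(all_constraints):
    # Skip both steps and use the full set directly.
    return set(all_constraints)


# Made-up constraints; the child set is a subset of the full set by construction.
child = {"a > 0", "b IS NOT NULL"}
all_c = child | {"c = a"}

old_result = valid_constraints_old(all_c, child)
new_result = valid_constraints_new(all_c)
```

Under the subset assumption both results are the same set, so the subtract-then-union round trip adds work without changing the outcome. If the child set could contain elements missing from the full set, the two functions would differ, so the assumption matters.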
[GitHub] spark issue #22699: [SPARK-25711][Core] Improve start-history-server.sh: sho...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/22699 retest this please.
[GitHub] spark pull request #22702: [SPARK-25714] Fix Null Handling in the Optimizer ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22702#discussion_r224658881 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala --- @@ -276,15 +276,15 @@ object BooleanSimplification extends Rule[LogicalPlan] with PredicateHelper { case a And b if a.semanticEquals(b) => a case a Or b if a.semanticEquals(b) => a - case a And (b Or c) if Not(a).semanticEquals(b) => And(a, c) - case a And (b Or c) if Not(a).semanticEquals(c) => And(a, b) - case (a Or b) And c if a.semanticEquals(Not(c)) => And(b, c) - case (a Or b) And c if b.semanticEquals(Not(c)) => And(a, c) - - case a Or (b And c) if Not(a).semanticEquals(b) => Or(a, c) - case a Or (b And c) if Not(a).semanticEquals(c) => Or(a, b) - case (a And b) Or c if a.semanticEquals(Not(c)) => Or(b, c) - case (a And b) Or c if b.semanticEquals(Not(c)) => Or(a, c) + case a And (b Or c) if !a.nullable && Not(a).semanticEquals(b) => And(a, c) --- End diff -- After more thought, `a And (b Or c)` should be better than `If(IsNull(a), null, And(a, c))`, as it's more likely to get pushed down to the data source, so the changes here LGTM.
[GitHub] spark issue #22704: [SPARK-25681][K8S][WIP] Leverage a config to tune renewa...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22704 **[Test build #97286 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97286/testReport)** for PR 22704 at commit [`6e807e1`](https://github.com/apache/spark/commit/6e807e169cc9113c5fcd1653e610ec473c1ff8e8). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22699: [SPARK-25711][Core] Improve start-history-server.sh: sho...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22699 Merged build finished. Test PASSed.
[GitHub] spark issue #22698: [SPARK-25710][SQL] range should report metrics correctly
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22698 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3912/ Test PASSed.
[GitHub] spark issue #22699: [SPARK-25711][Core] Improve start-history-server.sh: sho...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22699 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3911/ Test PASSed.
[GitHub] spark pull request #22375: [SPARK-25388][Test][SQL] Detect incorrect nullabl...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22375#discussion_r224660195 --- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ExpressionEvalHelper.scala --- @@ -69,11 +69,22 @@ trait ExpressionEvalHelper extends GeneratorDrivenPropertyChecks with PlanTestBa /** * Check the equality between result of expression and expected value, it will handle - * Array[Byte], Spread[Double], MapData and Row. + * Array[Byte], Spread[Double], MapData and Row. Also check whether nullable in expression is + * true if result is null */ - protected def checkResult(result: Any, expected: Any, exprDataType: DataType): Boolean = { + protected def checkResult(result: Any, expected: Any, expression: Expression): Boolean = { +checkResult(result, expected, expression.dataType, expression.nullable) + } + + protected def checkResult( + result: Any, + expected: Any, + exprDataType: DataType, + exprNullable: Boolean): Boolean = { val dataType = UserDefinedType.sqlType(exprDataType) +// The result is null for a non-nullable expression +assert(result != null || exprNullable, "exprNullable should be true if result is null") --- End diff -- nit: how about "result cannot be null since it's not nullable."
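The assertion under discussion only fires in one of the four combinations of "result is null" and "expression is nullable". A small sketch of that condition follows; `check_nullability` is a hypothetical helper written for this illustration (it is not part of `ExpressionEvalHelper`), with Python's `None` standing in for null and the message taken from the wording suggested in the comment.

```python
def check_nullability(result, expr_nullable):
    """Hypothetical mirror of `assert(result != null || exprNullable, ...)`."""
    assert result is not None or expr_nullable, \
        "result cannot be null since it's not nullable."
    return True


def violates(result, expr_nullable):
    """Return True when the nullability check rejects this combination."""
    try:
        check_nullability(result, expr_nullable)
        return False
    except AssertionError:
        return True


# Enumerate all four combinations of (result, nullable flag).
outcomes = {
    (result, nullable): violates(result, nullable)
    for result in (1, None)
    for nullable in (True, False)
}
```

Only the `(None, False)` entry is flagged: a null result from an expression that declares itself non-nullable.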
[GitHub] spark pull request #22698: [SPARK-25710][SQL] range should report metrics co...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22698#discussion_r224659380 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala --- @@ -506,18 +513,18 @@ case class RangeExec(range: org.apache.spark.sql.catalyst.plans.logical.Range) | $numElementsTodo = 0; | if ($nextBatchTodo == 0) break; | } - | $numOutput.add($nextBatchTodo); - | $inputMetrics.incRecordsRead($nextBatchTodo); | $batchEnd += $nextBatchTodo * ${step}L; | } | | int $localEnd = (int)(($batchEnd - $nextIndex) / ${step}L); | for (int $localIdx = 0; $localIdx < $localEnd; $localIdx++) { | long $value = ((long)$localIdx * ${step}L) + $nextIndex; | ${consume(ctx, Seq(ev))} - | $shouldStop + | $stopCheck | } | $nextIndex = $batchEnd; + | $numOutput.add($localEnd); --- End diff -- More background: the stop check for limit is done at batch granularity, while the stop check for the result buffer is done at row granularity. That said, even if the limit is smaller than the batch size, the range operator still physically outputs an entire batch.
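The batch-versus-row granularity point can be made concrete with a toy model. The sketch below is not Spark's generated code: `run_range_with_limit` and its parameters are hypothetical, and it only models the counting behaviour described in this comment, where the limit's stop condition is checked once per batch.

```python
def run_range_with_limit(total_rows, batch_size, limit):
    """Toy model: returns (rows output by range, rows output by limit)."""
    range_output = 0
    limit_output = 0
    next_index = 0
    # The limit's stop condition is only consulted between batches.
    while next_index < total_rows and limit_output < limit:
        batch_end = min(next_index + batch_size, total_rows)
        for _ in range(next_index, batch_end):
            # The range operator emits every row of the current batch...
            range_output += 1
            # ...while the limit operator forwards rows only until it is full.
            if limit_output < limit:
                limit_output += 1
        next_index = batch_end
    return range_output, limit_output


metrics = run_range_with_limit(total_rows=10000, batch_size=1000, limit=100)
```

In this model, a limit of 100 over batches of 1000 leaves the range side at 1000 output rows and the limit side at 100, which matches the metric values discussed in the thread.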
[GitHub] spark pull request #22698: [SPARK-25710][SQL] range should report metrics co...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22698#discussion_r224659093 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala --- @@ -506,18 +513,18 @@ case class RangeExec(range: org.apache.spark.sql.catalyst.plans.logical.Range) | $numElementsTodo = 0; | if ($nextBatchTodo == 0) break; | } - | $numOutput.add($nextBatchTodo); - | $inputMetrics.incRecordsRead($nextBatchTodo); | $batchEnd += $nextBatchTodo * ${step}L; | } | | int $localEnd = (int)(($batchEnd - $nextIndex) / ${step}L); | for (int $localIdx = 0; $localIdx < $localEnd; $localIdx++) { | long $value = ((long)$localIdx * ${step}L) + $nextIndex; | ${consume(ctx, Seq(ev))} - | $shouldStop + | $stopCheck | } | $nextIndex = $batchEnd; + | $numOutput.add($localEnd); --- End diff -- That's expected, isn't it? The range operator does output 1000 rows; the limit operator takes 1000 inputs but only outputs, say, 100 rows.
[GitHub] spark pull request #22701: [SPARK-25690][SQL] Analyzer rule HandleNullInputs...
Github user maryannxue commented on a diff in the pull request: https://github.com/apache/spark/pull/22701#discussion_r224658264 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -2150,8 +2150,10 @@ class Analyzer( // TODO: skip null handling for not-nullable primitive inputs after we can completely // trust the `nullable` information. +val needsNullCheck = (nullable: Boolean, expr: Expression) => --- End diff -- Yes, that's because "nullableType" is flipped around here. "nullableType" should really be "cantBeNull" or "doesntNeedNullCheck". I'll change this in another PR.
[GitHub] spark issue #22674: [SPARK-25680][SQL] SQL execution listener shouldn't happ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22674 **[Test build #97291 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97291/testReport)** for PR 22674 at commit [`6e3a345`](https://github.com/apache/spark/commit/6e3a345dd2cfc8071efdacf2a37677a588e00b6d).
[GitHub] spark issue #22674: [SPARK-25680][SQL] SQL execution listener shouldn't happ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22674 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3910/ Test PASSed.
[GitHub] spark issue #22674: [SPARK-25680][SQL] SQL execution listener shouldn't happ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22674 Merged build finished. Test PASSed.
[GitHub] spark pull request #22702: [SPARK-25714] Fix Null Handling in the Optimizer ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22702#discussion_r224655860 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala --- @@ -276,15 +276,15 @@ object BooleanSimplification extends Rule[LogicalPlan] with PredicateHelper { case a And b if a.semanticEquals(b) => a case a Or b if a.semanticEquals(b) => a - case a And (b Or c) if Not(a).semanticEquals(b) => And(a, c) - case a And (b Or c) if Not(a).semanticEquals(c) => And(a, b) - case (a Or b) And c if a.semanticEquals(Not(c)) => And(b, c) - case (a Or b) And c if b.semanticEquals(Not(c)) => And(a, c) - - case a Or (b And c) if Not(a).semanticEquals(b) => Or(a, c) - case a Or (b And c) if Not(a).semanticEquals(c) => Or(a, b) - case (a And b) Or c if a.semanticEquals(Not(c)) => Or(b, c) - case (a And b) Or c if b.semanticEquals(Not(c)) => Or(a, c) + case a And (b Or c) if !a.nullable && Not(a).semanticEquals(b) => And(a, c) --- End diff -- Since this is complicated, shall we put a comment to explain it?
[GitHub] spark issue #22019: [WIP][SPARK-25040][SQL] Empty string for double and floa...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22019 @viirya and @MaxGekk, are you busy? Do you mind if I ask one of you to take this over? We will completely disallow empty strings in other types and target 3.0.0. The changes shouldn't be large, but the migration guide needs to be updated. I will be busy for a couple of weeks, so I would appreciate it if you could find some time to take this over. Otherwise, I will start working on this in a couple of weeks.
[GitHub] spark issue #20125: [SPARK-17967][SQL] Support for array as an option in SQL...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20125 I am sorry it's been inactive. Let me update this one within a week.
[GitHub] spark pull request #22702: [SPARK-25714] Fix Null Handling in the Optimizer ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/22702#discussion_r224655771 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala --- @@ -276,15 +276,15 @@ object BooleanSimplification extends Rule[LogicalPlan] with PredicateHelper { case a And b if a.semanticEquals(b) => a case a Or b if a.semanticEquals(b) => a - case a And (b Or c) if Not(a).semanticEquals(b) => And(a, c) - case a And (b Or c) if Not(a).semanticEquals(c) => And(a, b) - case (a Or b) And c if a.semanticEquals(Not(c)) => And(b, c) - case (a Or b) And c if b.semanticEquals(Not(c)) => And(a, c) - - case a Or (b And c) if Not(a).semanticEquals(b) => Or(a, c) - case a Or (b And c) if Not(a).semanticEquals(c) => Or(a, b) - case (a And b) Or c if a.semanticEquals(Not(c)) => Or(b, c) - case (a And b) Or c if b.semanticEquals(Not(c)) => Or(a, c) + case a And (b Or c) if !a.nullable && Not(a).semanticEquals(b) => And(a, c) --- End diff -- Assuming a is null, then b is also null. If c is null: `a And (b Or c)` -> null, And(a, c) -> null. If c is true: `a And (b Or c)` -> null, And(a, c) -> null. If c is false: `a And (b Or c)` -> null, And(a, c) -> false. So yes, this is a bug, and we should rewrite it to `If(IsNull(a), a, And(a, c))`.
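The case analysis in this comment can be checked mechanically. Below is a minimal sketch, assuming SQL's three-valued logic is modeled with `Option[Boolean]` (`None` standing in for NULL). The object and method names (`ThreeValuedLogic`, `original`, `naive`, `guarded`, `mismatches`) are illustrative only and are not Catalyst's API; `guarded` mirrors the proposed `If(IsNull(a), a, And(a, c))`.

```scala
object ThreeValuedLogic {
  // None models SQL NULL; Some(true) / Some(false) are the ordinary booleans.
  type SqlBool = Option[Boolean]

  def not(a: SqlBool): SqlBool = a.map(!_)

  // SQL AND: false dominates, otherwise NULL is contagious.
  def and(a: SqlBool, b: SqlBool): SqlBool = (a, b) match {
    case (Some(false), _) | (_, Some(false)) => Some(false)
    case (Some(true), Some(true))            => Some(true)
    case _                                   => None
  }

  // SQL OR: true dominates, otherwise NULL is contagious.
  def or(a: SqlBool, b: SqlBool): SqlBool = (a, b) match {
    case (Some(true), _) | (_, Some(true)) => Some(true)
    case (Some(false), Some(false))        => Some(false)
    case _                                 => None
  }

  // The expression before simplification: a AND (NOT a OR c).
  def original(a: SqlBool, c: SqlBool): SqlBool = and(a, or(not(a), c))

  // The old rewrite: a AND c.
  def naive(a: SqlBool, c: SqlBool): SqlBool = and(a, c)

  // The rewrite proposed in the comment: If(IsNull(a), a, And(a, c)).
  def guarded(a: SqlBool, c: SqlBool): SqlBool = if (a.isEmpty) a else and(a, c)

  val values: Seq[SqlBool] = Seq(Some(true), Some(false), None)

  // All (a, c) inputs for which a rewrite disagrees with the original expression.
  def mismatches(rewrite: (SqlBool, SqlBool) => SqlBool): Seq[(SqlBool, SqlBool)] =
    for {
      a <- values
      c <- values
      if original(a, c) != rewrite(a, c)
    } yield (a, c)
}
```

Under this model, `mismatches(naive)` contains only the (NULL, false) case called out above, and `mismatches(guarded)` is empty.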
[GitHub] spark issue #20877: [SPARK-23765][SQL] Supports custom line separator for js...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20877 @MaxGekk, are you busy? Do you have some time to work on CSV's lineSep? I don't think I will have time within the next couple of weeks. If you have some time, I would appreciate it if you could go ahead. Otherwise, I will try this one in a couple of weeks. The problem with CSV's lineSep is multiline support. As you might already know, CSV's multiline mode differs from JSON's in that it parses line by line from the stream, whereas JSON generally treats the input as a whole record, so we should set the lineSep on the Univocity parser as well. The problem is that `lineSep` in the Univocity parser has some limitations (https://github.com/apache/spark/pull/18581#issuecomment-314037750 and see also `https://github.com/uniVocity/univocity-parsers/issues/170`). There are some changes made in https://github.com/apache/spark/pull/18581 . We might be able to extract the CSV-related changes and make some additions and deletions. If this limitation makes it difficult to support a `lineSep` of more than one character, I think we can restrict the lineSep to a single character in `multiLine` mode.
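The fallback suggested at the end of this comment (allow only a single-character `lineSep` when `multiLine` is on) could be expressed as a small option check. This is a sketch under stated assumptions: `LineSepCheck.validate` is a hypothetical helper written for illustration, not part of Spark's `CSVOptions`, and the error messages are invented.

```scala
object LineSepCheck {
  // Hypothetical validation helper; not Spark's actual option handling.
  // Returns Right(lineSep) when the combination is acceptable, Left(message) otherwise.
  def validate(lineSep: Option[String], multiLine: Boolean): Either[String, Option[String]] =
    lineSep match {
      case Some(sep) if sep.isEmpty =>
        Left("lineSep must not be empty")
      case Some(sep) if multiLine && sep.length > 1 =>
        // Assumed restriction from the discussion: one character only in multiLine mode.
        Left(s"lineSep must be a single character in multiLine mode, got ${sep.length} characters")
      case other =>
        Right(other)
    }
}
```

With this shape, `validate(Some("\r\n"), multiLine = true)` is rejected while the same separator passes when `multiLine` is off.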
[GitHub] spark issue #22705: [SPARK-25704][CORE][WIP] Allocate a bit less than Int.Ma...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22705 **[Test build #97290 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97290/testReport)** for PR 22705 at commit [`cb07bad`](https://github.com/apache/spark/commit/cb07badcd853da0e4083b7e02bdfdf86c9d295f1).
[GitHub] spark issue #22705: [SPARK-25704][CORE][WIP] Allocate a bit less than Int.Ma...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22705 Merged build finished. Test PASSed.
[GitHub] spark issue #22705: [SPARK-25704][CORE][WIP] Allocate a bit less than Int.Ma...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22705 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3909/ Test PASSed.
[GitHub] spark pull request #22705: [SPARK-25704][CORE][WIP] Allocate a bit less than...
GitHub user squito opened a pull request: https://github.com/apache/spark/pull/22705 [SPARK-25704][CORE][WIP] Allocate a bit less than Int.MaxValue JVMs don't let you allocate arrays of length exactly Int.MaxValue, so leave a little extra room. This is necessary when reading blocks >2GB off the network (for remote reads or for cache replication). WIP because I'm still running tests on a real cluster. You can merge this pull request into a Git repository by running: $ git pull https://github.com/squito/spark SPARK-25704 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22705.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22705 commit cb07badcd853da0e4083b7e02bdfdf86c9d295f1 Author: Imran Rashid Date: 2018-10-12T01:54:34Z [SPARK-25704][CORE] Allocate a bit less than Int.MaxValue JVMs don't let you allocate arrays of length exactly Int.MaxValue, so leave a little extra room. This is necessary when reading blocks >2GB off the network (for remote reads or for cache replication).
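To illustrate the "leave a little extra room" idea, here is a minimal sketch that splits a large byte count into array-sized chunks, each capped below `Int.MaxValue`. The headroom of 15 and the names `SafeArrayLength`, `MaxSafeArrayLength`, and `chunkSizes` are assumptions made for this example; the PR description does not state the exact margin it uses.

```scala
object SafeArrayLength {
  // Assumed headroom for illustration; many JVMs reject arrays of length
  // Int.MaxValue (or very close to it), so stay a little below the limit.
  val MaxSafeArrayLength: Int = Int.MaxValue - 15

  // Splits totalBytes into chunk lengths that are each safe to allocate as one array.
  def chunkSizes(totalBytes: Long): Seq[Int] = {
    require(totalBytes >= 0, "totalBytes must be non-negative")
    val fullChunks = (totalBytes / MaxSafeArrayLength).toInt
    val remainder = (totalBytes % MaxSafeArrayLength).toInt
    Seq.fill(fullChunks)(MaxSafeArrayLength) ++ (if (remainder > 0) Seq(remainder) else Nil)
  }
}
```

A block larger than 2GB then maps to several chunk lengths whose sum equals the block size, and no single chunk asks the JVM for an array of length `Int.MaxValue`.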
[GitHub] spark issue #22655: [SPARK-25666][PYTHON] Internally document type conversio...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22655 @viirya and @BryanCutler, do you have some time to work on the Pandas one? I don't think I will have time within the next couple of weeks. If you have some time, I would appreciate it if you could go ahead. Otherwise, I will start this one in a couple of weeks.
[GitHub] spark issue #22699: [SPARK-25711][Core] Allow start-history-server.sh to sho...
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/22699 Let's also update the title to include the deprecation changes.
[GitHub] spark issue #22593: [Streaming][DOC] Fix typo & format in DataStreamWriter.s...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22593 Also, let's mention in the PR description and/or title that this PR aims to fix the javadoc.
[GitHub] spark issue #22593: [Streaming][DOC] Fix typo & format in DataStreamWriter.s...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22593 Also, let's mention in the PR description, title and/or JIRA that this PR aims to fix the javadoc.
[GitHub] spark issue #22699: [SPARK-25711][Core] Allow start-history-server.sh to sho...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22699 **[Test build #4373 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/4373/testReport)** for PR 22699 at commit [`5e05c60`](https://github.com/apache/spark/commit/5e05c604fdc9913a1424a569deb16ec3301bd4e4). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22701: [SPARK-25690][SQL] Analyzer rule HandleNullInputsForUDF ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22701 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97283/ Test PASSed.
[GitHub] spark issue #22701: [SPARK-25690][SQL] Analyzer rule HandleNullInputsForUDF ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22701 Merged build finished. Test PASSed.
[GitHub] spark issue #22701: [SPARK-25690][SQL] Analyzer rule HandleNullInputsForUDF ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22701 **[Test build #97283 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97283/testReport)** for PR 22701 at commit [`dfa301e`](https://github.com/apache/spark/commit/dfa301ebdf289d6501a8c0edf44e35e76a043c7d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22695: [MINOR][SQL]remove Redundant semicolons
Github user heary-cao commented on the issue: https://github.com/apache/spark/pull/22695 @srowen, thanks
[GitHub] spark issue #22379: [SPARK-25393][SQL] Adding new function from_csv()
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22379 This looks pretty close to ready to go.
[GitHub] spark issue #22379: [SPARK-25393][SQL] Adding new function from_csv()
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22379 This looks pretty close to ready to go.
[GitHub] spark pull request #22379: [SPARK-25393][SQL] Adding new function from_csv()
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22379#discussion_r224649633 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala --- @@ -19,8 +19,8 @@ package org.apache.spark.sql.execution.datasources.csv import org.apache.spark.rdd.RDD import org.apache.spark.sql.Dataset +import org.apache.spark.sql.catalyst.csv.CSVOptions import org.apache.spark.sql.functions._ -import org.apache.spark.sql.types._ object CSVUtils { --- End diff -- @MaxGekk, actually I was wondering whether it would be difficult to move this under the catalyst package as well.
[GitHub] spark pull request #22379: [SPARK-25393][SQL] Adding new function from_csv()
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22379#discussion_r224649495 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala --- @@ -3854,6 +3854,38 @@ object functions { @scala.annotation.varargs def map_concat(cols: Column*): Column = withExpr { MapConcat(cols.map(_.expr)) } + /** + * Parses a column containing a CSV string into a `StructType` with the specified schema. + * Returns `null`, in the case of an unparseable string. + * + * @param e a string column containing CSV data. + * @param schema the schema to use when parsing the CSV string + * @param options options to control how the CSV is parsed. accepts the same options and the + *CSV data source. + * + * @group collection_funcs + * @since 3.0.0 + */ + def from_csv(e: Column, schema: StructType, options: Map[String, String]): Column = withExpr { --- End diff -- I would suggest we avoid adding overloaded versions for now ... it has one Java-specific version, so it should be usable from the Java side.
[GitHub] spark issue #22666: [SPARK-25672][SQL] schema_of_csv() - schema inference fr...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22666 Let's add from_csv first.
[GitHub] spark pull request #22379: [SPARK-25393][SQL] Adding new function from_csv()
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22379#discussion_r224649188 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala --- @@ -3854,6 +3854,38 @@ object functions { @scala.annotation.varargs def map_concat(cols: Column*): Column = withExpr { MapConcat(cols.map(_.expr)) } + /** + * Parses a column containing a CSV string into a `StructType` with the specified schema. + * Returns `null`, in the case of an unparseable string. + * + * @param e a string column containing CSV data. + * @param schema the schema to use when parsing the CSV string + * @param options options to control how the CSV is parsed. accepts the same options and the + *CSV data source. + * + * @group collection_funcs + * @since 3.0.0 + */ + def from_csv(e: Column, schema: StructType, options: Map[String, String]): Column = withExpr { +CsvToStructs(schema, options, e.expr) + } + + /** + * (Java-specific) Parses a column containing a CSV string into a `StructType` + * with the specified schema. Returns `null`, in the case of an unparseable string. + * + * @param e a string column containing CSV data. + * @param schema the schema to use when parsing the CSV string + * @param options options to control how the CSV is parsed. accepts the same options and the + *CSV data source. + * + * @group collection_funcs + * @since 3.0.0 + */ + def from_csv(e: Column, schema: String, options: java.util.Map[String, String]): Column = { --- End diff -- @MaxGekk, can we replace `schema: String` with `schema: Column` for `schema_of_csv`?
[GitHub] spark issue #22575: [SPARK-24630][SS] Support SQLStreaming in Spark
Github user stczwd commented on the issue: https://github.com/apache/spark/pull/22575
@WangTaoTheTonic Adding the 'stream' keyword has two purposes:
- **Mark the entire sql query as a stream query and generate the SQLStreaming plan tree.**
- **Mark the table type as UnResolvedStreamRelation.** Parse the table as StreamingRelation or other Relation, especially in the stream join batch queries, such as kafka join mysql.

**Besides, the keyword 'stream' makes it easier to express StructStreaming with pure SQL.**
A small example to show the importance of 'stream': read a stream from a kafka stream table, and join mysql to count user messages.
- with 'stream'
  - `select stream kafka_sql_test.name, count(door) from kafka_sql_test inner join mysql_test on kafka_sql_test.name == mysql_test.name group by kafka_sql_test.name`
    - **It will be regarded as Streaming Query using Console Sink**, the kafka_sql_test will be parsed as StreamingRelation and mysql_test will be parsed as JDBCRelation, not Streaming Relation.
  - `insert into csv_sql_table select stream kafka_sql_test.name, count(door) from kafka_sql_test inner join mysql_test on kafka_sql_test.name == mysql_test.name group by kafka_sql_test.name`
    - **It will be regarded as Streaming Query using FileStream Sink**, the kafka_sql_test will be parsed as StreamingRelation and mysql_test will be parsed as JDBCRelation, not Streaming Relation.
- without 'stream'
  - `select kafka_sql.name, count(door) from kafka_sql_test inner join mysql_test on kafka_sql_test.name == mysql_test.name group by kafka_sql_test.name`
    - **It will be regarded as Batch Query**, the kafka_sql_test will be parsed to KafkaRelation and mysql_test will be parsed as JDBCRelation.
[GitHub] spark issue #22696: [SPARK-25708][SQL] HAVING without GROUP BY means global ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22696 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3908/ Test PASSed.
[GitHub] spark issue #22696: [SPARK-25708][SQL] HAVING without GROUP BY means global ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22696 **[Test build #97289 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97289/testReport)** for PR 22696 at commit [`b0dc140`](https://github.com/apache/spark/commit/b0dc140cd125498070143f67abf51204373fa14c).
[GitHub] spark issue #22696: [SPARK-25708][SQL] HAVING without GROUP BY means global ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22696 Merged build finished. Test PASSed.
[GitHub] spark pull request #22379: [SPARK-25393][SQL] Adding new function from_csv()
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22379#discussion_r224648638 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVUtils.scala --- @@ -40,16 +40,6 @@ object CSVUtils { } } - /** - * Filter ignorable rows for CSV iterator (lines empty and starting with `comment`). - * This is currently being used in CSV reading path and CSV schema inference. - */ - def filterCommentAndEmpty(iter: Iterator[String], options: CSVOptions): Iterator[String] = { --- End diff -- Nope. It's under the execution package.
[GitHub] spark pull request #22676: [SPARK-25684][SQL] Organize header related codes ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22676
[GitHub] spark pull request #22379: [SPARK-25393][SQL] Adding new function from_csv()
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22379#discussion_r224648258 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVDataSource.scala --- @@ -254,7 +256,7 @@ object TextInputCSVDataSource extends CSVDataSource { val header = makeSafeHeader(firstRow, caseSensitive, parsedOptions) val sampled: Dataset[String] = CSVUtils.sample(csv, parsedOptions) val tokenRDD = sampled.rdd.mapPartitions { iter => - val filteredLines = CSVUtils.filterCommentAndEmpty(iter, parsedOptions) + val filteredLines = filterCommentAndEmpty(iter, parsedOptions) --- End diff -- Not a big deal, but let's keep the `CSVUtils...` usage for consistency in this file.
[GitHub] spark issue #22676: [SPARK-25684][SQL] Organize header related codes in CSV ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22676 Merged to master.
[GitHub] spark issue #22676: [SPARK-25684][SQL] Organize header related codes in CSV ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22676 Thank you @cloud-fan and @MaxGekk for reviewing this.
[GitHub] spark issue #22697: [SPARK-25700][SQL][BRANCH-2.4] Partially revert append m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22697 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97281/ Test PASSed.
[GitHub] spark issue #22697: [SPARK-25700][SQL][BRANCH-2.4] Partially revert append m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22697 Merged build finished. Test PASSed.
[GitHub] spark issue #22674: [SPARK-25680][SQL] SQL execution listener shouldn't happ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22674 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97277/ Test FAILed.
[GitHub] spark issue #22674: [SPARK-25680][SQL] SQL execution listener shouldn't happ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22674 Merged build finished. Test FAILed.
[GitHub] spark issue #22697: [SPARK-25700][SQL][BRANCH-2.4] Partially revert append m...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22697 **[Test build #97281 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97281/testReport)** for PR 22697 at commit [`b836625`](https://github.com/apache/spark/commit/b836625c0d4404d1ca885d172cef5f820efc187c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22674: [SPARK-25680][SQL] SQL execution listener shouldn't happ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22674 **[Test build #97277 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97277/testReport)** for PR 22674 at commit [`0bfc240`](https://github.com/apache/spark/commit/0bfc2408a5941d7da8d93582668ba77a7394ac66). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #21364: [SPARK-24317][SQL]Float-point numbers are displayed with...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21364 cc @srinathshankar @yuchenhuo
[GitHub] spark issue #22702: [SPARK-25714] Fix Null Handling in the Optimizer rule Bo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22702 **[Test build #97288 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97288/testReport)** for PR 22702 at commit [`a9359ab`](https://github.com/apache/spark/commit/a9359abff62017f46f33ef18d7f56f97c885af3d).
[GitHub] spark issue #22702: [SPARK-25714] Fix Null Handling in the Optimizer rule Bo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22702 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3907/ Test PASSed.
[GitHub] spark issue #22702: [SPARK-25714] Fix Null Handling in the Optimizer rule Bo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22702 Merged build finished. Test PASSed.
[GitHub] spark issue #22702: [SPARK-25714] Fix Null Handling in the Optimizer rule Bo...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22702 retest this please
[GitHub] spark issue #22696: [SPARK-25708][SQL] HAVING without GROUP BY means global ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22696 Merged build finished. Test PASSed.
[GitHub] spark issue #22696: [SPARK-25708][SQL] HAVING without GROUP BY means global ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22696 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97280/ Test PASSed.
[GitHub] spark issue #22696: [SPARK-25708][SQL] HAVING without GROUP BY means global ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22696 **[Test build #97280 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97280/testReport)** for PR 22696 at commit [`78a1689`](https://github.com/apache/spark/commit/78a1689ecd7854a11ba709853462897d5e0d1a28). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #22614: [SPARK-25561][SQL] Implement a new config to cont...
Github user tejasapatil commented on a diff in the pull request: https://github.com/apache/spark/pull/22614#discussion_r224639756 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala --- @@ -746,34 +746,20 @@ private[client] class Shim_v0_13 extends Shim_v0_12 { getAllPartitionsMethod.invoke(hive, table).asInstanceOf[JSet[Partition]] } else { logDebug(s"Hive metastore filter is '$filter'.") -val tryDirectSqlConfVar = HiveConf.ConfVars.METASTORE_TRY_DIRECT_SQL -// We should get this config value from the metaStore. otherwise hit SPARK-18681. -// To be compatible with hive-0.12 and hive-0.13, In the future we can achieve this by: -// val tryDirectSql = hive.getMetaConf(tryDirectSqlConfVar.varname).toBoolean -val tryDirectSql = hive.getMSC.getConfigValue(tryDirectSqlConfVar.varname, - tryDirectSqlConfVar.defaultBoolVal.toString).toBoolean try { // Hive may throw an exception when calling this method in some circumstances, such as - // when filtering on a non-string partition column when the hive config key - // hive.metastore.try.direct.sql is false + // when filtering on a non-string partition column. getPartitionsByFilterMethod.invoke(hive, table, filter) .asInstanceOf[JArrayList[Partition]] } catch { - case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] && - !tryDirectSql => + case ex: InvocationTargetException if ex.getCause.isInstanceOf[MetaException] => --- End diff -- @kmanamcheri: Let's do this: - We should prefer doing `getPartitionsByFilterMethod()`. If it fails, we retry with increasing delay across retries. - If retries are exhausted, we could fetch all the partitions of the table. Some people might not want this, so let's control it using a conf flag. For those who don't want it, the query could fail at this point. What do you think?
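The two-step proposal in this comment (retry the filtered fetch with increasing delay, then fall back to fetching everything only if a flag allows it) can be sketched generically. This is an illustration under stated assumptions, not the HiveShim implementation: `PartitionFetchSketch`, `fetchWithRetry`, and the parameters `maxRetries`, `initialDelayMs`, and `fallbackEnabled` are hypothetical names, and the doubling backoff is one possible reading of "increasing delay".

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

object PartitionFetchSketch {
  // Hypothetical helper: tries `filtered` up to maxRetries + 1 times, doubling the
  // delay between attempts. When retries are exhausted, it either falls back to
  // `fetchAll` (if fallbackEnabled) or rethrows the last failure.
  def fetchWithRetry[T](
      maxRetries: Int,
      initialDelayMs: Long,
      fallbackEnabled: Boolean,
      sleep: Long => Unit = (ms: Long) => Thread.sleep(ms))(
      filtered: () => T)(
      fetchAll: () => T): T = {
    @tailrec
    def loop(attempt: Int, delayMs: Long): T = Try(filtered()) match {
      case Success(value) => value
      case Failure(_) if attempt < maxRetries =>
        sleep(delayMs)                  // back off before the next attempt
        loop(attempt + 1, delayMs * 2)  // assumed policy: exponential increase
      case Failure(_) if fallbackEnabled => fetchAll()
      case Failure(error) => throw error
    }
    loop(0, initialDelayMs)
  }
}
```

The `sleep` parameter is injectable so the backoff schedule can be observed in tests without actually waiting; in real use, `fallbackEnabled` would be read from the conf flag suggested above.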
[GitHub] spark issue #22702: [SPARK-25714] Fix Null Handling in the Optimizer rule Bo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22702 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97284/ Test FAILed.
[GitHub] spark issue #22702: [SPARK-25714] Fix Null Handling in the Optimizer rule Bo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22702 Merged build finished. Test FAILed.
[GitHub] spark issue #22702: [SPARK-25714] Fix Null Handling in the Optimizer rule Bo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22702 **[Test build #97284 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97284/testReport)** for PR 22702 at commit [`a9359ab`](https://github.com/apache/spark/commit/a9359abff62017f46f33ef18d7f56f97c885af3d).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22661 Merged build finished. Test PASSed.
[GitHub] spark issue #22661: [SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use m...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22661 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3906/ Test PASSed.