[GitHub] [spark] c21 commented on a change in pull request #32528: [SPARK-35350][SQL] Add code-gen for left semi sort merge join
c21 commented on a change in pull request #32528: URL: https://github.com/apache/spark/pull/32528#discussion_r631610336

File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala

```
@@ -724,9 +749,32 @@ case class SortMergeJoinExec(
       """.stripMargin
   }

+  lazy val semiJoin = {
```

Review comment: @maropu - yes, I considered that at first but was worried that the number of parameters would become too many. I have refined the code a bit and updated it now.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] c21 commented on a change in pull request #32528: [SPARK-35350][SQL] Add code-gen for left semi sort merge join
c21 commented on a change in pull request #32528: URL: https://github.com/apache/spark/pull/32528#discussion_r631610541

File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala

```
@@ -424,8 +424,18 @@ case class SortMergeJoinExec(
     // A list to hold all matched rows from buffered side.
     val clsName = classOf[ExternalAppendOnlyUnsafeRowArray].getName
+    // Flag to only buffer first matched row, to avoid buffering unnecessary rows.
+    val onlyBufferFirstMatchedRow = (joinType, condition) match {
+      case (LeftSemi, None) => true
+      case _ => false
+    }
+    val inMemoryThreshold =
+      if (onlyBufferFirstMatchedRow) {
```

Review comment: Good call. Actually the non-code-gen path can also depend on this, so I made it just a `val` now.
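The buffering optimization discussed above hinges on left semi join semantics: each streamed row is emitted at most once, as soon as any buffered match is found, so when there is no extra join condition there is nothing to re-check against later matches and only the first one matters. A minimal Java sketch of that semantics (illustrative only, not Spark's implementation):

```java
import java.util.*;

public class SemiJoinSketch {
    // Hypothetical sketch of left semi join semantics: emit each streamed
    // row at most once, as soon as any buffered match exists. With no join
    // condition to evaluate per match, buffering more than the first
    // matched row from the buffered side is wasted work.
    static List<Integer> leftSemiJoin(List<Integer> streamed, Set<Integer> buffered) {
        List<Integer> out = new ArrayList<>();
        for (int row : streamed) {
            if (buffered.contains(row)) {  // the first match is enough
                out.add(row);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(leftSemiJoin(
            Arrays.asList(1, 2, 2, 3),
            new HashSet<>(Arrays.asList(2, 3))));  // [2, 2, 3]
    }
}
```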
[GitHub] [spark] c21 commented on pull request #32528: [SPARK-35350][SQL] Add code-gen for left semi sort merge join
c21 commented on pull request #32528: URL: https://github.com/apache/spark/pull/32528#issuecomment-840363206

To ease review: the changes to all plan files were generated with the following commands:
```
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *PlanStabilitySuite"
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *PlanStabilityWithStatsSuite"
```
None of them were updated manually.
[GitHub] [spark] dongjoon-hyun commented on pull request #32531: [SPARK-35394][K8S][BUILD] Move kubernetes-client.version to root pom file
dongjoon-hyun commented on pull request #32531: URL: https://github.com/apache/spark/pull/32531#issuecomment-840363273

The Python module UT failure is SPARK-35392 and will be fixed in the master branch via https://github.com/apache/spark/pull/32533.
[GitHub] [spark] SparkQA commented on pull request #32531: [SPARK-35394][K8S][BUILD] Move kubernetes-client.version to root pom file
SparkQA commented on pull request #32531: URL: https://github.com/apache/spark/pull/32531#issuecomment-840365448

**[Test build #138491 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138491/testReport)** for PR 32531 at commit [`c6ce0b7`](https://github.com/apache/spark/commit/c6ce0b720c114b962e73af1a989eb113df3ec70a).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA removed a comment on pull request #32531: [SPARK-35394][K8S][BUILD] Move kubernetes-client.version to root pom file
SparkQA removed a comment on pull request #32531: URL: https://github.com/apache/spark/pull/32531#issuecomment-840290823

**[Test build #138491 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138491/testReport)** for PR 32531 at commit [`c6ce0b7`](https://github.com/apache/spark/commit/c6ce0b720c114b962e73af1a989eb113df3ec70a).
[GitHub] [spark] dongjoon-hyun commented on pull request #32531: [SPARK-35394][K8S][BUILD] Move kubernetes-client.version to root pom file
dongjoon-hyun commented on pull request #32531: URL: https://github.com/apache/spark/pull/32531#issuecomment-840365894

Hi, @HyukjinKwon. Could you review this PR, please?
[GitHub] [spark] SparkQA commented on pull request #32532: [SPARK-35384][SQL][FOLLOWUP] Move `HashMap.get` out of `InvokeLike.invoke`
SparkQA commented on pull request #32532: URL: https://github.com/apache/spark/pull/32532#issuecomment-840373603
[GitHub] [spark] SparkQA commented on pull request #32533: [SPARK-35392][ML][PYTHON] Remove Flaky GMM Test in ml/clustering.py
SparkQA commented on pull request #32533: URL: https://github.com/apache/spark/pull/32533#issuecomment-840373369
[GitHub] [spark] cloud-fan commented on a change in pull request #32528: [SPARK-35350][SQL] Add code-gen for left semi sort merge join
cloud-fan commented on a change in pull request #32528: URL: https://github.com/apache/spark/pull/32528#discussion_r631619359

File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala

```
@@ -679,60 +704,124 @@ case class SortMergeJoinExec(
       (evaluateVariables(streamedVars), "")
     }
 
-    val thisPlan = ctx.addReferenceObj("plan", this)
-    val eagerCleanup = s"$thisPlan.cleanupResources();"
-
-    lazy val innerJoin =
+    val beforeLoop =
       s"""
-         |while (findNextJoinRows($streamedInput, $bufferedInput)) {
-         |  ${streamedVarDecl.mkString("\n")}
-         |  ${beforeLoop.trim}
-         |  scala.collection.Iterator $iterator = $matches.generateIterator();
-         |  while ($iterator.hasNext()) {
-         |    InternalRow $bufferedRow = (InternalRow) $iterator.next();
-         |    ${condCheck.trim}
-         |    $numOutput.add(1);
-         |    ${consume(ctx, resultVars)}
-         |  }
-         |  if (shouldStop()) return;
-         |}
-         |$eagerCleanup
-       """.stripMargin
-
-    lazy val outerJoin = {
-      val hasOutputRow = ctx.freshName("hasOutputRow")
+         |${streamedVarDecl.mkString("\n")}
+         |${streamedBeforeLoop.trim}
+         |scala.collection.Iterator $iterator = $matches.generateIterator();
+       """.stripMargin
+    val outputRow =
       s"""
-         |while ($streamedInput.hasNext()) {
-         |  findNextJoinRows($streamedInput, $bufferedInput);
-         |  ${streamedVarDecl.mkString("\n")}
-         |  ${beforeLoop.trim}
-         |  scala.collection.Iterator $iterator = $matches.generateIterator();
-         |  boolean $hasOutputRow = false;
-         |
-         |  // the last iteration of this loop is to emit an empty row if there is no matched rows.
-         |  while ($iterator.hasNext() || !$hasOutputRow) {
-         |    InternalRow $bufferedRow = $iterator.hasNext() ?
-         |      (InternalRow) $iterator.next() : null;
-         |    ${condCheck.trim}
-         |    $hasOutputRow = true;
-         |    $numOutput.add(1);
-         |    ${consume(ctx, resultVars)}
-         |  }
-         |  if (shouldStop()) return;
-         |}
-         |$eagerCleanup
+         |$numOutput.add(1);
+         |${consume(ctx, resultVars)}
       """.stripMargin
-    }
+    val findNextJoinRows = s"findNextJoinRows($streamedInput, $bufferedInput)"
+    val thisPlan = ctx.addReferenceObj("plan", this)
+    val eagerCleanup = s"$thisPlan.cleanupResources();"
 
     joinType match {
-      case _: InnerLike => innerJoin
-      case LeftOuter | RightOuter => outerJoin
+      case _: InnerLike =>
+        codegenInner(findNextJoinRows, beforeLoop, iterator, bufferedRow, condCheck, outputRow,
```

Review comment: shall we pass `beforeLoop.trim` so that we don't need to do it in all three methods?
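The suggestion above is a plain hoisting refactor: trim the shared code fragment once at the call site instead of repeating `.trim` inside each of the three codegen helpers. A toy Java sketch of the idea (names are hypothetical, not Spark's):

```java
public class TrimOnce {
    // Illustrative only: the helper assumes its input fragment is already
    // trimmed, so the caller performs .trim() exactly once and none of the
    // (here, just one) code-generating helpers need to repeat it.
    static String codegenInner(String beforeLoop) {
        // beforeLoop is assumed pre-trimmed by the caller
        return "while (findNextJoinRows) {\n  " + beforeLoop + "\n}";
    }

    public static void main(String[] args) {
        String beforeLoop = "  init();  ";
        // hoisted: trim once here instead of inside every helper
        System.out.println(codegenInner(beforeLoop.trim()));
    }
}
```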
[GitHub] [spark] AmplabJenkins commented on pull request #32515: [SPARK-35380][SQL] Loading SparkSessionExtensions from ServiceLoader
AmplabJenkins commented on pull request #32515: URL: https://github.com/apache/spark/pull/32515#issuecomment-840374735

Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138482/
[GitHub] [spark] AmplabJenkins commented on pull request #32531: [SPARK-35394][K8S][BUILD] Move kubernetes-client.version to root pom file
AmplabJenkins commented on pull request #32531: URL: https://github.com/apache/spark/pull/32531#issuecomment-840374732

Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138491/
[GitHub] [spark] AmplabJenkins commented on pull request #32532: [SPARK-35384][SQL][FOLLOWUP] Move `HashMap.get` out of `InvokeLike.invoke`
AmplabJenkins commented on pull request #32532: URL: https://github.com/apache/spark/pull/32532#issuecomment-840374733

Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43019/
[GitHub] [spark] cloud-fan commented on a change in pull request #32528: [SPARK-35350][SQL] Add code-gen for left semi sort merge join
cloud-fan commented on a change in pull request #32528: URL: https://github.com/apache/spark/pull/32528#discussion_r631620203

File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala

```
@@ -679,60 +704,124 @@ case class SortMergeJoinExec(
       (evaluateVariables(streamedVars), "")
     }
 
-    val thisPlan = ctx.addReferenceObj("plan", this)
-    val eagerCleanup = s"$thisPlan.cleanupResources();"
-
-    lazy val innerJoin =
+    val beforeLoop =
       s"""
-         |while (findNextJoinRows($streamedInput, $bufferedInput)) {
-         |  ${streamedVarDecl.mkString("\n")}
-         |  ${beforeLoop.trim}
-         |  scala.collection.Iterator $iterator = $matches.generateIterator();
-         |  while ($iterator.hasNext()) {
-         |    InternalRow $bufferedRow = (InternalRow) $iterator.next();
-         |    ${condCheck.trim}
-         |    $numOutput.add(1);
-         |    ${consume(ctx, resultVars)}
-         |  }
-         |  if (shouldStop()) return;
-         |}
-         |$eagerCleanup
-       """.stripMargin
-
-    lazy val outerJoin = {
-      val hasOutputRow = ctx.freshName("hasOutputRow")
+         |${streamedVarDecl.mkString("\n")}
+         |${streamedBeforeLoop.trim}
+         |scala.collection.Iterator $iterator = $matches.generateIterator();
+       """.stripMargin
+    val outputRow =
       s"""
-         |while ($streamedInput.hasNext()) {
-         |  findNextJoinRows($streamedInput, $bufferedInput);
-         |  ${streamedVarDecl.mkString("\n")}
-         |  ${beforeLoop.trim}
-         |  scala.collection.Iterator $iterator = $matches.generateIterator();
-         |  boolean $hasOutputRow = false;
-         |
-         |  // the last iteration of this loop is to emit an empty row if there is no matched rows.
-         |  while ($iterator.hasNext() || !$hasOutputRow) {
-         |    InternalRow $bufferedRow = $iterator.hasNext() ?
-         |      (InternalRow) $iterator.next() : null;
-         |    ${condCheck.trim}
-         |    $hasOutputRow = true;
-         |    $numOutput.add(1);
-         |    ${consume(ctx, resultVars)}
-         |  }
-         |  if (shouldStop()) return;
-         |}
-         |$eagerCleanup
+         |$numOutput.add(1);
+         |${consume(ctx, resultVars)}
       """.stripMargin
-    }
+    val findNextJoinRows = s"findNextJoinRows($streamedInput, $bufferedInput)"
+    val thisPlan = ctx.addReferenceObj("plan", this)
+    val eagerCleanup = s"$thisPlan.cleanupResources();"
 
     joinType match {
-      case _: InnerLike => innerJoin
-      case LeftOuter | RightOuter => outerJoin
+      case _: InnerLike =>
+        codegenInner(findNextJoinRows, beforeLoop, iterator, bufferedRow, condCheck, outputRow,
+          eagerCleanup)
+      case LeftOuter | RightOuter =>
+        codegenOuter(streamedInput, findNextJoinRows, beforeLoop, iterator, bufferedRow, condCheck,
+          ctx.freshName("hasOutputRow"), outputRow, eagerCleanup)
+      case LeftSemi =>
+        codegenSemi(findNextJoinRows, beforeLoop, iterator, bufferedRow, condCheck,
+          ctx.freshName("hasOutputRow"), outputRow, eagerCleanup)
       case x =>
         throw new IllegalArgumentException(
          s"SortMergeJoin.doProduce should not take $x as the JoinType")
     }
   }
 
+  /**
+   * Generates the code for Inner join.
+   */
+  private def codegenInner(
+      findNextJoinRows: String,
+      beforeLoop: String,
+      matchIterator: String,
+      bufferedRow: String,
+      conditionCheck: String,
+      outputRow: String,
+      eagerCleanup: String): String = {
+    s"""
+       |while ($findNextJoinRows) {
+       |  ${beforeLoop.trim}
+       |  while ($matchIterator.hasNext()) {
+       |    InternalRow $bufferedRow = (InternalRow) $matchIterator.next();
+       |    ${conditionCheck.trim}
+       |    $outputRow
+       |  }
+       |  if (shouldStop()) return;
+       |}
+       |$eagerCleanup
+     """.stripMargin
+  }
+
+  /**
+   * Generates the code for Left or Right Outer join.
+   */
+  private def codegenOuter(
+      streamedInput: String,
+      findNextJoinRows: String,
+      beforeLoop: String,
+      matchIterator: String,
+      bufferedRow: String,
+      conditionCheck: String,
+      hasOutputRow: String,
+      outputRow: String,
+      eagerCleanup: String): String = {
+    s"""
+       |while ($streamedInput.hasNext()) {
+       |  $findNextJoinRows;
+       |  ${beforeLoop.trim}
+       |  boolean $hasOutputRow = false;
+       |
+       |  // the last iteration of this loop is to emit an empty row if there is no matched rows.
+       |  while ($matchIterator.hasNext() || !$hasOutputRow) {
+       |    InternalRow $bufferedRow = $matchIterator.hasNext() ?
+       |      (InternalRow) $matchIterator.next() : null;
+       |    ${conditionCheck.trim}
+       |    $hasOutputRow = true;
+       |    $outputRow
+       |  }
+       |  if (shouldStop()) return;
+       |}
```
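For a left semi join, the loop the new `codegenSemi` helper would emit differs from the inner-join loop above in one way: it stops scanning the buffered matches as soon as one row passes the condition, emitting the streamed row at most once. A hedged plain-Java sketch of that loop shape (variable names echo the generated code, but this is not Spark's actual generated source):

```java
import java.util.*;

public class SemiLoopShape {
    // Illustrative shape of the generated left semi loop: for each streamed
    // row, walk its match iterator, but set hasOutputRow after the first
    // match so the loop exits and the streamed row is emitted exactly once.
    static List<String> semiLoop(Map<String, List<String>> matchesByStreamedRow) {
        List<String> output = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : matchesByStreamedRow.entrySet()) {
            Iterator<String> iterator = e.getValue().iterator(); // $matches.generateIterator()
            boolean hasOutputRow = false;
            while (iterator.hasNext() && !hasOutputRow) {
                iterator.next();          // conditionCheck would run here
                hasOutputRow = true;      // stop after the first passing match
                output.add(e.getKey());   // outputRow: emit the streamed row
            }
        }
        return output;
    }

    public static void main(String[] args) {
        Map<String, List<String>> m = new LinkedHashMap<>();
        m.put("a", Arrays.asList("a1", "a2")); // two matches, still emitted once
        m.put("b", Collections.emptyList());   // no match, not emitted
        m.put("c", Arrays.asList("c1"));
        System.out.println(semiLoop(m));       // [a, c]
    }
}
```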
[GitHub] [spark] AmplabJenkins commented on pull request #32533: [SPARK-35392][ML][PYTHON] Remove Flaky GMM Test in ml/clustering.py
AmplabJenkins commented on pull request #32533: URL: https://github.com/apache/spark/pull/32533#issuecomment-840374731
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32533: [SPARK-35392][ML][PYTHON] Remove Flaky GMM Test in ml/clustering.py
AmplabJenkins removed a comment on pull request #32533: URL: https://github.com/apache/spark/pull/32533#issuecomment-840374731
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32532: [SPARK-35384][SQL][FOLLOWUP] Move `HashMap.get` out of `InvokeLike.invoke`
AmplabJenkins removed a comment on pull request #32532: URL: https://github.com/apache/spark/pull/32532#issuecomment-840374733

Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43019/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32531: [SPARK-35394][K8S][BUILD] Move kubernetes-client.version to root pom file
AmplabJenkins removed a comment on pull request #32531: URL: https://github.com/apache/spark/pull/32531#issuecomment-840374732

Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138491/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32515: [SPARK-35380][SQL] Loading SparkSessionExtensions from ServiceLoader
AmplabJenkins removed a comment on pull request #32515: URL: https://github.com/apache/spark/pull/32515#issuecomment-840374735

Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138482/
[GitHub] [spark] cloud-fan commented on pull request #32411: [SPARK-28551][SQL] CTAS with LOCATION should not allow to a non-empty directory.
cloud-fan commented on pull request #32411: URL: https://github.com/apache/spark/pull/32411#issuecomment-840376552

The new behavior makes sense to me. I'm wondering what we should do for data source tables, e.g.:
```
CREATE TABLE t USING jdbc OPTIONS (... dbtable="foo") AS SELECT ...
```
If the "foo" table in JDBC is not empty, what should we do?
[GitHub] [spark] SparkQA commented on pull request #32478: [SPARK-35063][SQL] Group exception messages in sql/catalyst
SparkQA commented on pull request #32478: URL: https://github.com/apache/spark/pull/32478#issuecomment-840376956

**[Test build #138501 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138501/testReport)** for PR 32478 at commit [`4509cbb`](https://github.com/apache/spark/commit/4509cbbbe00debe19b670c463c85fb7fe295c8c3).
[GitHub] [spark] SparkQA commented on pull request #32528: [SPARK-35350][SQL] Add code-gen for left semi sort merge join
SparkQA commented on pull request #32528: URL: https://github.com/apache/spark/pull/32528#issuecomment-840376871

**[Test build #138500 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138500/testReport)** for PR 32528 at commit [`979c759`](https://github.com/apache/spark/commit/979c75928b0c2ca95fdf4820531a1cb6461129a6).
[GitHub] [spark] SparkQA commented on pull request #32292: [SPARK-35162][SQL] New SQL functions: TRY_ADD/TRY_DIVIDE
SparkQA commented on pull request #32292: URL: https://github.com/apache/spark/pull/32292#issuecomment-840377216

**[Test build #138502 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138502/testReport)** for PR 32292 at commit [`91bdc4c`](https://github.com/apache/spark/commit/91bdc4cb30d911684cc693e9a7cd8b674e53e2e1).
[GitHub] [spark] vinodkc commented on a change in pull request #32411: [SPARK-28551][SQL] CTAS with LOCATION should not allow to a non-empty directory.
vinodkc commented on a change in pull request #32411: URL: https://github.com/apache/spark/pull/32411#discussion_r631622569

File path: sql/core/src/main/scala/org/apache/spark/sql/execution/command/DataWritingCommand.scala

```
@@ -96,4 +98,23 @@ object DataWritingCommand {
       sparkContext.getLocalProperty(SQLExecution.EXECUTION_ID_KEY),
       metrics.values.toSeq)
   }
+
+  def assertEmptyRootPath(tablePath: URI, saveMode: SaveMode, sparkSession: SparkSession) {
+    if (saveMode != SaveMode.Overwrite &&
+        !sparkSession.sqlContext.conf.allowCreateTableAsSelectNonEmptyLocation) {
+      val filePath = new org.apache.hadoop.fs.Path(tablePath)
+      val fs = filePath.getFileSystem(sparkSession.sparkContext.hadoopConfiguration)
+      if (fs != null && fs.exists(filePath)) {
+        val locStats = fs.getFileStatus(filePath)
+        if (locStats != null && locStats.isDirectory) {
+          val lStats = fs.listStatus(filePath)
+          if (lStats != null && lStats.length != 0) {
+            throw new AnalysisException(
+              s"CREATE-TABLE-AS-SELECT cannot create table" +
```

Review comment: In Hive (https://github.com/apache/hive/blob/master/common/src/java/org/apache/hadoop/hive/ql/ErrorMsg.java#L408), the error message is `CREATE-TABLE-AS-SELECT cannot create table with location to a non-empty directory.` To be consistent, shall we show this message? "CREATE-TABLE-AS-SELECT cannot create table with location to a non-empty directory. To disable the behavior, set 'spark.sql.legacy.allowNonEmptyLocationInCTAS' to true"
[GitHub] [spark] cloud-fan commented on a change in pull request #32496: [SPARK-35207][SQL] Normalize hash function behavior with negative zero
cloud-fan commented on a change in pull request #32496: URL: https://github.com/apache/spark/pull/32496#discussion_r631622732

File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala

```
@@ -369,11 +369,25 @@ abstract class HashExpression[E] extends Expression {
   protected def genHashBoolean(input: String, result: String): String =
     genHashInt(s"$input ? 1 : 0", result)
 
-  protected def genHashFloat(input: String, result: String): String =
-    genHashInt(s"Float.floatToIntBits($input)", result)
+  protected def genHashFloat(input: String, result: String): String = {
+    s"""
+       |if(Float.floatToIntBits($input) == Float.floatToIntBits(-0.0f)) {
```

Review comment: +1, `$input == 0.0f` should be good enough.
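The review exchange above turns on a Java floating-point subtlety: `-0.0f == 0.0f` is true, but `Float.floatToIntBits` gives the two zeros different bit patterns, so hashing raw bits would hash equal values differently. A small standalone sketch of the normalization being discussed (an assumption about the patch's intent, not its exact code):

```java
public class NegativeZeroHash {
    // -0.0f and 0.0f compare equal, yet their IEEE 754 bit patterns differ
    // (0x80000000 vs 0x00000000), so hashing the bits directly would break
    // the "equal values hash equally" contract. The `input == 0.0f` check
    // matches both zeros, canonicalizing them to +0.0f before taking bits.
    static int normalizedFloatBits(float input) {
        if (input == 0.0f) {   // true for both -0.0f and 0.0f
            input = 0.0f;      // canonicalize to positive zero
        }
        return Float.floatToIntBits(input);
    }

    public static void main(String[] args) {
        System.out.println(Float.floatToIntBits(-0.0f) == Float.floatToIntBits(0.0f)); // false
        System.out.println(normalizedFloatBits(-0.0f) == normalizedFloatBits(0.0f));   // true
    }
}
```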
[GitHub] [spark] cloud-fan commented on a change in pull request #32496: [SPARK-35207][SQL] Normalize hash function behavior with negative zero
cloud-fan commented on a change in pull request #32496: URL: https://github.com/apache/spark/pull/32496#discussion_r631623223

File path: sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala

```
@@ -654,4 +654,30 @@ class WholeStageCodegenSuite extends QueryTest with SharedSparkSession
         }
       }
     }
   }
+
+  test("SPARK-35207: Compute hash consistent between -0.0 and 0.0 doubles with Codegen") {
```

Review comment: +1, if you use `checkEvaluation`, both codegen and interpreted are checked.
[GitHub] [spark] SparkQA commented on pull request #32410: [SPARK-35286][SQL] Replace SessionState.start with SessionState.setCurrentSessionState
SparkQA commented on pull request #32410: URL: https://github.com/apache/spark/pull/32410#issuecomment-840378127

**[Test build #138496 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138496/testReport)** for PR 32410 at commit [`4bca8ec`](https://github.com/apache/spark/commit/4bca8ecaec066ef19d04a12e134ba830320a2e0f).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA removed a comment on pull request #32410: [SPARK-35286][SQL] Replace SessionState.start with SessionState.setCurrentSessionState
SparkQA removed a comment on pull request #32410: URL: https://github.com/apache/spark/pull/32410#issuecomment-840310594

**[Test build #138496 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138496/testReport)** for PR 32410 at commit [`4bca8ec`](https://github.com/apache/spark/commit/4bca8ecaec066ef19d04a12e134ba830320a2e0f).
[GitHub] [spark] AmplabJenkins commented on pull request #32410: [SPARK-35286][SQL] Replace SessionState.start with SessionState.setCurrentSessionState
AmplabJenkins commented on pull request #32410: URL: https://github.com/apache/spark/pull/32410#issuecomment-840378916

Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138496/
[GitHub] [spark] cloud-fan commented on a change in pull request #32482: [SPARK-35332][SQL] Make cache plan disable configs configurable
cloud-fan commented on a change in pull request #32482: URL: https://github.com/apache/spark/pull/32482#discussion_r631624087

## File path: sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala

```scala
@@ -1554,4 +1554,65 @@ class CachedTableSuite extends QueryTest with SQLTestUtils
     assert(!spark.catalog.isCached(viewName))
   }
 }
+
+  test("SPARK-35332: Make cache plan disable configs configurable - check AQE") {
+    withSQLConf(SQLConf.SHUFFLE_PARTITIONS.key -> "2",
+      SQLConf.COALESCE_PARTITIONS_MIN_PARTITION_NUM.key -> "1",
+      SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true") {
+
+      withTempView("t1", "t2", "t3") {
+        withSQLConf(SQLConf.CAN_CHANGE_CACHED_PLAN_OUTPUT_PARTITIONING.key -> "false") {
+          sql("CACHE TABLE t1 as SELECT /*+ REPARTITION */ * FROM values(1) as t(c)")
+          assert(spark.table("t1").rdd.partitions.length == 2)
+
+          withSQLConf(SQLConf.CAN_CHANGE_CACHED_PLAN_OUTPUT_PARTITIONING.key -> "true") {
+            assert(spark.table("t1").rdd.partitions.length == 2)
+            sql("CACHE TABLE t2 as SELECT /*+ REPARTITION */ * FROM values(2) as t(c)")
+            assert(spark.table("t2").rdd.partitions.length == 1)
+
+            withSQLConf(SQLConf.CAN_CHANGE_CACHED_PLAN_OUTPUT_PARTITIONING.key -> "false") {
```

Review comment: why is this one nested?
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32410: [SPARK-35286][SQL] Replace SessionState.start with SessionState.setCurrentSessionState
AmplabJenkins removed a comment on pull request #32410: URL: https://github.com/apache/spark/pull/32410#issuecomment-840378916 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138496/
[GitHub] [spark] cloud-fan commented on a change in pull request #32482: [SPARK-35332][SQL] Make cache plan disable configs configurable
cloud-fan commented on a change in pull request #32482: URL: https://github.com/apache/spark/pull/32482#discussion_r631624820

## File path: sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala

```scala
@@ -1554,4 +1554,65 @@ class CachedTableSuite extends QueryTest with SQLTestUtils
     assert(!spark.catalog.isCached(viewName))
   }
 }
+
+  test("SPARK-35332: Make cache plan disable configs configurable - check AQE") {
+    withSQLConf(SQLConf.SHUFFLE_PARTITIONS.key -> "2",
+      SQLConf.COALESCE_PARTITIONS_MIN_PARTITION_NUM.key -> "1",
+      SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true") {
+
+      withTempView("t1", "t2", "t3") {
+        withSQLConf(SQLConf.CAN_CHANGE_CACHED_PLAN_OUTPUT_PARTITIONING.key -> "false") {
+          sql("CACHE TABLE t1 as SELECT /*+ REPARTITION */ * FROM values(1) as t(c)")
+          assert(spark.table("t1").rdd.partitions.length == 2)
+
+          withSQLConf(SQLConf.CAN_CHANGE_CACHED_PLAN_OUTPUT_PARTITIONING.key -> "true") {
+            assert(spark.table("t1").rdd.partitions.length == 2)
+            sql("CACHE TABLE t2 as SELECT /*+ REPARTITION */ * FROM values(2) as t(c)")
+            assert(spark.table("t2").rdd.partitions.length == 1)
+
+            withSQLConf(SQLConf.CAN_CHANGE_CACHED_PLAN_OUTPUT_PARTITIONING.key -> "false") {
```

Review comment: I'm expecting something like
```
withSQLConf(SQLConf.CAN_CHANGE_CACHED_PLAN_OUTPUT_PARTITIONING.key -> "false") {
  test1
}
withSQLConf(SQLConf.CAN_CHANGE_CACHED_PLAN_OUTPUT_PARTITIONING.key -> "true") {
  test2
}
withSQLConf(SQLConf.CAN_CHANGE_CACHED_PLAN_OUTPUT_PARTITIONING.key -> "false") {
  test3
}
```

## File path: sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala

```scala
@@ -1554,4 +1554,65 @@ class CachedTableSuite extends QueryTest with SQLTestUtils
     assert(!spark.catalog.isCached(viewName))
   }
 }
+
+  test("SPARK-35332: Make cache plan disable configs configurable - check AQE") {
+    withSQLConf(SQLConf.SHUFFLE_PARTITIONS.key -> "2",
+      SQLConf.COALESCE_PARTITIONS_MIN_PARTITION_NUM.key -> "1",
+      SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true") {
+
+      withTempView("t1", "t2", "t3") {
+        withSQLConf(SQLConf.CAN_CHANGE_CACHED_PLAN_OUTPUT_PARTITIONING.key -> "false") {
+          sql("CACHE TABLE t1 as SELECT /*+ REPARTITION */ * FROM values(1) as t(c)")
+          assert(spark.table("t1").rdd.partitions.length == 2)
+
+          withSQLConf(SQLConf.CAN_CHANGE_CACHED_PLAN_OUTPUT_PARTITIONING.key -> "true") {
+            assert(spark.table("t1").rdd.partitions.length == 2)
+            sql("CACHE TABLE t2 as SELECT /*+ REPARTITION */ * FROM values(2) as t(c)")
+            assert(spark.table("t2").rdd.partitions.length == 1)
+
+            withSQLConf(SQLConf.CAN_CHANGE_CACHED_PLAN_OUTPUT_PARTITIONING.key -> "false") {
+              assert(spark.table("t1").rdd.partitions.length == 2)
+              assert(spark.table("t2").rdd.partitions.length == 1)
+              sql("CACHE TABLE t3 as SELECT /*+ REPARTITION */ * FROM values(3) as t(c)")
+              assert(spark.table("t3").rdd.partitions.length == 2)
+            }
+          }
+        }
+      }
+    }
+  }
+
+  test("SPARK-35332: Make cache plan disable configs configurable - check bucket scan") {
+    withTable("t1", "t2", "t3") {
+      Seq(1, 2, 3).foreach { i =>
+        spark.range(1, 2)
+          .write
+          .format("parquet")
+          .bucketBy(2, "id")
+          .saveAsTable(s"t$i")
+      }
+
+      withCache("t1", "t2", "t3") {
+        withSQLConf(SQLConf.BUCKETING_ENABLED.key -> "true",
+          SQLConf.FILES_MIN_PARTITION_NUM.key -> "1",
+          SQLConf.CAN_CHANGE_CACHED_PLAN_OUTPUT_PARTITIONING.key -> "false") {
+          sql("CACHE TABLE t1")
+          assert(spark.table("t1").rdd.partitions.length == 2)
+
+          withSQLConf(SQLConf.CAN_CHANGE_CACHED_PLAN_OUTPUT_PARTITIONING.key -> "true") {
+            assert(spark.table("t1").rdd.partitions.length == 2)
+            sql("CACHE TABLE t2")
+            assert(spark.table("t2").rdd.partitions.length == 1)
+
+            withSQLConf(SQLConf.CAN_CHANGE_CACHED_PLAN_OUTPUT_PARTITIONING.key -> "false") {
```

Review comment: ditto
[GitHub] [spark] AngersZhuuuu commented on a change in pull request #32365: [SPARK-35228][SQL] Add expression ToPrettyString for keep consistent between hive/spark format in df.show and transform
AngersZh commented on a change in pull request #32365: URL: https://github.com/apache/spark/pull/32365#discussion_r631625554

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala

```scala
@@ -2649,3 +2652,78 @@ case class Sentences(
   copy(str = newFirst, language = newSecond, country = newThird)
 }
+
+case class ToPrettyString(child: Expression, timeZoneId: Option[String] = None)
+  extends UnaryExpression with TimeZoneAwareExpression {
+  import ToPrettyString._
+
+  override def withTimeZone(timeZoneId: String): TimeZoneAwareExpression =
+    copy(timeZoneId = Option(timeZoneId))
+
+  override def dataType: DataType = StringType
+
+  private val timeFormatters: TimeFormatters =
+    TimeFormatters(DateFormatter(zoneId), TimestampFormatter.getFractionFormatter(zoneId))
+
+  override def nullSafeEval(input: Any): Any = {
+    UTF8String.fromString(toHiveString((input, child.dataType), false, timeFormatters))
+  }
+
+  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
+    nullSafeCodeGen(ctx, ev, eval => {
+      val toHiveString = ToPrettyString.getClass.getName.stripSuffix("$")
+      val tuple2 = Tuple2.getClass.getName.stripSuffix("$")
+      val dataType = JavaCode.global(
+        ctx.addReferenceObj("dataType", child.dataType),
+        child.dataType.getClass)
+      val formatter = JavaCode.global(
+        ctx.addReferenceObj("dateFormatter", timeFormatters),
+        timeFormatters.getClass)
+      s"""${ev.value} = UTF8String.fromString($toHiveString.toHiveString(
+         |$tuple2.apply($eval, ${dataType}), false, $formatter));""".stripMargin
+    })
+  }
+
+  override def prettyName: String = "to_hive_string"
+
+  override protected def withNewChildInternal(newChild: Expression): Expression =
+    ToPrettyString(newChild)
+}
+
+object ToPrettyString {
+  case class TimeFormatters(date: DateFormatter, timestamp: TimestampFormatter)
+
+  def toHiveString(
+      a: (Any, DataType),
+      nested: Boolean,
+      formatters: TimeFormatters): String = a match {
```

Review comment:
> This is inefficient in codegen, as we know the data type ahead but we are still doing runtime pattern match.
>
> Can we follow `Cast` to implement the expression?

Done, can you recheck this?
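The reviewer's point is that the data type is known when the expression is created, so the per-row pattern match can be resolved once up front, the way `Cast` does. A minimal sketch of that pattern, in plain Java and with illustrative names (`PrettyFormatter` is not a Spark class):

```java
import java.util.function.Function;

// Sketch: resolve a per-type formatting function once at construction time,
// instead of pattern-matching on the data type for every row. The set of
// handled "types" here is invented for illustration only.
class PrettyFormatter {
    private final Function<Object, String> fn;

    PrettyFormatter(String dataType) {
        // This branch runs once per expression, not once per row.
        switch (dataType) {
            case "null":
                fn = v -> "NULL";
                break;
            case "binary":
                fn = v -> {
                    StringBuilder sb = new StringBuilder("[");
                    byte[] bytes = (byte[]) v;
                    for (int i = 0; i < bytes.length; i++) {
                        if (i > 0) sb.append(' ');
                        sb.append(String.format("%02X", bytes[i]));
                    }
                    return sb.append(']').toString();
                };
                break;
            default:
                fn = String::valueOf;
        }
    }

    // The hot path is a plain function call with no type dispatch.
    String format(Object value) {
        return fn.apply(value);
    }
}
```

The same idea applies to generated code: the codegen template can emit the type-specific formatting call directly, so the runtime match disappears entirely.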
[GitHub] [spark] dongjoon-hyun commented on pull request #32531: [SPARK-35394][K8S][BUILD] Move kubernetes-client.version to root pom file
dongjoon-hyun commented on pull request #32531: URL: https://github.com/apache/spark/pull/32531#issuecomment-840382268 Oh, thank you so much, @viirya ! Merged to master.
[GitHub] [spark] dongjoon-hyun closed pull request #32531: [SPARK-35394][K8S][BUILD] Move kubernetes-client.version to root pom file
dongjoon-hyun closed pull request #32531: URL: https://github.com/apache/spark/pull/32531
[GitHub] [spark] xuechendi opened a new pull request #32534: [WIP][SPARK-35396]Add AutoCloseable close to BlockManager and InMemoryRelation
xuechendi opened a new pull request #32534: URL: https://github.com/apache/spark/pull/32534 This PR proposes an add-on to support manually closing entries in MemoryStore and InMemoryRelation. ### What changes were proposed in this pull request? Currently, MemoryStore uses a LinkedHashMap[BlockId, MemoryEntry[_]] to store all on-heap or off-heap entries, and when memoryStore.remove(blockId) is called, the code simply removes the entry from the LinkedHashMap and relies on Java GC to do the release work. This PR proposes an add-on to manually close any object stored in MemoryStore and InMemoryRelation if that object extends AutoCloseable. Verification: In our own use case, we implemented a user-defined off-heap hash relation for BHJ, and we verified that with this manual close we can make sure our off-heap hash relation is released when evict is called. We also implemented a user-defined cachedBatch that is released when InMemoryRelation.clearCache() is called by this PR. ### Why are the changes needed? These changes help clean up off-heap user-defined objects that may be cached in InMemoryRelation or MemoryStore. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? WIP Signed-off-by: Chendi Xue
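The evict-time cleanup the PR describes can be sketched in a few lines. This is a hypothetical model, not Spark's actual MemoryStore (names like `SimpleMemoryStore` are invented): on removal, close the entry eagerly if it owns resources instead of waiting for GC.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of close-on-evict: entries that implement AutoCloseable
// are closed at removal time rather than left for the garbage collector.
class SimpleMemoryStore {
    private final Map<String, Object> entries = new LinkedHashMap<>();

    void put(String blockId, Object entry) {
        entries.put(blockId, entry);
    }

    // Removes the entry and, if it holds off-heap/native resources, closes it now.
    boolean remove(String blockId) {
        Object entry = entries.remove(blockId);
        if (entry == null) {
            return false;
        }
        if (entry instanceof AutoCloseable) {
            try {
                ((AutoCloseable) entry).close();
            } catch (Exception e) {
                // Swallow and log in practice: eviction should not fail
                // just because close() threw.
            }
        }
        return true;
    }
}
```

The design point is that GC only reclaims heap memory; off-heap allocations need an explicit `close()` hook, which is exactly what the `AutoCloseable` check provides.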
[GitHub] [spark] AmplabJenkins commented on pull request #32534: [WIP][SPARK-35396]Add AutoCloseable close to BlockManager and InMemoryRelation
AmplabJenkins commented on pull request #32534: URL: https://github.com/apache/spark/pull/32534#issuecomment-840383881 Can one of the admins verify this patch?
[GitHub] [spark] viirya opened a new pull request #32535: [SPARK-35397][SQL] Replace sys.err usage with explicit exception type
viirya opened a new pull request #32535: URL: https://github.com/apache/spark/pull/32535 ### What changes were proposed in this pull request? This patch replaces `sys.err` usages with explicit exception types. ### Why are the changes needed? Motivated by the previous comment https://github.com/apache/spark/pull/32519#discussion_r630787080, it sounds better to replace `sys.err` usages with explicit exception types. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests.
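The motivation is that Scala's `sys.error` throws a bare `RuntimeException`, so callers cannot distinguish failure modes by exception type. A rough Java analogue of the change (names like `JoinPlanner`/`planNaturalJoin` are illustrative, not Spark's API):

```java
// Illustrative only: throwing a specific exception type instead of a generic
// RuntimeException makes the failure mode explicit and selectively catchable.
class JoinPlanner {
    static String planNaturalJoin(String joinType) {
        switch (joinType) {
            case "inner":
            case "leftouter":
                return "planned:" + joinType;
            default:
                // Before: throw new RuntimeException("Unsupported natural join type " + joinType)
                throw new UnsupportedOperationException(
                    "Unsupported natural join type " + joinType);
        }
    }
}
```

In Spark the replacement goes one step further and routes these through centralized error factories (e.g. the `QueryExecutionErrors` call visible in the review comment below), which keeps error messages consistent across the codebase.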
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #32532: [SPARK-35384][SQL][FOLLOWUP] Move `HashMap.get` out of `InvokeLike.invoke`
dongjoon-hyun commented on a change in pull request #32532: URL: https://github.com/apache/spark/pull/32532#discussion_r631629800

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala

```scala
@@ -145,12 +151,7 @@ trait InvokeLike extends Expression with NonSQLExpression {
       case e: java.lang.reflect.InvocationTargetException if e.getCause != null => throw e.getCause
     }
-    val boxedClass = ScalaReflection.typeBoxedJavaMapping.get(dataType)
-    if (boxedClass.isDefined) {
-      boxedClass.get.cast(ret)
-    } else {
-      ret
-    }
+    boxingFn(ret)
```

Review comment:
> This could speed up the performance a bit.

Are you going to attach the new benchmark result?
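The diff above hoists the `HashMap.get` on the boxed-type mapping out of the per-row `invoke` path: the lookup runs once when the expression is set up, and each row only applies the captured function. A hedged sketch of that hoisting pattern (class and field names here are illustrative, not Spark's actual `InvokeLike`):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch: resolve the type -> boxed-class lookup once at construction time and
// capture the result as a function, so the hot path does no map lookup.
class Invoker {
    private static final Map<Class<?>, Class<?>> TYPE_BOXED = new HashMap<>();
    static {
        TYPE_BOXED.put(int.class, Integer.class);
        TYPE_BOXED.put(long.class, Long.class);
    }

    private final Function<Object, Object> boxingFn;

    Invoker(Class<?> dataType) {
        // Resolved once, not on every invoke() call.
        Class<?> boxed = TYPE_BOXED.get(dataType);
        if (boxed != null) {
            boxingFn = boxed::cast;
        } else {
            boxingFn = Function.identity();
        }
    }

    // Per-row path: a single virtual call, no branching or lookup.
    Object invoke(Object ret) {
        return boxingFn.apply(ret);
    }
}
```

This is the same micro-optimization the reviewer is asking to quantify with a benchmark: the saving per call is tiny, but the call sits on a per-row path.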
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #32535: [SPARK-35397][SQL] Replace sys.err usage with explicit exception type
dongjoon-hyun commented on a change in pull request #32535: URL: https://github.com/apache/spark/pull/32535#discussion_r631630221

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

```scala
@@ -3338,7 +3338,8 @@ class Analyzer(override val catalogManager: CatalogManager)
       case _ : InnerLike =>
         (leftKeys ++ lUniqueOutput ++ rUniqueOutput, rightKeys)
       case _ =>
-        sys.error("Unsupported natural join type " + joinType)
+        throw QueryExecutionErrors.unsupportedNaturalJoinTypeError(joinType)
+
```

Review comment: nit. extra space.
[GitHub] [spark] SparkQA commented on pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation
SparkQA commented on pull request #32498: URL: https://github.com/apache/spark/pull/32498#issuecomment-840387634 **[Test build #138494 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138494/testReport)** for PR 32498 at commit [`b7a6cc7`](https://github.com/apache/spark/commit/b7a6cc71110fe8de45e8c74d487ebd23b7942f34). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] SparkQA removed a comment on pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation
SparkQA removed a comment on pull request #32498: URL: https://github.com/apache/spark/pull/32498#issuecomment-840310425 **[Test build #138494 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138494/testReport)** for PR 32498 at commit [`b7a6cc7`](https://github.com/apache/spark/commit/b7a6cc71110fe8de45e8c74d487ebd23b7942f34).
[GitHub] [spark] YuzhouSun commented on pull request #32207: [SPARK-35106] Avoid failing rename in HadoopMapReduceCommitProtocol with dynamic partition overwrite
YuzhouSun commented on pull request #32207: URL: https://github.com/apache/spark/pull/32207#issuecomment-840387892 Created https://github.com/apache/spark/pull/32530. @cloud-fan Could you review it? Thanks!
[GitHub] [spark] SparkQA commented on pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation
SparkQA commented on pull request #32498: URL: https://github.com/apache/spark/pull/32498#issuecomment-840388226 **[Test build #138485 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138485/testReport)** for PR 32498 at commit [`1cf79bf`](https://github.com/apache/spark/commit/1cf79bf05bd66446629628aac99828bc776f012c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] SparkQA removed a comment on pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation
SparkQA removed a comment on pull request #32498: URL: https://github.com/apache/spark/pull/32498#issuecomment-840264723 **[Test build #138485 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138485/testReport)** for PR 32498 at commit [`1cf79bf`](https://github.com/apache/spark/commit/1cf79bf05bd66446629628aac99828bc776f012c).
[GitHub] [spark] SparkQA commented on pull request #32494: [SPARK-35362][SQL] Update null count in the column stats for UNION operator stats estimation
SparkQA commented on pull request #32494: URL: https://github.com/apache/spark/pull/32494#issuecomment-840389098 **[Test build #138486 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138486/testReport)** for PR 32494 at commit [`bd838ac`](https://github.com/apache/spark/commit/bd838ac955ec62a817360347325a5be83b49be4a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] SparkQA removed a comment on pull request #32494: [SPARK-35362][SQL] Update null count in the column stats for UNION operator stats estimation
SparkQA removed a comment on pull request #32494: URL: https://github.com/apache/spark/pull/32494#issuecomment-840264760 **[Test build #138486 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138486/testReport)** for PR 32494 at commit [`bd838ac`](https://github.com/apache/spark/commit/bd838ac955ec62a817360347325a5be83b49be4a).
[GitHub] [spark] fhygh commented on a change in pull request #32501: [SPARK-35359][SQL] Insert data with char/varchar datatype will fail when data length exceed length limitation
fhygh commented on a change in pull request #32501: URL: https://github.com/apache/spark/pull/32501#discussion_r631638329

## File path: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/util/CharVarcharCodegenUtils.java

```java
@@ -26,7 +27,7 @@
   private static UTF8String trimTrailingSpaces(
       UTF8String inputStr, int numChars, int limit) {
     int numTailSpacesToTrim = numChars - limit;
     UTF8String trimmed = inputStr.trimTrailingSpaces(numTailSpacesToTrim);
-    if (trimmed.numChars() > limit) {
+    if (trimmed.numChars() > limit && !SQLConf.get().charVarcharAsString()) {
```

Review comment: then do we need to update `PartitioningUtils.charTypeWriteSideCheck` and `PartitioningUtils.varcharTypeWriteSideCheck`?
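The write-side check under discussion works in two steps: first trim trailing spaces down toward the type's length limit, then fail only if non-space content still exceeds it. A simplified model in plain Java (this is not the real `CharVarcharCodegenUtils`; the `charVarcharAsString` flag mirrors the config read visible in the diff):

```java
// Rough model of the char/varchar write-side length check: trailing spaces
// beyond the limit are forgiven, but longer non-space content throws unless
// the (assumed) charVarcharAsString escape hatch is on.
class CharVarcharCheck {
    static String check(String input, int limit, boolean charVarcharAsString) {
        String trimmed = input;
        // Trim at most (length - limit) trailing spaces, i.e. stop at `limit`.
        while (trimmed.length() > limit && trimmed.endsWith(" ")) {
            trimmed = trimmed.substring(0, trimmed.length() - 1);
        }
        if (trimmed.length() > limit && !charVarcharAsString) {
            throw new RuntimeException("input string of length " + input.length()
                + " exceeds varchar type length limitation: " + limit);
        }
        return trimmed;
    }
}
```

The reviewer's question is whether the same flag must also be honored in the parallel partition-value code paths, since otherwise the two checks would disagree for the same value.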
[GitHub] [spark] SparkQA commented on pull request #32528: [SPARK-35350][SQL] Add code-gen for left semi sort merge join
SparkQA commented on pull request #32528: URL: https://github.com/apache/spark/pull/32528#issuecomment-840401476
[GitHub] [spark] c21 commented on a change in pull request #32528: [SPARK-35350][SQL] Add code-gen for left semi sort merge join
c21 commented on a change in pull request #32528: URL: https://github.com/apache/spark/pull/32528#discussion_r631646593

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala

```scala
@@ -679,60 +704,124 @@ case class SortMergeJoinExec(
       (evaluateVariables(streamedVars), "")
     }
-    val thisPlan = ctx.addReferenceObj("plan", this)
-    val eagerCleanup = s"$thisPlan.cleanupResources();"
-
-    lazy val innerJoin =
+    val beforeLoop =
       s"""
-         |while (findNextJoinRows($streamedInput, $bufferedInput)) {
-         |  ${streamedVarDecl.mkString("\n")}
-         |  ${beforeLoop.trim}
-         |  scala.collection.Iterator $iterator = $matches.generateIterator();
-         |  while ($iterator.hasNext()) {
-         |    InternalRow $bufferedRow = (InternalRow) $iterator.next();
-         |    ${condCheck.trim}
-         |    $numOutput.add(1);
-         |    ${consume(ctx, resultVars)}
-         |  }
-         |  if (shouldStop()) return;
-         |}
-         |$eagerCleanup
-       """.stripMargin
-
-    lazy val outerJoin = {
-      val hasOutputRow = ctx.freshName("hasOutputRow")
+         |${streamedVarDecl.mkString("\n")}
+         |${streamedBeforeLoop.trim}
+         |scala.collection.Iterator $iterator = $matches.generateIterator();
+       """.stripMargin
+    val outputRow =
       s"""
-         |while ($streamedInput.hasNext()) {
-         |  findNextJoinRows($streamedInput, $bufferedInput);
-         |  ${streamedVarDecl.mkString("\n")}
-         |  ${beforeLoop.trim}
-         |  scala.collection.Iterator $iterator = $matches.generateIterator();
-         |  boolean $hasOutputRow = false;
-         |
-         |  // the last iteration of this loop is to emit an empty row if there is no matched rows.
-         |  while ($iterator.hasNext() || !$hasOutputRow) {
-         |    InternalRow $bufferedRow = $iterator.hasNext() ?
-         |      (InternalRow) $iterator.next() : null;
-         |    ${condCheck.trim}
-         |    $hasOutputRow = true;
-         |    $numOutput.add(1);
-         |    ${consume(ctx, resultVars)}
-         |  }
-         |  if (shouldStop()) return;
-         |}
-         |$eagerCleanup
+         |$numOutput.add(1);
+         |${consume(ctx, resultVars)}
       """.stripMargin
-    }
+    val findNextJoinRows = s"findNextJoinRows($streamedInput, $bufferedInput)"
+    val thisPlan = ctx.addReferenceObj("plan", this)
+    val eagerCleanup = s"$thisPlan.cleanupResources();"

     joinType match {
-      case _: InnerLike => innerJoin
-      case LeftOuter | RightOuter => outerJoin
+      case _: InnerLike =>
+        codegenInner(findNextJoinRows, beforeLoop, iterator, bufferedRow, condCheck, outputRow,
+          eagerCleanup)
+      case LeftOuter | RightOuter =>
+        codegenOuter(streamedInput, findNextJoinRows, beforeLoop, iterator, bufferedRow, condCheck,
+          ctx.freshName("hasOutputRow"), outputRow, eagerCleanup)
+      case LeftSemi =>
+        codegenSemi(findNextJoinRows, beforeLoop, iterator, bufferedRow, condCheck,
+          ctx.freshName("hasOutputRow"), outputRow, eagerCleanup)
       case x =>
         throw new IllegalArgumentException(
           s"SortMergeJoin.doProduce should not take $x as the JoinType")
     }
   }

+  /**
+   * Generates the code for Inner join.
+   */
+  private def codegenInner(
+      findNextJoinRows: String,
+      beforeLoop: String,
+      matchIterator: String,
+      bufferedRow: String,
+      conditionCheck: String,
+      outputRow: String,
+      eagerCleanup: String): String = {
+    s"""
+       |while ($findNextJoinRows) {
+       |  ${beforeLoop.trim}
+       |  while ($matchIterator.hasNext()) {
+       |    InternalRow $bufferedRow = (InternalRow) $matchIterator.next();
+       |    ${conditionCheck.trim}
+       |    $outputRow
+       |  }
+       |  if (shouldStop()) return;
+       |}
+       |$eagerCleanup
+     """.stripMargin
+  }
+
+  /**
+   * Generates the code for Left or Right Outer join.
+   */
+  private def codegenOuter(
+      streamedInput: String,
+      findNextJoinRows: String,
+      beforeLoop: String,
+      matchIterator: String,
+      bufferedRow: String,
+      conditionCheck: String,
+      hasOutputRow: String,
+      outputRow: String,
+      eagerCleanup: String): String = {
+    s"""
+       |while ($streamedInput.hasNext()) {
+       |  $findNextJoinRows;
+       |  ${beforeLoop.trim}
+       |  boolean $hasOutputRow = false;
+       |
+       |  // the last iteration of this loop is to emit an empty row if there is no matched rows.
+       |  while ($matchIterator.hasNext() || !$hasOutputRow) {
+       |    InternalRow $bufferedRow = $matchIterator.hasNext() ?
+       |      (InternalRow) $matchIterator.next() : null;
+       |    ${conditionCheck.trim}
+       |    $hasOutputRow = true;
+       |    $outputRow
+       |  }
+       |  if (shouldStop()) return;
+       |}
```
[GitHub] [spark] HyukjinKwon commented on pull request #32531: [SPARK-35394][K8S][BUILD] Move kubernetes-client.version to root pom file
HyukjinKwon commented on pull request #32531: URL: https://github.com/apache/spark/pull/32531#issuecomment-840403834 LGTM2
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32528: [SPARK-35350][SQL] Add code-gen for left semi sort merge join
AmplabJenkins removed a comment on pull request #32528: URL: https://github.com/apache/spark/pull/32528#issuecomment-840404809 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43020/
[GitHub] [spark] AmplabJenkins commented on pull request #32494: [SPARK-35362][SQL] Update null count in the column stats for UNION operator stats estimation
AmplabJenkins commented on pull request #32494: URL: https://github.com/apache/spark/pull/32494#issuecomment-840404805 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138486/
[GitHub] [spark] AmplabJenkins commented on pull request #32528: [SPARK-35350][SQL] Add code-gen for left semi sort merge join
AmplabJenkins commented on pull request #32528: URL: https://github.com/apache/spark/pull/32528#issuecomment-840404809 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43020/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32494: [SPARK-35362][SQL] Update null count in the column stats for UNION operator stats estimation
AmplabJenkins removed a comment on pull request #32494: URL: https://github.com/apache/spark/pull/32494#issuecomment-840404805 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138486/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation
AmplabJenkins removed a comment on pull request #32498: URL: https://github.com/apache/spark/pull/32498#issuecomment-840404804
[GitHub] [spark] AmplabJenkins commented on pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation
AmplabJenkins commented on pull request #32498: URL: https://github.com/apache/spark/pull/32498#issuecomment-840404804
[GitHub] [spark] AmplabJenkins commented on pull request #32292: [SPARK-35162][SQL] New SQL functions: TRY_ADD/TRY_DIVIDE
AmplabJenkins commented on pull request #32292: URL: https://github.com/apache/spark/pull/32292#issuecomment-840405169 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43022/
[GitHub] [spark] SparkQA commented on pull request #32292: [SPARK-35162][SQL] New SQL functions: TRY_ADD/TRY_DIVIDE
SparkQA commented on pull request #32292: URL: https://github.com/apache/spark/pull/32292#issuecomment-840405132
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32292: [SPARK-35162][SQL] New SQL functions: TRY_ADD/TRY_DIVIDE
AmplabJenkins removed a comment on pull request #32292: URL: https://github.com/apache/spark/pull/32292#issuecomment-840405169 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43022/
[GitHub] [spark] SparkQA commented on pull request #32535: [SPARK-35397][SQL] Replace sys.err usage with explicit exception type
SparkQA commented on pull request #32535: URL: https://github.com/apache/spark/pull/32535#issuecomment-840406769 **[Test build #138503 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138503/testReport)** for PR 32535 at commit [`122d853`](https://github.com/apache/spark/commit/122d85371ba45e0395715793fc3ac798300768a1).
[GitHub] [spark] SparkQA commented on pull request #32478: [SPARK-35063][SQL] Group exception messages in sql/catalyst
SparkQA commented on pull request #32478: URL: https://github.com/apache/spark/pull/32478#issuecomment-840406845 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43021/
[GitHub] [spark] yaooqinn closed pull request #32515: [SPARK-35380][SQL] Loading SparkSessionExtensions from ServiceLoader
yaooqinn closed pull request #32515: URL: https://github.com/apache/spark/pull/32515
[GitHub] [spark] yaooqinn commented on pull request #32515: [SPARK-35380][SQL] Loading SparkSessionExtensions from ServiceLoader
yaooqinn commented on pull request #32515: URL: https://github.com/apache/spark/pull/32515#issuecomment-840410375 Thanks for the help, @dongjoon-hyun @HyukjinKwon @cloud-fan! +1 from me as well. Merged to master.
[GitHub] [spark] SparkQA commented on pull request #32478: [SPARK-35063][SQL] Group exception messages in sql/catalyst
SparkQA commented on pull request #32478: URL: https://github.com/apache/spark/pull/32478#issuecomment-840410483 Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43021/
[GitHub] [spark] AmplabJenkins commented on pull request #32478: [SPARK-35063][SQL] Group exception messages in sql/catalyst
AmplabJenkins commented on pull request #32478: URL: https://github.com/apache/spark/pull/32478#issuecomment-840410504 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43021/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32478: [SPARK-35063][SQL] Group exception messages in sql/catalyst
AmplabJenkins removed a comment on pull request #32478: URL: https://github.com/apache/spark/pull/32478#issuecomment-840410504 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43021/
[GitHub] [spark] SparkQA commented on pull request #32515: [SPARK-35380][SQL] Loading SparkSessionExtensions from ServiceLoader
SparkQA commented on pull request #32515: URL: https://github.com/apache/spark/pull/32515#issuecomment-840414987 **[Test build #138489 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138489/testReport)** for PR 32515 at commit [`ad18acc`](https://github.com/apache/spark/commit/ad18acca9e991251fa44d33f53e8c4648fbcdbb7). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] SparkQA removed a comment on pull request #32515: [SPARK-35380][SQL] Loading SparkSessionExtensions from ServiceLoader
SparkQA removed a comment on pull request #32515: URL: https://github.com/apache/spark/pull/32515#issuecomment-840286591 **[Test build #138489 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138489/testReport)** for PR 32515 at commit [`ad18acc`](https://github.com/apache/spark/commit/ad18acca9e991251fa44d33f53e8c4648fbcdbb7).
[GitHub] [spark] ulysses-you commented on a change in pull request #32482: [SPARK-35332][SQL] Make cache plan disable configs configurable
ulysses-you commented on a change in pull request #32482: URL: https://github.com/apache/spark/pull/32482#discussion_r631671896 ## File path: sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala ## @@ -1554,4 +1554,65 @@ class CachedTableSuite extends QueryTest with SQLTestUtils assert(!spark.catalog.isCached(viewName)) } } + + test("SPARK-35332: Make cache plan disable configs configurable - check AQE") { +withSQLConf(SQLConf.SHUFFLE_PARTITIONS.key -> "2", + SQLConf.COALESCE_PARTITIONS_MIN_PARTITION_NUM.key -> "1", + SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true") { + + withTempView("t1", "t2", "t3") { +withSQLConf(SQLConf.CAN_CHANGE_CACHED_PLAN_OUTPUT_PARTITIONING.key -> "false") { + sql("CACHE TABLE t1 as SELECT /*+ REPARTITION */ * FROM values(1) as t(c)") + assert(spark.table("t1").rdd.partitions.length == 2) + + withSQLConf(SQLConf.CAN_CHANGE_CACHED_PLAN_OUTPUT_PARTITIONING.key -> "true") { +assert(spark.table("t1").rdd.partitions.length == 2) +sql("CACHE TABLE t2 as SELECT /*+ REPARTITION */ * FROM values(2) as t(c)") +assert(spark.table("t2").rdd.partitions.length == 1) + +withSQLConf(SQLConf.CAN_CHANGE_CACHED_PLAN_OUTPUT_PARTITIONING.key -> "false") { Review comment: Updated it.
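The nested `withSQLConf` blocks in the test above rely on a loan pattern: each call saves the touched keys, runs the body, and restores the previous values on exit, so inner settings shadow outer ones only while active. A minimal, standalone sketch of that pattern (the `withConf` helper and the plain mutable map are hypothetical stand-ins, not Spark's actual `SQLConf`):

```scala
import scala.collection.mutable

object WithConfDemo {
  // Hypothetical stand-in for Spark's SQLConf: a plain mutable key/value map.
  val conf: mutable.Map[String, String] = mutable.Map.empty

  // Set the given keys for the duration of `body`, then restore the
  // previous values (or remove keys that were previously unset).
  def withConf[T](pairs: (String, String)*)(body: => T): T = {
    val saved = pairs.map { case (k, _) => k -> conf.get(k) }
    pairs.foreach { case (k, v) => conf(k) = v }
    try body
    finally saved.foreach {
      case (k, Some(old)) => conf(k) = old
      case (k, None) => conf.remove(k)
    }
  }
}
```

Nesting then behaves like the test: an inner `withConf("x" -> "true")` wins inside its own body, and the outer `"false"` is restored as soon as the inner block exits.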
[GitHub] [spark] SparkQA commented on pull request #32535: [SPARK-35397][SQL] Replace sys.err usage with explicit exception type
SparkQA commented on pull request #32535: URL: https://github.com/apache/spark/pull/32535#issuecomment-840429681
[GitHub] [spark] SparkQA commented on pull request #32204: [SPARK-34494][SQL][DOCS] Move JSON data source options from Python and Scala into a single page
SparkQA commented on pull request #32204: URL: https://github.com/apache/spark/pull/32204#issuecomment-840430061 **[Test build #138492 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138492/testReport)** for PR 32204 at commit [`a386788`](https://github.com/apache/spark/commit/a386788b44fb5255d2784ce423e3f879ba97f53c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] SparkQA removed a comment on pull request #32204: [SPARK-34494][SQL][DOCS] Move JSON data source options from Python and Scala into a single page
SparkQA removed a comment on pull request #32204: URL: https://github.com/apache/spark/pull/32204#issuecomment-840291088 **[Test build #138492 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138492/testReport)** for PR 32204 at commit [`a386788`](https://github.com/apache/spark/commit/a386788b44fb5255d2784ce423e3f879ba97f53c).
[GitHub] [spark] SparkQA commented on pull request #32478: [SPARK-35063][SQL] Group exception messages in sql/catalyst
SparkQA commented on pull request #32478: URL: https://github.com/apache/spark/pull/32478#issuecomment-840430677 **[Test build #138501 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138501/testReport)** for PR 32478 at commit [`4509cbb`](https://github.com/apache/spark/commit/4509cbbbe00debe19b670c463c85fb7fe295c8c3). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] SparkQA removed a comment on pull request #32478: [SPARK-35063][SQL] Group exception messages in sql/catalyst
SparkQA removed a comment on pull request #32478: URL: https://github.com/apache/spark/pull/32478#issuecomment-840376956 **[Test build #138501 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138501/testReport)** for PR 32478 at commit [`4509cbb`](https://github.com/apache/spark/commit/4509cbbbe00debe19b670c463c85fb7fe295c8c3).
[GitHub] [spark] AmplabJenkins commented on pull request #32204: [SPARK-34494][SQL][DOCS] Move JSON data source options from Python and Scala into a single page
AmplabJenkins commented on pull request #32204: URL: https://github.com/apache/spark/pull/32204#issuecomment-840436292 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138492/
[GitHub] [spark] AmplabJenkins commented on pull request #32535: [SPARK-35397][SQL] Replace sys.err usage with explicit exception type
AmplabJenkins commented on pull request #32535: URL: https://github.com/apache/spark/pull/32535#issuecomment-840436294 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43023/
[GitHub] [spark] AmplabJenkins commented on pull request #32478: [SPARK-35063][SQL] Group exception messages in sql/catalyst
AmplabJenkins commented on pull request #32478: URL: https://github.com/apache/spark/pull/32478#issuecomment-840436297 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138501/
[GitHub] [spark] AmplabJenkins commented on pull request #32515: [SPARK-35380][SQL] Loading SparkSessionExtensions from ServiceLoader
AmplabJenkins commented on pull request #32515: URL: https://github.com/apache/spark/pull/32515#issuecomment-840436293 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138489/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32478: [SPARK-35063][SQL] Group exception messages in sql/catalyst
AmplabJenkins removed a comment on pull request #32478: URL: https://github.com/apache/spark/pull/32478#issuecomment-840436297 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138501/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32515: [SPARK-35380][SQL] Loading SparkSessionExtensions from ServiceLoader
AmplabJenkins removed a comment on pull request #32515: URL: https://github.com/apache/spark/pull/32515#issuecomment-840436293 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138489/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32535: [SPARK-35397][SQL] Replace sys.err usage with explicit exception type
AmplabJenkins removed a comment on pull request #32535: URL: https://github.com/apache/spark/pull/32535#issuecomment-840436294 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43023/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32204: [SPARK-34494][SQL][DOCS] Move JSON data source options from Python and Scala into a single page
AmplabJenkins removed a comment on pull request #32204: URL: https://github.com/apache/spark/pull/32204#issuecomment-840436292 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138492/
[GitHub] [spark] SparkQA commented on pull request #32482: [SPARK-35332][SQL] Make cache plan disable configs configurable
SparkQA commented on pull request #32482: URL: https://github.com/apache/spark/pull/32482#issuecomment-840438115 **[Test build #138504 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138504/testReport)** for PR 32482 at commit [`2e8492b`](https://github.com/apache/spark/commit/2e8492baf2c9b6dba3bd635f2b3a25d288e7f3ab).
[GitHub] [spark] LuciferYang opened a new pull request #32536: Spark 35253 followup
LuciferYang opened a new pull request #32536: URL: https://github.com/apache/spark/pull/32536

### What changes were proposed in this pull request?
SPARK-35253 upgraded janino from 3.0.16 to 3.1.4. In this version, `ClassBodyEvaluator` provides a `getBytecodes` method that returns the mapping from `ClassFile.getThisClassName` to `ClassFile.toByteArray` directly, so we no longer need to obtain it through the reflection API. The main purpose of this PR is to simplify the way classes are retrieved from `ClassBodyEvaluator` in the `CodeGenerator.updateAndGetCompilationStats` method.

### Why are the changes needed?
Code simplification.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passes the Jenkins or GitHub Actions builds.
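For context, the shape of the simplification can be sketched with janino's types stubbed out as a plain `Map`: once the evaluator hands back className → bytecode directly, collecting compilation stats is an ordinary map traversal rather than a reflective walk of private fields. The helper below is illustrative only (it computes max class size, not the exact metrics Spark's `updateAndGetCompilationStats` derives from parsing the class files):

```scala
object CodegenStatsSketch {
  // `bytecodes` stands in for the Map returned by janino 3.1.4's
  // `ClassBodyEvaluator.getBytecodes` (className -> class file bytes).
  // With the accessor available, a stats pass over generated classes
  // is just a traversal of this map.
  def maxClassSize(bytecodes: Map[String, Array[Byte]]): (String, Int) =
    bytecodes
      .map { case (className, bytes) => className -> bytes.length }
      .maxBy(_._2)
}
```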
[GitHub] [spark] SparkQA commented on pull request #32536: [SPARK-35253][SQL][FOLLOWUP] Simplify the way to get classes from ClassBodyEvaluator in `CodeGenerator.updateAndGetCompilationStats` method
SparkQA commented on pull request #32536: URL: https://github.com/apache/spark/pull/32536#issuecomment-840440073 **[Test build #138505 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138505/testReport)** for PR 32536 at commit [`738bdc6`](https://github.com/apache/spark/commit/738bdc6d3459c30399ac4efec6e4bef799611ddd).
[GitHub] [spark] LuciferYang commented on pull request #32536: [SPARK-35398][SQL] Simplify the way to get classes from ClassBodyEvaluator in `CodeGenerator.updateAndGetCompilationStats` method
LuciferYang commented on pull request #32536: URL: https://github.com/apache/spark/pull/32536#issuecomment-840441694 Waiting for CI.
[GitHub] [spark] LuciferYang commented on pull request #32536: [SPARK-35398][SQL] Simplify the way to get classes from ClassBodyEvaluator in `CodeGenerator.updateAndGetCompilationStats` method
LuciferYang commented on pull request #32536: URL: https://github.com/apache/spark/pull/32536#issuecomment-840441827 cc @maropu @srowen
[GitHub] [spark] xuanyuanking commented on a change in pull request #32272: [SPARK-35172][SS] The implementation of RocksDBCheckpointMetadata
xuanyuanking commented on a change in pull request #32272: URL: https://github.com/apache/spark/pull/32272#discussion_r631688986 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBFileManager.scala ## @@ -0,0 +1,165 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.streaming.state + +import java.io.File +import java.nio.charset.StandardCharsets.UTF_8 +import java.nio.file.Files + +import scala.collection.Seq + +import com.fasterxml.jackson.annotation.JsonInclude.Include +import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper} +import com.fasterxml.jackson.module.scala.{DefaultScalaModule, ScalaObjectMapper} +import org.json4s.NoTypeHints +import org.json4s.jackson.Serialization + +/** + * Classes to represent metadata of checkpoints saved to DFS. Since this is converted to JSON, any + * changes to this MUST be backward-compatible. + */ +case class RocksDBCheckpointMetadata( +sstFiles: Seq[RocksDBSstFile], +logFiles: Seq[RocksDBLogFile], +numKeys: Long) { + import RocksDBCheckpointMetadata._ + + def json: String = { +// We turn this field into a null to avoid write a empty logFiles field in the json. 
+val nullified = if (logFiles.isEmpty) this.copy(logFiles = null) else this +mapper.writeValueAsString(nullified) } + + def prettyJson: String = Serialization.writePretty(this)(RocksDBCheckpointMetadata.format) Review comment: The only difference is the `logFiles` field. The `prettyJson` method provides a human-readable string for logging, while `json` is the form written to files.
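The behavior being discussed — dropping the `logFiles` field from the persisted JSON entirely when the list is empty — can be sketched without Jackson. The `Meta` case class and hand-rolled serializer below are hypothetical miniatures, not the real `RocksDBCheckpointMetadata`:

```scala
// Miniature of the serializer trick: an empty logFiles list produces
// no "logFiles" key at all, keeping the persisted JSON compact while
// staying backward-compatible for readers that treat it as optional.
case class Meta(sstFiles: Seq[String], logFiles: Seq[String], numKeys: Long) {
  def json: String = {
    val sst = sstFiles.map(f => "\"" + f + "\"").mkString("[", ",", "]")
    // Mirrors the `copy(logFiles = null)` + NON_NULL inclusion setup:
    // when the list is empty, emit nothing for this field.
    val logs =
      if (logFiles.isEmpty) ""
      else logFiles.map(f => "\"" + f + "\"").mkString(",\"logFiles\":[", ",", "]")
    s"""{"sstFiles":$sst$logs,"numKeys":$numKeys}"""
  }
}
```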
[GitHub] [spark] xuanyuanking commented on a change in pull request #32272: [SPARK-35172][SS] The implementation of RocksDBCheckpointMetadata
xuanyuanking commented on a change in pull request #32272: URL: https://github.com/apache/spark/pull/32272#discussion_r631689200 ## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/RocksDBSuite.scala ## @@ -0,0 +1,65 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.execution.streaming.state + +import java.io._ +import java.nio.charset.Charset + +import org.apache.commons.io.FileUtils + +import org.apache.spark._ +import org.apache.spark.util.Utils + +class RocksDBSuite extends SparkFunSuite { + + test("checkpoint metadata serde roundtrip") { +def checkJsonRoundtrip(metadata: RocksDBCheckpointMetadata, json: String): Unit = { + assert(metadata.json == json) + withTempDirectory { dir => +val file = new File(dir, "json") +FileUtils.write(file, s"v1\n$json", Charset.defaultCharset) +assert(metadata == RocksDBCheckpointMetadata.readFromFile(file)) + } +} +val sstFiles = Seq(RocksDBSstFile("1.sst", "1-uuid.sst", 12345678901234L)) +val logFiles = Seq(RocksDBLogFile("1.log", "1-uuid.log", 12345678901234L)) + +// scalastyle:off line.size.limit +// should always include sstFiles and numKeys +checkJsonRoundtrip( + RocksDBCheckpointMetadata(Seq.empty, 0L), + """{"sstFiles":[],"numKeys":0}""" +) +// shouldn't include the "logFiles" field in json when it's empty +checkJsonRoundtrip( + RocksDBCheckpointMetadata(sstFiles, 12345678901234L), + """{"sstFiles":[{"localFileName":"1.sst","dfsSstFileName":"1-uuid.sst","sizeBytes":12345678901234}],"numKeys":12345678901234}""" +) +checkJsonRoundtrip( + RocksDBCheckpointMetadata(sstFiles, logFiles, 12345678901234L), + """{"sstFiles":[{"localFileName":"1.sst","dfsSstFileName":"1-uuid.sst","sizeBytes":12345678901234}],"logFiles":[{"localFileName":"1.log","dfsLogFileName":"1-uuid.log","sizeBytes":12345678901234}],"numKeys":12345678901234}""") +// scalastyle:on line.size.limit + } + + private def withTempDirectory(f: File => Unit): Unit = { Review comment: Ah yes. Let me update. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
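The review above asks about the test suite's local `withTempDirectory` helper (which the author agrees to update, presumably to Spark's own `Utils.withTempDir`). As a standalone illustration, a loan-pattern helper over JDK temp directories can be sketched as follows; the helper name matches the test, but the cleanup logic here is a plain-JDK assumption, not the PR's code:

```scala
import java.io.File
import java.nio.file.Files

// Loan pattern: create a temp directory, lend it to the test body, and
// clean it up afterwards even if the body throws.
def withTempDirectory(f: File => Unit): Unit = {
  val dir = Files.createTempDirectory("rocksdb-suite-").toFile
  try f(dir)
  finally {
    // Best-effort recursive delete; listFiles is null for non-directories.
    def delete(file: File): Unit = {
      Option(file.listFiles).foreach(_.foreach(delete))
      file.delete()
    }
    delete(dir)
  }
}

withTempDirectory { dir =>
  assert(dir.isDirectory)
}
```

This is the same shape the suite uses to round-trip metadata through a real file without leaking temp state between tests.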
[GitHub] [spark] xuanyuanking commented on a change in pull request #32272: [SPARK-35172][SS] The implementation of RocksDBCheckpointMetadata
xuanyuanking commented on a change in pull request #32272: URL: https://github.com/apache/spark/pull/32272#discussion_r631689463 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBFileManager.scala ## @@ -0,0 +1,165 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.execution.streaming.state + +import java.io.File +import java.nio.charset.StandardCharsets.UTF_8 +import java.nio.file.Files + +import scala.collection.Seq + +import com.fasterxml.jackson.annotation.JsonInclude.Include +import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper} +import com.fasterxml.jackson.module.scala.{DefaultScalaModule, ScalaObjectMapper} +import org.json4s.NoTypeHints +import org.json4s.jackson.Serialization + +/** + * Classes to represent metadata of checkpoints saved to DFS. Since this is converted to JSON, any + * changes to this MUST be backward-compatible. + */ +case class RocksDBCheckpointMetadata( +sstFiles: Seq[RocksDBSstFile], +logFiles: Seq[RocksDBLogFile], +numKeys: Long) { + import RocksDBCheckpointMetadata._ + + def json: String = { +// We turn this field into a null to avoid write a empty logFiles field in the json. 
+val nullified = if (logFiles.isEmpty) this.copy(logFiles = null) else this Review comment: Yes, the `logFiles` field does not always have a value. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32528: [SPARK-35350][SQL] Add code-gen for left semi sort merge join
SparkQA commented on pull request #32528: URL: https://github.com/apache/spark/pull/32528#issuecomment-840444115 **[Test build #138506 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138506/testReport)** for PR 32528 at commit [`6282a09`](https://github.com/apache/spark/commit/6282a09d2b78aad102926f956a10d68272e05c8d). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] c21 commented on a change in pull request #32528: [SPARK-35350][SQL] Add code-gen for left semi sort merge join
c21 commented on a change in pull request #32528: URL: https://github.com/apache/spark/pull/32528#discussion_r631690259 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala ## @@ -679,60 +704,124 @@ case class SortMergeJoinExec( (evaluateVariables(streamedVars), "") } -val thisPlan = ctx.addReferenceObj("plan", this) -val eagerCleanup = s"$thisPlan.cleanupResources();" - -lazy val innerJoin = +val beforeLoop = s""" - |while (findNextJoinRows($streamedInput, $bufferedInput)) { - | ${streamedVarDecl.mkString("\n")} - | ${beforeLoop.trim} - | scala.collection.Iterator $iterator = $matches.generateIterator(); - | while ($iterator.hasNext()) { - |InternalRow $bufferedRow = (InternalRow) $iterator.next(); - |${condCheck.trim} - |$numOutput.add(1); - |${consume(ctx, resultVars)} - | } - | if (shouldStop()) return; - |} - |$eagerCleanup - """.stripMargin - -lazy val outerJoin = { - val hasOutputRow = ctx.freshName("hasOutputRow") + |${streamedVarDecl.mkString("\n")} + |${streamedBeforeLoop.trim} + |scala.collection.Iterator $iterator = $matches.generateIterator(); + """.stripMargin +val outputRow = s""" - |while ($streamedInput.hasNext()) { - | findNextJoinRows($streamedInput, $bufferedInput); - | ${streamedVarDecl.mkString("\n")} - | ${beforeLoop.trim} - | scala.collection.Iterator $iterator = $matches.generateIterator(); - | boolean $hasOutputRow = false; - | - | // the last iteration of this loop is to emit an empty row if there is no matched rows. - | while ($iterator.hasNext() || !$hasOutputRow) { - |InternalRow $bufferedRow = $iterator.hasNext() ? 
- | (InternalRow) $iterator.next() : null; -${condCheck.trim} -$hasOutputRow = true; -$numOutput.add(1); -${consume(ctx, resultVars)} - } - | if (shouldStop()) return; -|} -|$eagerCleanup + |$numOutput.add(1); + |${consume(ctx, resultVars)} """.stripMargin -} +val findNextJoinRows = s"findNextJoinRows($streamedInput, $bufferedInput)" +val thisPlan = ctx.addReferenceObj("plan", this) +val eagerCleanup = s"$thisPlan.cleanupResources();" joinType match { - case _: InnerLike => +codegenInner(findNextJoinRows, beforeLoop, iterator, bufferedRow, condCheck, outputRow, Review comment: Actually, after double-checking, we do not need to do `beforeLoop.trim`, as `beforeLoop` already has `stripMargin` applied and has no trailing spaces. Also updated to avoid the repeated `conditionCheck.trim`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
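The `beforeLoop.trim` discussion above comes down to what `stripMargin` does and does not remove. A toy version of a generated-code fragment makes the difference concrete; the names (`iterator`, `matches`, the Java snippet itself) are illustrative stand-ins, not the PR's actual fresh names:

```scala
val iterator = "smj_iterator_0"

// A stripMargin'd interpolation in the style of the codegen templates.
val beforeLoop =
  s"""
     |InternalRow row = null;
     |scala.collection.Iterator $iterator = matches.generateIterator();
   """.stripMargin

// stripMargin only removes the per-line leading whitespace up to and
// including the `|`; the blank first and last lines of the triple-quoted
// literal survive as surrounding newlines...
assert(beforeLoop.startsWith("\n"))
// ...and those surrounding newlines are exactly what a `.trim` at the
// splice point would strip away.
assert(beforeLoop.trim.startsWith("InternalRow"))
assert(beforeLoop.trim.endsWith("generateIterator();"))
```

So whether a fragment needs an extra `.trim` depends only on whether the surrounding template can tolerate the leading and trailing newlines.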
[GitHub] [spark] SparkQA commented on pull request #32272: [SPARK-35172][SS] The implementation of RocksDBCheckpointMetadata
SparkQA commented on pull request #32272: URL: https://github.com/apache/spark/pull/32272#issuecomment-84023 **[Test build #138507 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138507/testReport)** for PR 32272 at commit [`3b91a26`](https://github.com/apache/spark/commit/3b91a268c356544c8b6f75e7c365e0bfa7459e67). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] xuanyuanking commented on pull request #32272: [SPARK-35172][SS] The implementation of RocksDBCheckpointMetadata
xuanyuanking commented on pull request #32272: URL: https://github.com/apache/spark/pull/32272#issuecomment-840445866 ``` there're some sorts of uncertainty during reviewing as there's no reference PR. In other words, we are reviewing methods which we don't have idea how these methods will be used. ``` ``` I'm also OK to review PRs one by one with uncertainty (with faith) and revisit all changes at the last phase. ``` Yes, agree on both. I propose that we mark down the uncertain methods, or the ones without a caller side for now, in the PR. When I submit the reference PR, I can link the comments to the newly created PRs. It should help our review and make sure I don't miss explaining any uncertainty during the review. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #32161: [SPARK-35025][SQL][PYTHON][DOCS] Move Parquet data source options from Python and Scala into a single page.
SparkQA removed a comment on pull request #32161: URL: https://github.com/apache/spark/pull/32161#issuecomment-840310729 **[Test build #138497 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138497/testReport)** for PR 32161 at commit [`bb5cd45`](https://github.com/apache/spark/commit/bb5cd4529b07b05b21cdaf878b06b61ad717be79). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32161: [SPARK-35025][SQL][PYTHON][DOCS] Move Parquet data source options from Python and Scala into a single page.
SparkQA commented on pull request #32161: URL: https://github.com/apache/spark/pull/32161#issuecomment-840451449 **[Test build #138497 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138497/testReport)** for PR 32161 at commit [`bb5cd45`](https://github.com/apache/spark/commit/bb5cd4529b07b05b21cdaf878b06b61ad717be79). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #32515: [SPARK-35380][SQL] Loading SparkSessionExtensions from ServiceLoader
SparkQA removed a comment on pull request #32515: URL: https://github.com/apache/spark/pull/32515#issuecomment-840310366 **[Test build #138493 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138493/testReport)** for PR 32515 at commit [`b8b54ea`](https://github.com/apache/spark/commit/b8b54ea9cb3bdbb8f50bdb260567dedd2af9fe1b). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org