[GitHub] [spark] SparkQA commented on pull request #32533: [SPARK-35392][ML][PYTHON] Remove Flaky GMM Test in ml/clustering.py
SparkQA commented on pull request #32533: URL: https://github.com/apache/spark/pull/32533#issuecomment-840356642 **[Test build #138498 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138498/testReport)** for PR 32533 at commit [`0081246`](https://github.com/apache/spark/commit/008124671ed27cb4367c941a1f8b73cda76e13b0). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #32533: [SPARK-35392][ML][PYTHON] Remove Flaky GMM Test in ml/clustering.py
SparkQA removed a comment on pull request #32533: URL: https://github.com/apache/spark/pull/32533#issuecomment-840339950 **[Test build #138498 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138498/testReport)** for PR 32533 at commit [`0081246`](https://github.com/apache/spark/commit/008124671ed27cb4367c941a1f8b73cda76e13b0).
[GitHub] [spark] SparkQA removed a comment on pull request #32515: [SPARK-35380][SQL] Loading SparkSessionExtensions from ServiceLoader
SparkQA removed a comment on pull request #32515: URL: https://github.com/apache/spark/pull/32515#issuecomment-840242617 **[Test build #138482 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138482/testReport)** for PR 32515 at commit [`a79d76e`](https://github.com/apache/spark/commit/a79d76eda4e4fe262a57a32b6aa16079aead7b34).
[GitHub] [spark] SparkQA commented on pull request #32515: [SPARK-35380][SQL] Loading SparkSessionExtensions from ServiceLoader
SparkQA commented on pull request #32515: URL: https://github.com/apache/spark/pull/32515#issuecomment-840355450 **[Test build #138482 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138482/testReport)** for PR 32515 at commit [`a79d76e`](https://github.com/apache/spark/commit/a79d76eda4e4fe262a57a32b6aa16079aead7b34). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] SparkQA removed a comment on pull request #32292: [SPARK-35162][SQL] New SQL functions: TRY_ADD/TRY_DIVIDE
SparkQA removed a comment on pull request #32292: URL: https://github.com/apache/spark/pull/32292#issuecomment-840286781 **[Test build #138490 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138490/testReport)** for PR 32292 at commit [`774bda1`](https://github.com/apache/spark/commit/774bda13487ab0823e20d0295c6e7108a5a62b83).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32292: [SPARK-35162][SQL] New SQL functions: TRY_ADD/TRY_DIVIDE
AmplabJenkins removed a comment on pull request #32292: URL: https://github.com/apache/spark/pull/32292#issuecomment-840347165 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138490/
[GitHub] [spark] AmplabJenkins commented on pull request #32292: [SPARK-35162][SQL] New SQL functions: TRY_ADD/TRY_DIVIDE
AmplabJenkins commented on pull request #32292: URL: https://github.com/apache/spark/pull/32292#issuecomment-840347165 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138490/
[GitHub] [spark] SparkQA commented on pull request #32292: [SPARK-35162][SQL] New SQL functions: TRY_ADD/TRY_DIVIDE
SparkQA commented on pull request #32292: URL: https://github.com/apache/spark/pull/32292#issuecomment-840346863 **[Test build #138490 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138490/testReport)** for PR 32292 at commit [`774bda1`](https://github.com/apache/spark/commit/774bda13487ab0823e20d0295c6e7108a5a62b83). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class TryEval(child: Expression) extends UnaryExpression with NullIntolerant`
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32161: [SPARK-35025][SQL][PYTHON][DOCS] Move Parquet data source options from Python and Scala into a single page.
AmplabJenkins removed a comment on pull request #32161: URL: https://github.com/apache/spark/pull/32161#issuecomment-840344422 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43017/
[GitHub] [spark] SparkQA commented on pull request #32161: [SPARK-35025][SQL][PYTHON][DOCS] Move Parquet data source options from Python and Scala into a single page.
SparkQA commented on pull request #32161: URL: https://github.com/apache/spark/pull/32161#issuecomment-840344392 Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43017/
[GitHub] [spark] AmplabJenkins commented on pull request #32161: [SPARK-35025][SQL][PYTHON][DOCS] Move Parquet data source options from Python and Scala into a single page.
AmplabJenkins commented on pull request #32161: URL: https://github.com/apache/spark/pull/32161#issuecomment-840344422 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43017/
[GitHub] [spark] cloud-fan commented on a change in pull request #32528: [SPARK-35350][SQL] Add code-gen for left semi sort merge join
cloud-fan commented on a change in pull request #32528:
URL: https://github.com/apache/spark/pull/32528#discussion_r631598224

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala ##

@@ -424,8 +424,18 @@ case class SortMergeJoinExec(
     // A list to hold all matched rows from buffered side.
     val clsName = classOf[ExternalAppendOnlyUnsafeRowArray].getName
+    // Flag to only buffer first matched row, to avoid buffering unnecessary rows.
+    val onlyBufferFirstMatchedRow = (joinType, condition) match {
+      case (LeftSemi, None) => true
+      case _ => false
+    }
+    val inMemoryThreshold =
+      if (onlyBufferFirstMatchedRow) {

Review comment:
+1, `lazy val` can probably be `def` as the logic is super simple
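For readers following the thread, the flag in the diff above covers a left semi join with no extra join condition, where a left row qualifies as soon as one right row shares its key. The Python sketch below illustrates that idea on pre-sorted key lists. It is only an illustration: `left_semi_merge_join` and its inputs are hypothetical names, not Spark code, and the real operator works on buffered rows and generated Java.

```python
def left_semi_merge_join(left_keys, right_keys):
    """Return the left keys that have at least one equal key on the right.

    Both inputs are assumed to be sorted in ascending order (hypothetical helper).
    """
    matched = []
    j = 0  # cursor into right_keys; it only ever moves forward
    for key in left_keys:
        # Skip right keys that are smaller than the current left key.
        while j < len(right_keys) and right_keys[j] < key:
            j += 1
        # One matching right row is enough; later duplicates are never inspected.
        if j < len(right_keys) and right_keys[j] == key:
            matched.append(key)
    return matched


print(left_semi_merge_join([1, 2, 2, 4], [2, 2, 2, 3, 4, 4]))  # [2, 2, 4]
```

Because the output never includes right-side columns, nothing beyond the first matching right row has to be kept, which is the saving the flag is after.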
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32410: [SPARK-35286][SQL] Replace SessionState.start with SessionState.setCurrentSessionState
AmplabJenkins removed a comment on pull request #32410: URL: https://github.com/apache/spark/pull/32410#issuecomment-840343604 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43016/
[GitHub] [spark] AmplabJenkins commented on pull request #32410: [SPARK-35286][SQL] Replace SessionState.start with SessionState.setCurrentSessionState
AmplabJenkins commented on pull request #32410: URL: https://github.com/apache/spark/pull/32410#issuecomment-840343604 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43016/
[GitHub] [spark] SparkQA commented on pull request #32410: [SPARK-35286][SQL] Replace SessionState.start with SessionState.setCurrentSessionState
SparkQA commented on pull request #32410: URL: https://github.com/apache/spark/pull/32410#issuecomment-840343550
[GitHub] [spark] cloud-fan commented on a change in pull request #32501: [SPARK-35359][SQL] Insert data with char/varchar datatype will fail when data length exceed length limitation
cloud-fan commented on a change in pull request #32501:
URL: https://github.com/apache/spark/pull/32501#discussion_r631597395

## File path: sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/util/CharVarcharCodegenUtils.java ##

@@ -26,7 +27,7 @@ private static UTF8String trimTrailingSpaces(
       UTF8String inputStr, int numChars, int limit) {
     int numTailSpacesToTrim = numChars - limit;
     UTF8String trimmed = inputStr.trimTrailingSpaces(numTailSpacesToTrim);
-    if (trimmed.numChars() > limit) {
+    if (trimmed.numChars() > limit && !SQLConf.get().charVarcharAsString()) {

Review comment:
We don't need this now.
[GitHub] [spark] cloud-fan commented on pull request #32207: [SPARK-35106] Avoid failing rename in HadoopMapReduceCommitProtocol with dynamic partition overwrite
cloud-fan commented on pull request #32207: URL: https://github.com/apache/spark/pull/32207#issuecomment-840342929 @YuzhouSun Can you help to take over this PR?
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32494: [SPARK-35362][SQL] Update null count in the column stats for UNION operator stats estimation
AmplabJenkins removed a comment on pull request #32494: URL: https://github.com/apache/spark/pull/32494#issuecomment-840341104 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43015/
[GitHub] [spark] SparkQA commented on pull request #32494: [SPARK-35362][SQL] Update null count in the column stats for UNION operator stats estimation
SparkQA commented on pull request #32494: URL: https://github.com/apache/spark/pull/32494#issuecomment-840341058
[GitHub] [spark] AmplabJenkins commented on pull request #32494: [SPARK-35362][SQL] Update null count in the column stats for UNION operator stats estimation
AmplabJenkins commented on pull request #32494: URL: https://github.com/apache/spark/pull/32494#issuecomment-840341104 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43015/
[GitHub] [spark] cloud-fan commented on a change in pull request #32527: [SPARK-35384][SQL] Improve performance for InvokeLike.invoke
cloud-fan commented on a change in pull request #32527:
URL: https://github.com/apache/spark/pull/32527#discussion_r631596017

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala ##

@@ -127,13 +128,18 @@ trait InvokeLike extends Expression with NonSQLExpression {
       arguments: Seq[Expression],
       input: InternalRow,
       dataType: DataType): Any = {
-    val args = arguments.map(e => e.eval(input).asInstanceOf[Object])
-    if (needNullCheck && args.exists(_ == null)) {
+    var i = 0
+    val len = arguments.length
+    while (i < len) {
+      evaluatedArgs(i) = arguments(i).eval(input).asInstanceOf[Object]
+      i += 1
+    }
+    if (needNullCheck && evaluatedArgs.contains(null)) {
       // return null if one of arguments is null
       null
     } else {
       val ret = try {
-        method.invoke(obj, args: _*)
+        method.invoke(obj, evaluatedArgs: _*)
       } catch {

Review comment:
You are right. Another idea: `obj` from `InternalRow` is always of the same class, so we can avoid this:
```
@transient lazy val method = {
  val cls = targetObject.dataType match {
    case ObjectType(cls) => cls
    case StringType => classOf[UTF8String]
    case _: DecimalType => classOf[Decimal]
    ...
  }
  findMethod(cls, encodedFunctionName, argClasses)
}
```
[GitHub] [spark] SparkQA commented on pull request #32532: [SPARK-35384][SQL][FOLLOWUP] Move `HashMap.get` out of `InvokeLike.invoke`
SparkQA commented on pull request #32532: URL: https://github.com/apache/spark/pull/32532#issuecomment-840340027 **[Test build #138499 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138499/testReport)** for PR 32532 at commit [`8a97c30`](https://github.com/apache/spark/commit/8a97c304f8656e337f98948db3454b2dfd802414).
[GitHub] [spark] SparkQA commented on pull request #32533: [SPARK-35392][ML][PYTHON] Remove Flaky GMM Test in ml/clustering.py
SparkQA commented on pull request #32533: URL: https://github.com/apache/spark/pull/32533#issuecomment-840339950 **[Test build #138498 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138498/testReport)** for PR 32533 at commit [`0081246`](https://github.com/apache/spark/commit/008124671ed27cb4367c941a1f8b73cda76e13b0).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32204: [SPARK-34494][SQL][DOCS] Move JSON data source options from Python and Scala into a single page
AmplabJenkins removed a comment on pull request #32204: URL: https://github.com/apache/spark/pull/32204#issuecomment-840339404 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43012/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32448: [SPARK-35290][SQL] Use StructType merging for unionByName with null filling
AmplabJenkins removed a comment on pull request #32448: URL: https://github.com/apache/spark/pull/32448#issuecomment-840339407 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138481/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32515: [SPARK-35380][SQL] Loading SparkSessionExtensions from ServiceLoader
AmplabJenkins removed a comment on pull request #32515: URL: https://github.com/apache/spark/pull/32515#issuecomment-840339408 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43013/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32527: [SPARK-35384][SQL] Improve performance for InvokeLike.invoke
AmplabJenkins removed a comment on pull request #32527: URL: https://github.com/apache/spark/pull/32527#issuecomment-840339402 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138480/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation
AmplabJenkins removed a comment on pull request #32498: URL: https://github.com/apache/spark/pull/32498#issuecomment-840339405 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43014/
[GitHub] [spark] AmplabJenkins commented on pull request #32448: [SPARK-35290][SQL] Use StructType merging for unionByName with null filling
AmplabJenkins commented on pull request #32448: URL: https://github.com/apache/spark/pull/32448#issuecomment-840339407 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138481/
[GitHub] [spark] AmplabJenkins commented on pull request #32515: [SPARK-35380][SQL] Loading SparkSessionExtensions from ServiceLoader
AmplabJenkins commented on pull request #32515: URL: https://github.com/apache/spark/pull/32515#issuecomment-840339408 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43013/
[GitHub] [spark] AmplabJenkins commented on pull request #32527: [SPARK-35384][SQL] Improve performance for InvokeLike.invoke
AmplabJenkins commented on pull request #32527: URL: https://github.com/apache/spark/pull/32527#issuecomment-840339402 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138480/
[GitHub] [spark] AmplabJenkins commented on pull request #32204: [SPARK-34494][SQL][DOCS] Move JSON data source options from Python and Scala into a single page
AmplabJenkins commented on pull request #32204: URL: https://github.com/apache/spark/pull/32204#issuecomment-840339404 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43012/
[GitHub] [spark] AmplabJenkins commented on pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation
AmplabJenkins commented on pull request #32498: URL: https://github.com/apache/spark/pull/32498#issuecomment-840339405 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43014/
[GitHub] [spark] SparkQA commented on pull request #32161: [SPARK-35025][SQL][PYTHON][DOCS] Move Parquet data source options from Python and Scala into a single page.
SparkQA commented on pull request #32161: URL: https://github.com/apache/spark/pull/32161#issuecomment-840339374 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43017/
[GitHub] [spark] SparkQA commented on pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation
SparkQA commented on pull request #32498: URL: https://github.com/apache/spark/pull/32498#issuecomment-840337565
[GitHub] [spark] SparkQA commented on pull request #32515: [SPARK-35380][SQL] Loading SparkSessionExtensions from ServiceLoader
SparkQA commented on pull request #32515: URL: https://github.com/apache/spark/pull/32515#issuecomment-840334344 Kubernetes integration test unable to build dist. exiting with code: 1 URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43013/
[GitHub] [spark] zhengruifeng commented on pull request #32533: [SPARK-35392][ML][PYTHON] remove Flaky GMM Test in ml/clustering.py
zhengruifeng commented on pull request #32533: URL: https://github.com/apache/spark/pull/32533#issuecomment-84040 ping @HyukjinKwon @srowen @viirya
[GitHub] [spark] zhengruifeng opened a new pull request #32533: [SPARK-35392][ML][PYTHON] remove Flaky GMM Test in ml/clustering.py
zhengruifeng opened a new pull request #32533:
URL: https://github.com/apache/spark/pull/32533

### What changes were proposed in this pull request?
Remove the check of `summary.logLikelihood` in ml/clustering.py.

### Why are the changes needed?
1. This GMM test is quite flaky; it tends to fail if we:
   - change the number of partitions;
   - just change the way the sum of weights is computed;
   - change the underlying BLAS implementation.
2. For now, just disable it; we need to use another test in the future.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Remaining test suites.
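One plausible reason the listed changes affect the result is that floating-point addition is not associative, so regrouping the same terms (for example across a different number of partitions) can change the last digits of a sum. The Python sketch below shows the effect on made-up values; `naive_sum` and `partitioned_sum` are hypothetical helpers, not code from Spark or from the test, and this is not a diagnosis of the actual failure.

```python
import math


def naive_sum(values):
    """Left-to-right accumulation, written out so the rounding order is explicit."""
    total = 0.0
    for v in values:
        total += v
    return total


def partitioned_sum(partitions):
    """Sum each partition on its own, then add up the partial sums."""
    return naive_sum([naive_sum(p) for p in partitions])


# The same three made-up values, grouped in two different ways.
one_partition = partitioned_sum([[0.1, 0.2, 0.3]])
two_partitions = partitioned_sum([[0.1], [0.2, 0.3]])

print(one_partition == two_partitions)                              # False
print(math.isclose(one_partition, two_partitions, rel_tol=1e-9))    # True
```

An exact-match check on such a value breaks under regrouping, while a tolerance-based comparison does not. Whether a tolerance would be enough for the GMM log-likelihood is not established by this thread.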
[GitHub] [spark] WangGuangxin commented on pull request #31967: [SPARK-34819][SQL] MapType supports orderable semantics
WangGuangxin commented on pull request #31967: URL: https://github.com/apache/spark/pull/31967#issuecomment-840331912 > @WangGuangxin If you cannot keep working on it, is it okay that I take this over? Sure, I'm stuck with something else, you can take this over if you have time. Thanks
[GitHub] [spark] sunchao opened a new pull request #32532: [SPARK-35384][SQL][FOLLOWUP] Move `HashMap.get` out of `InvokeLike.invoke`
sunchao opened a new pull request #32532: URL: https://github.com/apache/spark/pull/32532

### What changes were proposed in this pull request?
Move the hash map lookup out of `InvokeLike.invoke`, since it doesn't depend on the input.

### Why are the changes needed?
We shouldn't need to look up the hash map for every input row evaluated by `InvokeLike.invoke`, since the lookup doesn't depend on the input. This could improve performance a bit.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests.
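The idea in this PR, doing an input-independent lookup once instead of once per row, can be sketched in Python. This is only an illustration under assumptions: `CONVERTERS`, `PerRowLookup`, and `HoistedLookup` are hypothetical names and do not mirror Spark's actual classes or the map used in `InvokeLike`.

```python
# Hypothetical converter table standing in for a type-keyed hash map.
CONVERTERS = {"int": int, "float": float}


class PerRowLookup:
    """Performs the dictionary lookup on every call to `invoke`."""

    def __init__(self, type_name):
        self.type_name = type_name
        self.lookups = 0

    def invoke(self, value):
        self.lookups += 1
        convert = CONVERTERS.get(self.type_name, str)
        return convert(value)


class HoistedLookup:
    """Performs the lookup once, since it depends only on `type_name`."""

    def __init__(self, type_name):
        self.convert = CONVERTERS.get(type_name, str)
        self.lookups = 1

    def invoke(self, value):
        return self.convert(value)


rows = ["1", "2", "3"]
per_row = PerRowLookup("int")
hoisted = HoistedLookup("int")
per_row_out = [per_row.invoke(r) for r in rows]
hoisted_out = [hoisted.invoke(r) for r in rows]
print(per_row_out, per_row.lookups)
print(hoisted_out, hoisted.lookups)
```

Both variants return the same values; the hoisted one simply pays for the lookup once rather than once per row, which matters when `invoke` sits on a per-row hot path.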
[GitHub] [spark] cfmcgrady closed pull request #32488: [SPARK-35316][SQL] UnwrapCastInBinaryComparison support In/InSet predicate
cfmcgrady closed pull request #32488: URL: https://github.com/apache/spark/pull/32488
[GitHub] [spark] HyukjinKwon closed pull request #32523: [SPARK-35382][PYTHON] Fix lambda variable name issues in nested DataFrame functions in Python APIs.
HyukjinKwon closed pull request #32523: URL: https://github.com/apache/spark/pull/32523
[GitHub] [spark] HyukjinKwon commented on pull request #32523: [SPARK-35382][PYTHON] Fix lambda variable name issues in nested DataFrame functions in Python APIs.
HyukjinKwon commented on pull request #32523: URL: https://github.com/apache/spark/pull/32523#issuecomment-840324993 Merged to master and branch-3.1.
[GitHub] [spark] SparkQA removed a comment on pull request #32448: [SPARK-35290][SQL] Use StructType merging for unionByName with null filling
SparkQA removed a comment on pull request #32448: URL: https://github.com/apache/spark/pull/32448#issuecomment-840218983 **[Test build #138481 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138481/testReport)** for PR 32448 at commit [`93b47d3`](https://github.com/apache/spark/commit/93b47d3f190369afdf5a2a5ae0ec0c6054b56c1b).
[GitHub] [spark] SparkQA commented on pull request #32448: [SPARK-35290][SQL] Use StructType merging for unionByName with null filling
SparkQA commented on pull request #32448: URL: https://github.com/apache/spark/pull/32448#issuecomment-840324232

**[Test build #138481 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138481/testReport)** for PR 32448 at commit [`93b47d3`](https://github.com/apache/spark/commit/93b47d3f190369afdf5a2a5ae0ec0c6054b56c1b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] [spark] SparkQA removed a comment on pull request #32527: [SPARK-35384][SQL] Improve performance for InvokeLike.invoke
SparkQA removed a comment on pull request #32527: URL: https://github.com/apache/spark/pull/32527#issuecomment-840217408 **[Test build #138480 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138480/testReport)** for PR 32527 at commit [`2831f9c`](https://github.com/apache/spark/commit/2831f9c0b78aa21c6cc906370fb5069e166dbf39).
[GitHub] [spark] SparkQA commented on pull request #32527: [SPARK-35384][SQL] Improve performance for InvokeLike.invoke
SparkQA commented on pull request #32527: URL: https://github.com/apache/spark/pull/32527#issuecomment-840322575

**[Test build #138480 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138480/testReport)** for PR 32527 at commit [`2831f9c`](https://github.com/apache/spark/commit/2831f9c0b78aa21c6cc906370fb5069e166dbf39).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #32204: [SPARK-34494][SQL][DOCS] Move JSON data source options from Python and Scala into a single page
SparkQA commented on pull request #32204: URL: https://github.com/apache/spark/pull/32204#issuecomment-840318050 Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43012/
[GitHub] [spark] SparkQA commented on pull request #32204: [SPARK-34494][SQL][DOCS] Move JSON data source options from Python and Scala into a single page
SparkQA commented on pull request #32204: URL: https://github.com/apache/spark/pull/32204#issuecomment-840315107 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43012/
[GitHub] [spark] HyukjinKwon edited a comment on pull request #32204: [SPARK-34494][SQL][DOCS] Move JSON data source options from Python and Scala into a single page
HyukjinKwon edited a comment on pull request #32204: URL: https://github.com/apache/spark/pull/32204#issuecomment-840312271

@itholic:
1. Please check the option **one by one** and see if each exists, and is matched.
2. Document general options in https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html if there are missing ones
3. If you're going to do 2. separately in another PR and JIRA, don't remove general options in API documentations for now.
[GitHub] [spark] HyukjinKwon edited a comment on pull request #32204: [SPARK-34494][SQL][DOCS] Move JSON data source options from Python and Scala into a single page
HyukjinKwon edited a comment on pull request #32204: URL: https://github.com/apache/spark/pull/32204#issuecomment-840312271

@itholic:
1. Please check the option **one by one** and see if each exists.
2. Document general options in https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html if there are missing ones
3. If you're going to do 2. separately in another PR and JIRA, don't remove general options in API documentations for now.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32516: [SPARK-35364][PYTHON] Renaming the existing Koalas related codes
AmplabJenkins removed a comment on pull request #32516: URL: https://github.com/apache/spark/pull/32516#issuecomment-840312669 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43008/
[GitHub] [spark] AmplabJenkins commented on pull request #32516: [SPARK-35364][PYTHON] Renaming the existing Koalas related codes
AmplabJenkins commented on pull request #32516: URL: https://github.com/apache/spark/pull/32516#issuecomment-840312669 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43008/
[GitHub] [spark] SparkQA commented on pull request #32516: [SPARK-35364][PYTHON] Renaming the existing Koalas related codes
SparkQA commented on pull request #32516: URL: https://github.com/apache/spark/pull/32516#issuecomment-840312637
[GitHub] [spark] HyukjinKwon commented on pull request #32161: [SPARK-35025][SQL][PYTHON][DOCS] Move Parquet data source options from Python and Scala into a single page.
HyukjinKwon commented on pull request #32161: URL: https://github.com/apache/spark/pull/32161#issuecomment-840312618 Same comment goes here too: https://github.com/apache/spark/pull/32204#issuecomment-840312271
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32531: [SPARK-35394][K8S][BUILD] Move kubernetes-client.version to root pom file
AmplabJenkins removed a comment on pull request #32531: URL: https://github.com/apache/spark/pull/32531#issuecomment-840312131 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43011/
[GitHub] [spark] sunchao commented on a change in pull request #32527: [SPARK-35384][SQL] Improve performance for InvokeLike.invoke
sunchao commented on a change in pull request #32527: URL: https://github.com/apache/spark/pull/32527#discussion_r631576884

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala
## @@ -127,13 +128,18 @@ trait InvokeLike extends Expression with NonSQLExpression {
 arguments: Seq[Expression],
 input: InternalRow,
 dataType: DataType): Any = {
-val args = arguments.map(e => e.eval(input).asInstanceOf[Object])
-if (needNullCheck && args.exists(_ == null)) {
+var i = 0
+val len = arguments.length
+while (i < len) {
+  evaluatedArgs(i) = arguments(i).eval(input).asInstanceOf[Object]
+  i += 1
+}
+if (needNullCheck && evaluatedArgs.contains(null)) {
 // return null if one of arguments is null
 null
 } else {
 val ret = try {
-method.invoke(obj, args: _*)
+method.invoke(obj, evaluatedArgs: _*)
 } catch {

Review comment: I'm not sure if we can do the similar thing in `Invoke.eval` though since `obj` in `obj.getClass.getMethod(functionName, argClasses: _*)` is different for each call.
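The diff above replaces a per-row `arguments.map(...)` allocation with a `while` loop that fills a reusable `evaluatedArgs` array before the null check. A rough Python analogue of that pattern is sketched below. `Invoker` and its fields are hypothetical names chosen for illustration, and the sketch makes no claim about how Spark allocates or shares its buffer.

```python
class Invoker:
    """Evaluates argument expressions into a reusable buffer, then calls `func`."""

    def __init__(self, func, arguments, need_null_check=True):
        self.func = func
        self.arguments = arguments  # callables mapping a row to a value
        self.need_null_check = need_null_check
        # Allocated once and overwritten on every call (assumption of this sketch).
        self._evaluated = [None] * len(arguments)

    def invoke(self, row):
        i = 0
        n = len(self.arguments)
        while i < n:
            self._evaluated[i] = self.arguments[i](row)
            i += 1
        # Return None if any argument evaluated to None.
        if self.need_null_check and any(a is None for a in self._evaluated):
            return None
        return self.func(*self._evaluated)


adder = Invoker(
    lambda a, b: a + b,
    [lambda row: row.get("x"), lambda row: row.get("y")],
)
buffer_before = adder._evaluated
first = adder.invoke({"x": 1, "y": 2})
second = adder.invoke({"x": 1})  # "y" is missing, so the null check fires
print(first, second, adder._evaluated is buffer_before)
```

One trade-off of this sketch is that the shared buffer makes `invoke` unsafe to call concurrently or re-entrantly on the same instance, which a per-call allocation would not be.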
[GitHub] [spark] HyukjinKwon commented on pull request #32204: [SPARK-34494][SQL][DOCS] Move JSON data source options from Python and Scala into a single page
HyukjinKwon commented on pull request #32204: URL: https://github.com/apache/spark/pull/32204#issuecomment-840312271

@itholic:
1. Please check the option **one by one** and see if each exists.
2. Document general options in https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html if there are missing ones
3. If you're going to do this separately in a separate JIRA, don't remove general options in API documentations for now.
[GitHub] [spark] AmplabJenkins commented on pull request #32531: [SPARK-35394][K8S][BUILD] Move kubernetes-client.version to root pom file
AmplabJenkins commented on pull request #32531: URL: https://github.com/apache/spark/pull/32531#issuecomment-840312131 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43011/
[GitHub] [spark] SparkQA commented on pull request #32531: [SPARK-35394][K8S][BUILD] Move kubernetes-client.version to root pom file
SparkQA commented on pull request #32531: URL: https://github.com/apache/spark/pull/32531#issuecomment-840312101
[GitHub] [spark] HyukjinKwon commented on a change in pull request #32204: [SPARK-34494][SQL][DOCS] Move JSON data source options from Python and Scala into a single page
HyukjinKwon commented on a change in pull request #32204: URL: https://github.com/apache/spark/pull/32204#discussion_r631576139

## File path: python/pyspark/sql/streaming.py
## @@ -504,105 +504,15 @@ def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None,
 path : str
 string represents path to the JSON dataset, or RDD of Strings storing JSON objects.
-schema : :class:`pyspark.sql.types.StructType` or str, optional

Review comment: I don't think this is a general option
[GitHub] [spark] HyukjinKwon commented on a change in pull request #32204: [SPARK-34494][SQL][DOCS] Move JSON data source options from Python and Scala into a single page
HyukjinKwon commented on a change in pull request #32204: URL: https://github.com/apache/spark/pull/32204#discussion_r631575888

## File path: python/pyspark/sql/readwriter.py
## @@ -1196,39 +1097,13 @@ def json(self, path, mode=None, compression=None, dateFormat=None, timestampForm
 --
 path : str
 the path in any Hadoop supported file system
-mode : str, optional

Review comment: mode is a general option
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation
AmplabJenkins removed a comment on pull request #32498: URL: https://github.com/apache/spark/pull/32498#issuecomment-840292938 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138477/
[GitHub] [spark] SparkQA commented on pull request #32161: [SPARK-35025][SQL][PYTHON][DOCS] Move Parquet data source options from Python and Scala into a single page.
SparkQA commented on pull request #32161: URL: https://github.com/apache/spark/pull/32161#issuecomment-840310729 **[Test build #138497 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138497/testReport)** for PR 32161 at commit [`bb5cd45`](https://github.com/apache/spark/commit/bb5cd4529b07b05b21cdaf878b06b61ad717be79).
[GitHub] [spark] SparkQA commented on pull request #32410: [SPARK-35286][SQL] Replace SessionState.start with SessionState.setCurrentSessionState
SparkQA commented on pull request #32410: URL: https://github.com/apache/spark/pull/32410#issuecomment-840310594 **[Test build #138496 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138496/testReport)** for PR 32410 at commit [`4bca8ec`](https://github.com/apache/spark/commit/4bca8ecaec066ef19d04a12e134ba830320a2e0f).
[GitHub] [spark] SparkQA commented on pull request #32494: [SPARK-35362][SQL] Update null count in the column stats for UNION operator stats estimation
SparkQA commented on pull request #32494: URL: https://github.com/apache/spark/pull/32494#issuecomment-840310493 **[Test build #138495 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138495/testReport)** for PR 32494 at commit [`1573522`](https://github.com/apache/spark/commit/1573522541ceaf1e0b6e0eccb108b88f0fb1a4c6).
[GitHub] [spark] SparkQA commented on pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation
SparkQA commented on pull request #32498: URL: https://github.com/apache/spark/pull/32498#issuecomment-840310425 **[Test build #138494 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138494/testReport)** for PR 32498 at commit [`b7a6cc7`](https://github.com/apache/spark/commit/b7a6cc71110fe8de45e8c74d487ebd23b7942f34).
[GitHub] [spark] SparkQA commented on pull request #32515: [SPARK-35380][SQL] Loading SparkSessionExtensions from ServiceLoader
SparkQA commented on pull request #32515: URL: https://github.com/apache/spark/pull/32515#issuecomment-840310366 **[Test build #138493 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138493/testReport)** for PR 32515 at commit [`b8b54ea`](https://github.com/apache/spark/commit/b8b54ea9cb3bdbb8f50bdb260567dedd2af9fe1b).
[GitHub] [spark] HyukjinKwon commented on a change in pull request #32161: [SPARK-35025][SQL][PYTHON][DOCS] Move Parquet data source options from Python and Scala into a single page.
HyukjinKwon commented on a change in pull request #32161: URL: https://github.com/apache/spark/pull/32161#discussion_r631575367

## File path: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
## @@ -812,46 +812,10 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
 /**
 * Loads a Parquet file, returning the result as a `DataFrame`.
 *
- * You can set the following Parquet-specific option(s) for reading Parquet files:
- *
- * `mergeSchema` (default is the value specified in `spark.sql.parquet.mergeSchema`): sets
- * whether we should merge schemas collected from all Parquet part-files. This will override
- * `spark.sql.parquet.mergeSchema`.
- * `pathGlobFilter`: an optional glob pattern to only include files with paths matching
- * the pattern. The syntax follows org.apache.hadoop.fs.GlobFilter.
- * It does not change the behavior of partition discovery.
- * `modifiedBefore` (batch only): an optional timestamp to only include files with
- * modification times occurring before the specified Time. The provided timestamp
- * must be in the following form: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)
- * `modifiedAfter` (batch only): an optional timestamp to only include files with
- * modification times occurring after the specified Time. The provided timestamp
- * must be in the following form: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)
- * `recursiveFileLookup`: recursively scan a directory for files. Using this option
- * disables partition discovery
- * `datetimeRebaseMode` (default is the value specified in the SQL config
- * `spark.sql.parquet.datetimeRebaseModeInRead`): the rebasing mode for the values
- * of the `DATE`, `TIMESTAMP_MICROS`, `TIMESTAMP_MILLIS` logical types from the Julian to
- * Proleptic Gregorian calendar:
- *
- * `EXCEPTION` : Spark fails in reads of ancient dates/timestamps that are ambiguous
- * between the two calendars
- * `CORRECTED` : loading of dates/timestamps without rebasing
- * `LEGACY` : perform rebasing of ancient dates/timestamps from the Julian to Proleptic
- * Gregorian calendar
- *
- *
- * `int96RebaseMode` (default is the value specified in the SQL config
- * `spark.sql.parquet.int96RebaseModeInRead`): the rebasing mode for `INT96` timestamps
- * from the Julian to Proleptic Gregorian calendar:
- *
- * `EXCEPTION` : Spark fails in reads of ancient `INT96` timestamps that are ambiguous
- * between the two calendars
- * `CORRECTED` : loading of timestamps without rebasing
- * `LEGACY` : perform rebasing of ancient `INT96` timestamps from the Julian to Proleptic
- * Gregorian calendar
- *
- *
- *
- *
+ * Parquet-specific option(s) for reading Parquet files can be found in
+ * https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#data-source-option
+ * Data Source Option in the version you use.

Review comment: can you add the general options here too
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32516: [SPARK-35364][PYTHON] Renaming the existing Koalas related codes
AmplabJenkins removed a comment on pull request #32516: URL: https://github.com/apache/spark/pull/32516#issuecomment-840309736 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138488/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32520: [SPARK-35385][SQL][TESTS] Skip duplicate queries in the TPCDS-related tests
AmplabJenkins removed a comment on pull request #32520: URL: https://github.com/apache/spark/pull/32520#issuecomment-840309734 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138479/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32292: [SPARK-35162][SQL] New SQL functions: TRY_ADD/TRY_DIVIDE
AmplabJenkins removed a comment on pull request #32292: URL: https://github.com/apache/spark/pull/32292#issuecomment-840309741 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43010/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32494: [SPARK-35362][SQL] Update null count in the column stats for UNION operator stats estimation
AmplabJenkins removed a comment on pull request #32494: URL: https://github.com/apache/spark/pull/32494#issuecomment-840309740 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138478/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32515: [SPARK-35380][SQL] Loading SparkSessionExtensions from ServiceLoader
AmplabJenkins removed a comment on pull request #32515: URL: https://github.com/apache/spark/pull/32515#issuecomment-840309738 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43009/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #32494: [SPARK-35362][SQL] Update null count in the column stats for UNION operator stats estimation
AmplabJenkins commented on pull request #32494: URL: https://github.com/apache/spark/pull/32494#issuecomment-840309740 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138478/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #32292: [SPARK-35162][SQL] New SQL functions: TRY_ADD/TRY_DIVIDE
AmplabJenkins commented on pull request #32292: URL: https://github.com/apache/spark/pull/32292#issuecomment-840309741 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43010/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #32516: [SPARK-35364][PYTHON] Renaming the existing Koalas related codes
AmplabJenkins commented on pull request #32516: URL: https://github.com/apache/spark/pull/32516#issuecomment-840309736 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138488/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #32520: [SPARK-35385][SQL][TESTS] Skip duplicate queries in the TPCDS-related tests
AmplabJenkins commented on pull request #32520: URL: https://github.com/apache/spark/pull/32520#issuecomment-840309734 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138479/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #32515: [SPARK-35380][SQL] Loading SparkSessionExtensions from ServiceLoader
AmplabJenkins commented on pull request #32515: URL: https://github.com/apache/spark/pull/32515#issuecomment-840309738 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43009/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] shahidki31 commented on a change in pull request #32494: [SPARK-35362][SQL] Update null count in the column stats for UNION operator stats estimation
shahidki31 commented on a change in pull request #32494: URL: https://github.com/apache/spark/pull/32494#discussion_r631574179 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/UnionEstimation.scala ## @@ -111,6 +111,44 @@ object UnionEstimation { AttributeMap.empty[ColumnStat] } +val attrToComputeNullCount = union.children.map(_.output).transpose.zipWithIndex.filter { + case (attrs, _) => attrs.zipWithIndex.forall { +case (attr, childIndex) => + val attrStats = union.children(childIndex).stats.attributeStats + attrStats.get(attr).isDefined && attrStats(attr).nullCount.isDefined + } +} + +val newAttrStats = if (attrToComputeNullCount.nonEmpty) { + val outputAttrStats = new ArrayBuffer[(Attribute, ColumnStat)]() + attrToComputeNullCount.foreach { +case (attrs, outputIndex) => + val colWithNullStatValues = attrs.zipWithIndex.foldLeft[Option[BigInt]](None) { +case (totalNullCount, (attr, childIndex)) => + val colStat = union.children(childIndex).stats.attributeStats(attr) + if (totalNullCount.isDefined) { Review comment: Done. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
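The hunk above only derives a null count for a UNION output column when every child reports one for its corresponding attribute. A minimal, self-contained sketch of that rule follows; `ColStat` and `unionNullCounts` are hypothetical stand-ins for illustration, not Spark's `ColumnStat` or the PR's actual code.

```scala
object UnionNullCountSketch {
  // Hypothetical stand-in for ColumnStat: only the null count matters here.
  final case class ColStat(nullCount: Option[BigInt])

  // children(i)(j) holds the stats of output column j in child i,
  // or None when that child has no stats for the column.
  def unionNullCounts(children: Seq[Seq[Option[ColStat]]]): Map[Int, BigInt] =
    children.transpose.zipWithIndex.flatMap { case (stats, outputIndex) =>
      val counts = stats.map(_.flatMap(_.nullCount))
      // Keep the column only if every child contributes a defined null count.
      if (counts.forall(_.isDefined)) Some(outputIndex -> counts.flatten.sum)
      else None
    }.toMap
}
```

Columns where any child lacks a null count are simply left out of the result, which mirrors the `forall` filter in the diff.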
[GitHub] [spark] SparkQA commented on pull request #32515: [SPARK-35380][SQL] Loading SparkSessionExtensions from ServiceLoader
SparkQA commented on pull request #32515: URL: https://github.com/apache/spark/pull/32515#issuecomment-840308059 Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43009/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32515: [SPARK-35380][SQL] Loading SparkSessionExtensions from ServiceLoader
SparkQA commented on pull request #32515: URL: https://github.com/apache/spark/pull/32515#issuecomment-840305304 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43009/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #32515: [SPARK-35380][SQL] Loading SparkSessionExtensions from ServiceLoader
HyukjinKwon commented on pull request #32515: URL: https://github.com/apache/spark/pull/32515#issuecomment-840303599 Looks okay to me too -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32292: [SPARK-35162][SQL] New SQL functions: TRY_ADD/TRY_DIVIDE
SparkQA commented on pull request #32292: URL: https://github.com/apache/spark/pull/32292#issuecomment-840303409 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation
shahidki31 commented on a change in pull request #32498: URL: https://github.com/apache/spark/pull/32498#discussion_r631566208 ## File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/BasicStatsEstimationSuite.scala ## @@ -283,14 +326,17 @@ class BasicStatsEstimationSuite extends PlanTest with StatsEstimationTestBase { private def checkStats( plan: LogicalPlan, expectedStatsCboOn: Statistics, - expectedStatsCboOff: Statistics): Unit = { -withSQLConf(SQLConf.CBO_ENABLED.key -> "true") { + expectedStatsCboOff: Statistics, + extraConfigs: Map[String, String] = Map.empty): Unit = { + Review comment: Yes, removed the extra line -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sunchao commented on a change in pull request #32527: [SPARK-35384][SQL] Improve performance for InvokeLike.invoke
sunchao commented on a change in pull request #32527: URL: https://github.com/apache/spark/pull/32527#discussion_r631565642 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala ## @@ -127,13 +128,18 @@ trait InvokeLike extends Expression with NonSQLExpression { arguments: Seq[Expression], input: InternalRow, dataType: DataType): Any = { -val args = arguments.map(e => e.eval(input).asInstanceOf[Object]) -if (needNullCheck && args.exists(_ == null)) { +var i = 0 +val len = arguments.length +while (i < len) { + evaluatedArgs(i) = arguments(i).eval(input).asInstanceOf[Object] + i += 1 +} +if (needNullCheck && evaluatedArgs.contains(null)) { // return null if one of arguments is null null } else { val ret = try { -method.invoke(obj, args: _*) +method.invoke(obj, evaluatedArgs: _*) } catch { Review comment: Yeah, let me try it. In the profiling after this PR, `HashMap.get` accounts for 7.82% of the entire `invoke` call, so it seems worthwhile to do this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
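The diff above replaces a per-call `arguments.map(...)` with a `while` loop that fills a preallocated `evaluatedArgs` array. A rough, self-contained sketch of that pattern is shown below; `Row`, `Expr`, `Invoker`, and `invokeOrNull` are hypothetical names standing in for `InternalRow`, `Expression.eval`, and the surrounding trait, and the buffer is assumed to be used from a single thread.

```scala
object EvalArgsSketch {
  type Row = Map[String, Any]
  // Hypothetical stand-in for Expression.eval(input).
  type Expr = Row => Any

  final class Invoker(arguments: IndexedSeq[Expr]) {
    // Allocated once and reused across calls, so no per-row collection is built.
    private val evaluatedArgs = new Array[AnyRef](arguments.length)

    private def evalArgs(input: Row): Array[AnyRef] = {
      var i = 0
      val len = arguments.length
      while (i < len) {
        evaluatedArgs(i) = arguments(i)(input).asInstanceOf[AnyRef]
        i += 1
      }
      evaluatedArgs
    }

    // Returns null when a null check is requested and any argument is null;
    // otherwise applies f to the evaluated arguments.
    def invokeOrNull(input: Row, needNullCheck: Boolean)(f: Array[AnyRef] => Any): Any = {
      val args = evalArgs(input)
      if (needNullCheck && args.contains(null)) null else f(args)
    }
  }
}
```

Because the array is shared between calls, callers in this sketch must not keep a reference to it after `f` returns.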
[GitHub] [spark] SparkQA removed a comment on pull request #32520: [SPARK-35385][SQL][TESTS] Skip duplicate queries in the TPCDS-related tests
SparkQA removed a comment on pull request #32520: URL: https://github.com/apache/spark/pull/32520#issuecomment-840197479 **[Test build #138479 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138479/testReport)** for PR 32520 at commit [`299abb5`](https://github.com/apache/spark/commit/299abb537bf715506d77079b65a4704a04a2829f). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32520: [SPARK-35385][SQL][TESTS] Skip duplicate queries in the TPCDS-related tests
SparkQA commented on pull request #32520: URL: https://github.com/apache/spark/pull/32520#issuecomment-840300886 **[Test build #138479 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138479/testReport)** for PR 32520 at commit [`299abb5`](https://github.com/apache/spark/commit/299abb537bf715506d77079b65a4704a04a2829f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation
shahidki31 commented on a change in pull request #32498: URL: https://github.com/apache/spark/pull/32498#discussion_r631565143 ## File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/BasicStatsEstimationSuite.scala ## @@ -283,14 +326,17 @@ class BasicStatsEstimationSuite extends PlanTest with StatsEstimationTestBase { private def checkStats( plan: LogicalPlan, expectedStatsCboOn: Statistics, - expectedStatsCboOff: Statistics): Unit = { -withSQLConf(SQLConf.CBO_ENABLED.key -> "true") { + expectedStatsCboOff: Statistics, + extraConfigs: Map[String, String] = Map.empty): Unit = { + Review comment: I am not sure I understand you here. Do we need to put the histogram configs directly inside this method? By default, histograms are disabled and the default number of bins is 254. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation
shahidki31 commented on a change in pull request #32498: URL: https://github.com/apache/spark/pull/32498#discussion_r631564790 ## File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/BasicStatsEstimationSuite.scala ## @@ -77,12 +92,21 @@ class BasicStatsEstimationSuite extends PlanTest with StatsEstimationTestBase { max = Some(4), nullCount = Some(0), maxLen = Some(LongType.defaultSize), -avgLen = Some(LongType.defaultSize)) -checkStats(range, expectedStatsCboOn = rangeStats, expectedStatsCboOff = rangeStats) +avgLen = Some(LongType.defaultSize), +histogram = histogram) +val extraConfig = Map(SQLConf.HISTOGRAM_ENABLED.key -> "true", + SQLConf.HISTOGRAM_NUM_BINS.key -> "3") +checkStats(range, expectedStatsCboOn = rangeStats, + expectedStatsCboOff = rangeStats, extraConfig) } test("range with negative step") { val range = Range(-10, -20, -2, None) +val histogramBins = new Array[HistogramBin](3) +histogramBins(0) = HistogramBin(-18.0, -16.0, 2) +histogramBins(1) = HistogramBin(-16.0, -12.0, 2) +histogramBins(2) = HistogramBin(-12.0, -10.0, 1) Review comment: Added an assert to check that `range.numElements` and `ndv` are the same. ## File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/BasicStatsEstimationSuite.scala ## @@ -97,12 +121,24 @@ class BasicStatsEstimationSuite extends PlanTest with StatsEstimationTestBase { max = Some(-10), nullCount = Some(0), maxLen = Some(LongType.defaultSize), -avgLen = Some(LongType.defaultSize)) -checkStats(range, expectedStatsCboOn = rangeStats, expectedStatsCboOff = rangeStats) +avgLen = Some(LongType.defaultSize), +histogram = histogram) +val extraConfig = Map(SQLConf.HISTOGRAM_ENABLED.key -> "true", + SQLConf.HISTOGRAM_NUM_BINS.key -> "3") +checkStats(range, expectedStatsCboOn = rangeStats, + expectedStatsCboOff = rangeStats, extraConfig) } test("range with negative step where end minus start not divisible by step") { + val range = Range(-10, -20, -3, None) + +val histogramBins = new Array[HistogramBin](3) +histogramBins(0) = HistogramBin(-19.0, -16.0, 2) +histogramBins(1) = HistogramBin(-16.0, -13.0, 1) +histogramBins(2) = HistogramBin(-13.0, -10.0, 1) Review comment: Updated -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation
shahidki31 commented on a change in pull request #32498: URL: https://github.com/apache/spark/pull/32498#discussion_r631564612 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ## @@ -789,6 +797,38 @@ case class Range( } } + private def computeHistogramStatistics() = { +val numBins = conf.histogramNumBins +val height = numElements.toDouble / numBins +val percentileArray = (0 to numBins).map(i => i * height).toArray + +val binArray = new Array[HistogramBin](numBins) +var lowerIndex = percentileArray.head +var lowerBinValue = getRangeValue(0) +percentileArray.tail.zipWithIndex.foreach { case (upperIndex, binId) => + // Integer index for upper and lower values in the bin. + val upperIndexPos = math.ceil(upperIndex).toInt - 1 + val lowerIndexPos = math.ceil(lowerIndex).toInt - 1 + + val upperBinValue = getRangeValue(math.max(upperIndexPos, 0)) + val ndv = math.max(upperIndexPos - lowerIndexPos, 1) + binArray(binId) = HistogramBin(lowerBinValue, upperBinValue, ndv) + + lowerBinValue = upperBinValue + lowerIndex = upperIndex +} +Histogram(height, binArray) + } + + // Utility method to compute histogram + private def getRangeValue(index: Int): Long = { Review comment: Done ## File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/BasicStatsEstimationSuite.scala ## @@ -97,12 +121,24 @@ class BasicStatsEstimationSuite extends PlanTest with StatsEstimationTestBase { max = Some(-10), nullCount = Some(0), maxLen = Some(LongType.defaultSize), -avgLen = Some(LongType.defaultSize)) -checkStats(range, expectedStatsCboOn = rangeStats, expectedStatsCboOff = rangeStats) +avgLen = Some(LongType.defaultSize), +histogram = histogram) +val extraConfig = Map(SQLConf.HISTOGRAM_ENABLED.key -> "true", + SQLConf.HISTOGRAM_NUM_BINS.key -> "3") +checkStats(range, expectedStatsCboOn = rangeStats, + expectedStatsCboOff = rangeStats, extraConfig) } test("range with negative step where end minus start not 
divisible by step") { + Review comment: Done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
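To make the binning arithmetic in `computeHistogramStatistics` easier to follow, here is a standalone sketch that mirrors the loop from the hunk above. The body of `getRangeValue` is not shown in the diff, so `start + index * step` is an assumption, and the `HistogramBin` and `Histogram` case classes are simplified stand-ins for Spark's. Any handling of negative steps that the final PR may apply (the tests above list bins in ascending order) is not modelled.

```scala
object RangeHistogramSketch {
  // Simplified stand-ins for Spark's HistogramBin and Histogram.
  final case class HistogramBin(lo: Double, hi: Double, ndv: Long)
  final case class Histogram(height: Double, bins: Array[HistogramBin])

  def compute(start: Long, step: Long, numElements: Int, numBins: Int): Histogram = {
    // Assumption: the element at position `index` is start + index * step.
    def getRangeValue(index: Int): Long = start + index * step

    val height = numElements.toDouble / numBins
    // Fractional element positions at which each bin ends.
    val percentiles = (0 to numBins).map(i => i * height).toArray

    val bins = new Array[HistogramBin](numBins)
    var lowerIndex = percentiles.head
    var lowerBinValue = getRangeValue(0)
    percentiles.tail.zipWithIndex.foreach { case (upperIndex, binId) =>
      // Integer positions of the last element in this bin and in the previous one.
      val upperIndexPos = math.ceil(upperIndex).toInt - 1
      val lowerIndexPos = math.ceil(lowerIndex).toInt - 1
      val upperBinValue = getRangeValue(math.max(upperIndexPos, 0))
      val ndv = math.max(upperIndexPos - lowerIndexPos, 1)
      bins(binId) = HistogramBin(lowerBinValue.toDouble, upperBinValue.toDouble, ndv.toLong)
      lowerBinValue = upperBinValue
      lowerIndex = upperIndex
    }
    Histogram(height, bins)
  }
}
```

For example, `RangeHistogramSketch.compute(0, 1, 12, 3)` yields a height of 4.0 and the bins (0.0, 3.0, 4), (3.0, 7.0, 4), (7.0, 11.0, 4).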
[GitHub] [spark] shahidki31 commented on a change in pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation
shahidki31 commented on a change in pull request #32498: URL: https://github.com/apache/spark/pull/32498#discussion_r631564557 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ## @@ -789,6 +797,38 @@ case class Range( } } + private def computeHistogramStatistics() = { +val numBins = conf.histogramNumBins +val height = numElements.toDouble / numBins +val percentileArray = (0 to numBins).map(i => i * height).toArray + +val binArray = new Array[HistogramBin](numBins) +var lowerIndex = percentileArray.head +var lowerBinValue = getRangeValue(0) Review comment: Yes, updated. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #32516: [SPARK-35364][PYTHON] Renaming the existing Koalas related codes
SparkQA removed a comment on pull request #32516: URL: https://github.com/apache/spark/pull/32516#issuecomment-840286547 **[Test build #138488 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138488/testReport)** for PR 32516 at commit [`702629c`](https://github.com/apache/spark/commit/702629ccead13baba006eab8a6340b49722bf60a). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32516: [SPARK-35364][PYTHON] Renaming the existing Koalas related codes
SparkQA commented on pull request #32516: URL: https://github.com/apache/spark/pull/32516#issuecomment-840298542 **[Test build #138488 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138488/testReport)** for PR 32516 at commit [`702629c`](https://github.com/apache/spark/commit/702629ccead13baba006eab8a6340b49722bf60a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #32527: [SPARK-35384][SQL] Improve performance for InvokeLike.invoke
cloud-fan commented on a change in pull request #32527: URL: https://github.com/apache/spark/pull/32527#discussion_r631561074 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala ## @@ -127,13 +128,18 @@ trait InvokeLike extends Expression with NonSQLExpression { arguments: Seq[Expression], input: InternalRow, dataType: DataType): Any = { -val args = arguments.map(e => e.eval(input).asInstanceOf[Object]) -if (needNullCheck && args.exists(_ == null)) { +var i = 0 +val len = arguments.length +while (i < len) { + evaluatedArgs(i) = arguments(i).eval(input).asInstanceOf[Object] + i += 1 +} +if (needNullCheck && evaluatedArgs.contains(null)) { // return null if one of arguments is null null } else { val ret = try { -method.invoke(obj, args: _*) +method.invoke(obj, evaluatedArgs: _*) } catch { Review comment: We can do a similar thing in `Invoke.eval`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #32527: [SPARK-35384][SQL] Improve performance for InvokeLike.invoke
cloud-fan commented on a change in pull request #32527: URL: https://github.com/apache/spark/pull/32527#discussion_r631560800 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/objects/objects.scala ## @@ -127,13 +128,18 @@ trait InvokeLike extends Expression with NonSQLExpression { arguments: Seq[Expression], input: InternalRow, dataType: DataType): Any = { -val args = arguments.map(e => e.eval(input).asInstanceOf[Object]) -if (needNullCheck && args.exists(_ == null)) { +var i = 0 +val len = arguments.length +while (i < len) { + evaluatedArgs(i) = arguments(i).eval(input).asInstanceOf[Object] + i += 1 +} +if (needNullCheck && evaluatedArgs.contains(null)) { // return null if one of arguments is null null } else { val ret = try { -method.invoke(obj, args: _*) +method.invoke(obj, evaluatedArgs: _*) } catch { Review comment: Can we also improve the last piece? ``` val boxedClass = ScalaReflection.typeBoxedJavaMapping.get(dataType) if (boxedClass.isDefined) { boxedClass.get.cast(ret) } else { ret } ``` We can create a function for it ``` private lazy val boxing: Any => Any = ScalaReflection.typeBoxedJavaMapping.get(dataType).map(_.cast(_)).getOrElse(identity) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
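The suggestion above is to resolve the boxed class once and keep a ready-made function, rather than doing a map lookup on every call. A self-contained sketch of that idea follows. The `typeBoxedJavaMapping` here is a hypothetical, string-keyed stand-in for `ScalaReflection.typeBoxedJavaMapping`, and `Invoker` is an illustrative name. The function is written with explicit lambdas, since the placeholder form `_.cast(_)` in the quoted one-liner would likely be read as a two-argument function.

```scala
object BoxingSketch {
  // Hypothetical stand-in for ScalaReflection.typeBoxedJavaMapping, keyed by type name.
  val typeBoxedJavaMapping: Map[String, Class[_]] = Map(
    "int" -> classOf[java.lang.Integer],
    "long" -> classOf[java.lang.Long])

  final class Invoker(dataType: String) {
    // Resolved lazily, once per instance; later calls only apply the function.
    lazy val boxing: Any => Any =
      typeBoxedJavaMapping.get(dataType) match {
        case Some(cls) => (v: Any) => cls.cast(v.asInstanceOf[AnyRef])
        case None => (v: Any) => v
      }
  }
}
```

With this shape, the per-row path becomes `boxing(ret)` and the `HashMap.get` lookup happens only when the `lazy val` is first forced.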
[GitHub] [spark] SparkQA removed a comment on pull request #32494: [SPARK-35362][SQL] Update null count in the column stats for UNION operator stats estimation
SparkQA removed a comment on pull request #32494: URL: https://github.com/apache/spark/pull/32494#issuecomment-840190295 **[Test build #138478 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138478/testReport)** for PR 32494 at commit [`c929124`](https://github.com/apache/spark/commit/c929124f5ce2045da43314941d513b57ce9d553a). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32494: [SPARK-35362][SQL] Update null count in the column stats for UNION operator stats estimation
SparkQA commented on pull request #32494: URL: https://github.com/apache/spark/pull/32494#issuecomment-840293326 **[Test build #138478 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138478/testReport)** for PR 32494 at commit [`c929124`](https://github.com/apache/spark/commit/c929124f5ce2045da43314941d513b57ce9d553a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation
AmplabJenkins commented on pull request #32498: URL: https://github.com/apache/spark/pull/32498#issuecomment-840292938 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138477/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #32498: [SPARK-35368][SQL] Update histogram statistics for RANGE operator for stats estimation
SparkQA removed a comment on pull request #32498: URL: https://github.com/apache/spark/pull/32498#issuecomment-840190243 **[Test build #138477 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138477/testReport)** for PR 32498 at commit [`0bb49b3`](https://github.com/apache/spark/commit/0bb49b3a15b4bf2c59916cce91d5aba285812079). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org