[GitHub] [spark] SparkQA removed a comment on pull request #33665: [SPARK-36428][SQL] the seconds parameter of make_timestamp should accept integer type
SparkQA removed a comment on pull request #33665: URL: https://github.com/apache/spark/pull/33665#issuecomment-894936793 **[Test build #142206 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142206/testReport)** for PR 33665 at commit [`3496a4d`](https://github.com/apache/spark/commit/3496a4dfa53d3ab9cddf8c085c61d8b86757eda5). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #33683: [SPARK-36041][SS][DOCS] Introduce the RocksDBStateStoreProvider in the programming guide
SparkQA removed a comment on pull request #33683: URL: https://github.com/apache/spark/pull/33683#issuecomment-894972160 **[Test build #142209 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142209/testReport)** for PR 33683 at commit [`09f6aeb`](https://github.com/apache/spark/commit/09f6aeb529ee390b2e6c61c9e780fe41b89fc41c).
[GitHub] [spark] SparkQA removed a comment on pull request #33682: [WIP][SPARK-36456][CORE][SQL][STRUCTURED STREAMING] Clean up compilation warnings related to `method closeQuietly in class IOUtils is
SparkQA removed a comment on pull request #33682: URL: https://github.com/apache/spark/pull/33682#issuecomment-894936706 **[Test build #142205 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142205/testReport)** for PR 33682 at commit [`6bd69d0`](https://github.com/apache/spark/commit/6bd69d05a5d298ba664ef8a46fe98e4b6e6736f8).
[GitHub] [spark] SparkQA commented on pull request #33665: [SPARK-36428][SQL] the seconds parameter of make_timestamp should accept integer type
SparkQA commented on pull request #33665: URL: https://github.com/apache/spark/pull/33665#issuecomment-894991717 **[Test build #142206 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142206/testReport)** for PR 33665 at commit [`3496a4d`](https://github.com/apache/spark/commit/3496a4dfa53d3ab9cddf8c085c61d8b86757eda5).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AngersZhuuuu commented on pull request #33686: [SPARK-36086][SQL] CollapseProject project replace alias should use origin column name
AngersZhuuuu commented on pull request #33686: URL: https://github.com/apache/spark/pull/33686#issuecomment-894987234 FYI @cloud-fan
[GitHub] [spark] SparkQA commented on pull request #33682: [WIP][SPARK-36456][CORE][SQL][STRUCTURED STREAMING] Clean up compilation warnings related to `method closeQuietly in class IOUtils is depreca
SparkQA commented on pull request #33682: URL: https://github.com/apache/spark/pull/33682#issuecomment-894986988 **[Test build #142205 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142205/testReport)** for PR 33682 at commit [`6bd69d0`](https://github.com/apache/spark/commit/6bd69d05a5d298ba664ef8a46fe98e4b6e6736f8).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class BlockSavedOnDecommissionedBlockManagerException(blockId: BlockId)`
[GitHub] [spark] AngersZhuuuu commented on pull request #33685: [SPARK-36086][SQL] CollapseProject project replace alias should use origin column name
AngersZhuuuu commented on pull request #33685: URL: https://github.com/apache/spark/pull/33685#issuecomment-894985919 FYI @cloud-fan
[GitHub] [spark] AngersZhuuuu opened a new pull request #33686: [SPARK-36086][SQL] CollapseProject project replace alias should use origin column name
AngersZhuuuu opened a new pull request #33686: URL: https://github.com/apache/spark/pull/33686

### What changes were proposed in this pull request?

Without this patch, the added UT fails as below:

```
[info] - SHOW TABLES V2: SPARK-36086: CollapseProject project replace alias should use origin column name *** FAILED *** (4 seconds, 935 milliseconds)
[info]   java.lang.RuntimeException: After applying rule org.apache.spark.sql.catalyst.optimizer.CollapseProject in batch Operator Optimization before Inferring Filters, the structural integrity of the plan is broken.
[info]   at org.apache.spark.sql.errors.QueryExecutionErrors$.structuralIntegrityIsBrokenAfterApplyingRuleError(QueryExecutionErrors.scala:1217)
[info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:229)
[info]   at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
[info]   at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
[info]   at scala.collection.immutable.List.foldLeft(List.scala:91)
[info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:208)
[info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:200)
[info]   at scala.collection.immutable.List.foreach(List.scala:431)
[info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:200)
[info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:179)
[info]   at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
```

When `CollapseProject` replaces an alias in a project, it should use the original column name.

### Why are the changes needed?

Fix bug

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added UT
[GitHub] [spark] HeartSaVioR commented on pull request #33681: [SPARK-36455][SS] Provide an example of complex session window via flatMapGroupsWithState
HeartSaVioR commented on pull request #33681: URL: https://github.com/apache/spark/pull/33681#issuecomment-894985611 retest this, please
[GitHub] [spark] HeartSaVioR commented on a change in pull request #33683: [SPARK-36041][SS][DOCS] Introduce the RocksDBStateStoreProvider in the programming guide
HeartSaVioR commented on a change in pull request #33683: URL: https://github.com/apache/spark/pull/33683#discussion_r684946042

## File path: docs/structured-streaming-programming-guide.md ##

@@ -1814,6 +1814,23 @@ Specifically for built-in HDFS state store provider, users can check the state s
 it is best if cache missing count is minimized that means Spark won't waste too much time on loading checkpointed state. User can increase Spark locality waiting configurations to avoid loading state store providers in different executors across batches.
+### RocksDB state store implementation
+
+As of Spark 3.2, we add a new build-in state store implementation, RocksDB state store provider.
+
+The current build-in HDFS state store provider has two major drawbacks:
+
+* The amount of state that can be maintained is limited by the heap size of the executors
+* State expiration by watermark and/or timeouts require full scans over all the data
+
+The RocksDB-based State Store implementation can address these drawbacks:
+
+* RocksDB can serve data from the disk with a configurable amount of non-JVM memory.
+* Sorting keys using the appropriate column should avoid full scans to find the to-be-dropped keys.

Review comment: Please correct me if I'm missing something; while this could be something we can evaluate and address, it is not true, at least for now. We don't distinguish the event time field in the state store. Prefix scan is the only thing we leverage sorted keys for at the moment.
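To make the distinction in the review comment above concrete, here is a toy Python sketch, not Spark or RocksDB code. `SortedStore`, `prefix_scan`, `expire_before`, and the `(group, event_time)` key layout are all hypothetical names chosen for illustration. It shows that sorted keys let a prefix scan touch only the matching range, while expiring by event time still visits every key when the event time is not the leading part of the key.

```python
import bisect


class SortedStore:
    """Toy in-memory store with keys kept in sorted order.

    Hypothetical stand-in for a sorted key-value store; this is not
    the API of Spark's state store or of RocksDB.
    """

    def __init__(self):
        self._keys = []    # sorted list of tuple keys, e.g. (group, event_time)
        self._values = {}  # key -> value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def prefix_scan(self, prefix):
        """Return entries whose key starts with `prefix`.

        Binary search finds the first candidate; iteration stops at the
        first key that no longer matches, so only the matching range is read.
        """
        i = bisect.bisect_left(self._keys, prefix)
        result = []
        while i < len(self._keys) and self._keys[i][:len(prefix)] == prefix:
            key = self._keys[i]
            result.append((key, self._values[key]))
            i += 1
        return result

    def expire_before(self, watermark):
        """Drop entries whose event time (last key field) is below `watermark`.

        Event time is not the leading key field in this layout, so the sort
        order does not help and every key is inspected.
        """
        expired = [k for k in self._keys if k[-1] < watermark]
        for k in expired:
            self._keys.remove(k)
            del self._values[k]
        return expired


store = SortedStore()
store.put(("alice", 10), "a1")
store.put(("bob", 5), "b1")
store.put(("alice", 20), "a2")
print(store.prefix_scan(("alice",)))
```

Under these assumptions, avoiding the full scan on expiration would require a key layout (or a secondary index) ordered by event time, which is the part the comment says is not implemented yet.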
[GitHub] [spark] AngersZhuuuu opened a new pull request #33685: [SPARK-36086][SQL] CollapseProject project replace alias should use origin column name
AngersZhuuuu opened a new pull request #33685: URL: https://github.com/apache/spark/pull/33685

### What changes were proposed in this pull request?

Without this patch, the added UT fails as below:

```
[info] - SHOW TABLES V2: SPARK-36086: CollapseProject project replace alias should use origin column name *** FAILED *** (4 seconds, 935 milliseconds)
[info]   java.lang.RuntimeException: After applying rule org.apache.spark.sql.catalyst.optimizer.CollapseProject in batch Operator Optimization before Inferring Filters, the structural integrity of the plan is broken.
[info]   at org.apache.spark.sql.errors.QueryExecutionErrors$.structuralIntegrityIsBrokenAfterApplyingRuleError(QueryExecutionErrors.scala:1217)
[info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:229)
[info]   at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
[info]   at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
[info]   at scala.collection.immutable.List.foldLeft(List.scala:91)
[info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:208)
[info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:200)
[info]   at scala.collection.immutable.List.foreach(List.scala:431)
[info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:200)
[info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:179)
[info]   at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
```

When `CollapseProject` replaces an alias in a project, it should use the original column name.

### Why are the changes needed?

Fix bug

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added UT
[GitHub] [spark] SparkQA commented on pull request #33684: [WIP][SPARK-36429][SQL] JacksonParser should throw exception when data type unsupported.
SparkQA commented on pull request #33684: URL: https://github.com/apache/spark/pull/33684#issuecomment-894979263 **[Test build #142210 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142210/testReport)** for PR 33684 at commit [`4f0df82`](https://github.com/apache/spark/commit/4f0df82fd0c391944d754d3ff72ea0681e024d31).
[GitHub] [spark] HeartSaVioR commented on pull request #33681: [SPARK-36455][SS] Provide an example of complex session window via flatMapGroupsWithState
HeartSaVioR commented on pull request #33681: URL: https://github.com/apache/spark/pull/33681#issuecomment-894978776 cc. @viirya @xuanyuanking
[GitHub] [spark] beliefer opened a new pull request #33684: [WIP][SPARK-36429][SQL] JacksonParser should throw exception when data type unsupported.
beliefer opened a new pull request #33684: URL: https://github.com/apache/spark/pull/33684

### What changes were proposed in this pull request?

Currently, when `set spark.sql.timestampType=TIMESTAMP_NTZ`, the behavior differs between `from_json` and `from_csv`.

```
-- !query
select from_json('{"t":"26/October/2015"}', 't Timestamp', map('timestampFormat', 'dd/M/'))
-- !query schema
struct>
-- !query output
{"t":null}
```

```
-- !query
select from_csv('26/October/2015', 't Timestamp', map('timestampFormat', 'dd/M/'))
-- !query schema
struct<>
-- !query output
java.lang.Exception
Unsupported type: timestamp_ntz
```

We should make `from_json` throw an exception too. This PR addresses the discussion at https://github.com/apache/spark/pull/33640#discussion_r682862523

### Why are the changes needed?

Make the behavior of `from_json` more reasonable.

### Does this PR introduce _any_ user-facing change?

Yes. `from_json` throws an exception when `spark.sql.timestampType=TIMESTAMP_NTZ` is set.

### How was this patch tested?

Tests updated.
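The inconsistency described above can be modeled with a small Python sketch. It is a toy, not Spark's `JacksonParser` or CSV parser: `SUPPORTED_TYPES`, `convert_lenient`, and `convert_strict` are hypothetical names, and the message does not say why `from_json` currently yields null, so the sketch only mirrors the observable difference (a silent null versus an explicit error).

```python
# Hypothetical set of types this toy converter understands.
SUPPORTED_TYPES = {"string", "int", "timestamp"}


def convert_lenient(value, type_name):
    """Return None for an unsupported type, hiding the problem from the caller.

    Mirrors the `{"t":null}` output reported for from_json.
    """
    if type_name not in SUPPORTED_TYPES:
        return None
    return value


def convert_strict(value, type_name):
    """Raise for an unsupported type so the caller sees the problem.

    Mirrors the "Unsupported type: timestamp_ntz" error reported for from_csv.
    """
    if type_name not in SUPPORTED_TYPES:
        raise ValueError(f"Unsupported type: {type_name}")
    return value


print(convert_lenient("26/October/2015", "timestamp_ntz"))
try:
    convert_strict("26/October/2015", "timestamp_ntz")
except ValueError as err:
    print(err)
```

The PR proposes the strict behavior for both functions, so that an unsupported type is not confused with a value that merely failed to parse.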
[GitHub] [spark] AmplabJenkins commented on pull request #33683: [SPARK-36041][SS][DOCS] Introduce the RocksDBStateStoreProvider in the programming guide
AmplabJenkins commented on pull request #33683: URL: https://github.com/apache/spark/pull/33683#issuecomment-894976783 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/142209/
[GitHub] [spark] SparkQA commented on pull request #33683: [SPARK-36041][SS][DOCS] Introduce the RocksDBStateStoreProvider in the programming guide
SparkQA commented on pull request #33683: URL: https://github.com/apache/spark/pull/33683#issuecomment-894976646 **[Test build #142209 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142209/testReport)** for PR 33683 at commit [`09f6aeb`](https://github.com/apache/spark/commit/09f6aeb529ee390b2e6c61c9e780fe41b89fc41c).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #33681: [SPARK-36455][SS] Provide an example of complex session window via flatMapGroupsWithState
SparkQA commented on pull request #33681: URL: https://github.com/apache/spark/pull/33681#issuecomment-894976016 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46721/
[GitHub] [spark] SparkQA commented on pull request #33681: [SPARK-36455][SS] Provide an example of complex session window via flatMapGroupsWithState
SparkQA commented on pull request #33681: URL: https://github.com/apache/spark/pull/33681#issuecomment-894973226 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46720/
[GitHub] [spark] SparkQA commented on pull request #33683: [SPARK-36041][SS][DOCS] Introduce the RocksDBStateStoreProvider in the programming guide
SparkQA commented on pull request #33683: URL: https://github.com/apache/spark/pull/33683#issuecomment-894972160 **[Test build #142209 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142209/testReport)** for PR 33683 at commit [`09f6aeb`](https://github.com/apache/spark/commit/09f6aeb529ee390b2e6c61c9e780fe41b89fc41c).
[GitHub] [spark] xuanyuanking commented on pull request #33683: [SPARK-36041][SS][DOCS] Introduce the RocksDBStateStoreProvider in the programming guide
xuanyuanking commented on pull request #33683: URL: https://github.com/apache/spark/pull/33683#issuecomment-894972015 cc @HeartSaVioR
[GitHub] [spark] xuanyuanking opened a new pull request #33683: [SPARK-36041][SS][DOCS] Introduce the RocksDBStateStoreProvider in the programming guide
xuanyuanking opened a new pull request #33683: URL: https://github.com/apache/spark/pull/33683

### What changes were proposed in this pull request?

Add the document for the new RocksDBStateStoreProvider.

### Why are the changes needed?

User guide for the new feature.

### Does this PR introduce _any_ user-facing change?

No, doc only.

### How was this patch tested?

Doc only.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #33646: [SPARK-36388][SPARK-36386][PYTHON] Fix DataFrame groupby-rolling and groupby-expanding to follow pandas 1.3
AmplabJenkins removed a comment on pull request #33646: URL: https://github.com/apache/spark/pull/33646#issuecomment-894907498
[GitHub] [spark] AmplabJenkins commented on pull request #33646: [SPARK-36388][SPARK-36386][PYTHON] Fix DataFrame groupby-rolling and groupby-expanding to follow pandas 1.3
AmplabJenkins commented on pull request #33646: URL: https://github.com/apache/spark/pull/33646#issuecomment-894971483 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/46719/
[GitHub] [spark] SparkQA commented on pull request #33646: [SPARK-36388][SPARK-36386][PYTHON] Fix DataFrame groupby-rolling and groupby-expanding to follow pandas 1.3
SparkQA commented on pull request #33646: URL: https://github.com/apache/spark/pull/33646#issuecomment-894971443 Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46719/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #33681: [SPARK-36455][SS] Provide an example of complex session window via flatMapGroupsWithState
AmplabJenkins removed a comment on pull request #33681: URL: https://github.com/apache/spark/pull/33681#issuecomment-894923739
[GitHub] [spark] AmplabJenkins removed a comment on pull request #33682: [WIP][SPARK-36456][CORE][SQL][STRUCTURED STREAMING] Clean up compilation warnings related to `method closeQuietly in class IOUt
AmplabJenkins removed a comment on pull request #33682: URL: https://github.com/apache/spark/pull/33682#issuecomment-894970697 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/46717/
[GitHub] [spark] AmplabJenkins commented on pull request #33682: [WIP][SPARK-36456][CORE][SQL][STRUCTURED STREAMING] Clean up compilation warnings related to `method closeQuietly in class IOUtils is d
AmplabJenkins commented on pull request #33682: URL: https://github.com/apache/spark/pull/33682#issuecomment-894970697 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/46717/
[GitHub] [spark] AmplabJenkins commented on pull request #33681: [SPARK-36455][SS] Provide an example of complex session window via flatMapGroupsWithState
AmplabJenkins commented on pull request #33681: URL: https://github.com/apache/spark/pull/33681#issuecomment-894970696 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/142208/
[GitHub] [spark] SparkQA commented on pull request #33682: [WIP][SPARK-36456][CORE][SQL][STRUCTURED STREAMING] Clean up compilation warnings related to `method closeQuietly in class IOUtils is depreca
SparkQA commented on pull request #33682: URL: https://github.com/apache/spark/pull/33682#issuecomment-894968632 Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46717/
[GitHub] [spark] SaurabhChawla100 commented on pull request #33679: [SPARK-36452][SQL]: Add the support in Spark for having group by map datatype column for the scenario that works in Hive
SaurabhChawla100 commented on pull request #33679: URL: https://github.com/apache/spark/pull/33679#issuecomment-894968492

> I thought @maropu is still working on this? (#32552)

I was not aware that there is already a JIRA for this map issue. Yes, this PR (https://github.com/apache/spark/pull/32552) will fix the use case that I am trying to address in this PR.
[GitHub] [spark] SaurabhChawla100 commented on a change in pull request #33679: [SPARK-36452][SQL]: Add the support in Spark for having group by map datatype column for the scenario that works in Hive
SaurabhChawla100 commented on a change in pull request #33679: URL: https://github.com/apache/spark/pull/33679#discussion_r684924662

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ordering.scala ##

@@ -97,13 +97,18 @@ object InterpretedOrdering {
 object RowOrdering extends CodeGeneratorWithInterpretedFallback[Seq[SortOrder], BaseOrdering] {
   /**
-   * Returns true iff the data type can be ordered (i.e. can be sorted).
+   * Returns true if the data type can be ordered (i.e. can be sorted).
    */
-  def isOrderable(dataType: DataType): Boolean = dataType match {
+  def isOrderable(dataType: DataType,

Review comment: @HyukjinKwon - Thanks for checking this PR. Yes, we can wait for https://github.com/apache/spark/pull/32552. The fix in that PR will work with group by, order by, and partition by in a window.
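For readers unfamiliar with the check being changed in the diff above, the following toy Python sketch shows the general shape of a recursive "can this type be ordered" test. It is not the Spark implementation: the `Atomic`, `Array`, `Struct`, and `Map` classes are hypothetical, and the rules (atomic types are orderable, arrays and structs are orderable when their members are, maps are not) are an assumption about the behavior before this PR. The new signature in the diff is truncated, so it is deliberately not modeled here.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Atomic:
    name: str  # e.g. "int", "string"


@dataclass(frozen=True)
class Array:
    element: object


@dataclass(frozen=True)
class Struct:
    fields: Tuple[object, ...]


@dataclass(frozen=True)
class Map:
    key: object
    value: object


def is_orderable(data_type) -> bool:
    """Toy recursive check; the rules are assumptions, see the lead-in."""
    if isinstance(data_type, Atomic):
        return True
    if isinstance(data_type, Array):
        return is_orderable(data_type.element)
    if isinstance(data_type, Struct):
        return all(is_orderable(f) for f in data_type.fields)
    # Maps (and anything unrecognized) are treated as not orderable here.
    return False


int_t = Atomic("int")
print(is_orderable(Array(int_t)))
print(is_orderable(Map(Atomic("string"), int_t)))
```

Under these assumed rules a map column is rejected, and so is any struct or array that contains one, which is why grouping by a map column needs a dedicated change such as the one discussed in this thread.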
[GitHub] [spark] SparkQA removed a comment on pull request #33681: [SPARK-36455][SS] Provide an example of complex session window via flatMapGroupsWithState
SparkQA removed a comment on pull request #33681: URL: https://github.com/apache/spark/pull/33681#issuecomment-894959813 **[Test build #142208 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142208/testReport)** for PR 33681 at commit [`2d2d67f`](https://github.com/apache/spark/commit/2d2d67f1db83e88155162f990832bd37e8fef714).
[GitHub] [spark] SparkQA commented on pull request #33681: [SPARK-36455][SS] Provide an example of complex session window via flatMapGroupsWithState
SparkQA commented on pull request #33681: URL: https://github.com/apache/spark/pull/33681#issuecomment-894964222 **[Test build #142208 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142208/testReport)** for PR 33681 at commit [`2d2d67f`](https://github.com/apache/spark/commit/2d2d67f1db83e88155162f990832bd37e8fef714). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] AmplabJenkins commented on pull request #33665: [SPARK-36428][SQL] the seconds parameter of make_timestamp should accept integer type
AmplabJenkins commented on pull request #33665: URL: https://github.com/apache/spark/pull/33665#issuecomment-894960812 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/46718/
[GitHub] [spark] SparkQA commented on pull request #33665: [SPARK-36428][SQL] the seconds parameter of make_timestamp should accept integer type
SparkQA commented on pull request #33665: URL: https://github.com/apache/spark/pull/33665#issuecomment-894960783 Kubernetes integration test status success URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46718/
[GitHub] [spark] SparkQA commented on pull request #33681: [SPARK-36455][SS] Provide an example of complex session window via flatMapGroupsWithState
SparkQA commented on pull request #33681: URL: https://github.com/apache/spark/pull/33681#issuecomment-894959813 **[Test build #142208 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142208/testReport)** for PR 33681 at commit [`2d2d67f`](https://github.com/apache/spark/commit/2d2d67f1db83e88155162f990832bd37e8fef714).
[GitHub] [spark] c21 commented on pull request #33679: [SPARK-36452][SQL]: Add the support in Spark for having group by map datatype column for the scenario that works in Hive
c21 commented on pull request #33679: URL: https://github.com/apache/spark/pull/33679#issuecomment-894958699 I thought @maropu is still working on this? (https://github.com/apache/spark/pull/32552)
[GitHub] [spark] HeartSaVioR commented on pull request #33681: [SPARK-36455][SS] Provide an example of complex session window via flatMapGroupsWithState
HeartSaVioR commented on pull request #33681: URL: https://github.com/apache/spark/pull/33681#issuecomment-894958524 Addressed Java port. This is now ready to review.
[GitHub] [spark] c21 removed a comment on pull request #33680: [SPARK-36454][SQL] Not push down partition filter to ORCScan for DSv2
c21 removed a comment on pull request #33680: URL: https://github.com/apache/spark/pull/33680#issuecomment-894957859 LGTM as well
[GitHub] [spark] c21 commented on pull request #33680: [SPARK-36454][SQL] Not push down partition filter to ORCScan for DSv2
c21 commented on pull request #33680: URL: https://github.com/apache/spark/pull/33680#issuecomment-894957859 LGTM as well
[GitHub] [spark] SparkQA commented on pull request #33646: [SPARK-36388][SPARK-36386][PYTHON] Fix DataFrame groupby-rolling and groupby-expanding to follow pandas 1.3
SparkQA commented on pull request #33646: URL: https://github.com/apache/spark/pull/33646#issuecomment-894955039 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46719/
[GitHub] [spark] AmplabJenkins commented on pull request #33646: [SPARK-36388][SPARK-36386][PYTHON] Fix DataFrame groupby-rolling and groupby-expanding to follow pandas 1.3
AmplabJenkins commented on pull request #33646: URL: https://github.com/apache/spark/pull/33646#issuecomment-894953783 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/142207/
[GitHub] [spark] SparkQA commented on pull request #33682: [WIP][SPARK-36456][CORE][SQL] Clean up compilation warnings related to `method closeQuietly in class IOUtils is deprecated`
SparkQA commented on pull request #33682: URL: https://github.com/apache/spark/pull/33682#issuecomment-894953320 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46717/
[GitHub] [spark] SparkQA removed a comment on pull request #33646: [SPARK-36388][SPARK-36386][PYTHON] Fix DataFrame groupby-rolling and groupby-expanding to follow pandas 1.3
SparkQA removed a comment on pull request #33646: URL: https://github.com/apache/spark/pull/33646#issuecomment-894936794 **[Test build #142207 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142207/testReport)** for PR 33646 at commit [`a5c169a`](https://github.com/apache/spark/commit/a5c169a6dc3ca4ecdadae0beabac8565def7a4f8).
[GitHub] [spark] SparkQA commented on pull request #33665: [SPARK-36428][SQL] the seconds parameter of make_timestamp should accept integer type
SparkQA commented on pull request #33665: URL: https://github.com/apache/spark/pull/33665#issuecomment-894948305 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46718/
[GitHub] [spark] SparkQA commented on pull request #33646: [SPARK-36388][SPARK-36386][PYTHON] Fix DataFrame groupby-rolling and groupby-expanding to follow pandas 1.3
SparkQA commented on pull request #33646: URL: https://github.com/apache/spark/pull/33646#issuecomment-894945197 **[Test build #142207 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142207/testReport)** for PR 33646 at commit [`a5c169a`](https://github.com/apache/spark/commit/a5c169a6dc3ca4ecdadae0beabac8565def7a4f8). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] venkata91 commented on a change in pull request #33615: [SPARK-36374][SHUFFLE][DOC] Push-based shuffle high level user documentation
venkata91 commented on a change in pull request #33615: URL: https://github.com/apache/spark/pull/33615#discussion_r684896710 ## File path: docs/configuration.md ## @@ -3134,3 +3134,111 @@ The stage level scheduling feature allows users to specify task and executor res This is only available for the RDD API in Scala, Java, and Python. It is available on YARN and Kubernetes when dynamic allocation is enabled. See the [YARN](running-on-yarn.html#stage-level-scheduling-overview) page or [Kubernetes](running-on-kubernetes.html#stage-level-scheduling-overview) page for more implementation details. See the `RDD.withResources` and `ResourceProfileBuilder` API's for using this feature. The current implementation acquires new executors for each `ResourceProfile` created and currently has to be an exact match. Spark does not try to fit tasks into an executor that require a different ResourceProfile than the executor was created with. Executors that are not in use will idle timeout with the dynamic allocation logic. The default configuration for this feature is to only allow one ResourceProfile per stage. If the user associates more then 1 ResourceProfile to an RDD, Spark will throw an exception by default. See config `spark.scheduler.resource.profileMergeConflicts` to control that behavior. The current merge strategy Spark implements when `spark.scheduler.resource.profileMergeConflicts` is enabled is a simple max of each resource within the conflicting ResourceProfiles. Spark will create a new ResourceProfile with the max of each of the resources. + +# Push-based shuffle overview + +Push based shuffle helps improve the reliability and performance of spark shuffle. It takes a best-effort approach to push the shuffle blocks generated by the map tasks to remote shuffle services to be merged per shuffle partition. 
Reduce tasks fetch a combination of merged shuffle partitions and original shuffle blocks as their input data, resulting in converting small random disk reads by shuffle services into large sequential reads. Possibility of better data locality for reduce tasks additionally helps minimize network IO. + + Currently push-based shuffle is only supported for Spark on YARN with external shuffle service. + +### Shuffle server side configuration options + + +Property NameDefaultMeaningSince Version + + spark.shuffle.push.server.mergedShuffleFileManagerImpl + + org.apache.spark.network.shuffle.ExternalBlockHandler$NoOpMergedShuffleFileManager Review comment: We would still have the issue of `$` in the config value. Let me file a PR to handle that. Looking at the other configs on the same page, it seems that a line break (`<br />`) is how long config values were wrapped earlier, so I will follow the same approach. Also, I tried changing the CSS for `td` to use `word-wrap`, which seems to work fine, but it also changes the other tables, which seems disruptive, so I won't go that route.
[GitHub] [spark] venkata91 commented on a change in pull request #33615: [SPARK-36374][SHUFFLE][DOC] Push-based shuffle high level user documentation
venkata91 commented on a change in pull request #33615: URL: https://github.com/apache/spark/pull/33615#discussion_r684896710 ## File path: docs/configuration.md ## @@ -3134,3 +3134,111 @@ The stage level scheduling feature allows users to specify task and executor res This is only available for the RDD API in Scala, Java, and Python. It is available on YARN and Kubernetes when dynamic allocation is enabled. See the [YARN](running-on-yarn.html#stage-level-scheduling-overview) page or [Kubernetes](running-on-kubernetes.html#stage-level-scheduling-overview) page for more implementation details. See the `RDD.withResources` and `ResourceProfileBuilder` API's for using this feature. The current implementation acquires new executors for each `ResourceProfile` created and currently has to be an exact match. Spark does not try to fit tasks into an executor that require a different ResourceProfile than the executor was created with. Executors that are not in use will idle timeout with the dynamic allocation logic. The default configuration for this feature is to only allow one ResourceProfile per stage. If the user associates more then 1 ResourceProfile to an RDD, Spark will throw an exception by default. See config `spark.scheduler.resource.profileMergeConflicts` to control that behavior. The current merge strategy Spark implements when `spark.scheduler.resource.profileMergeConflicts` is enabled is a simple max of each resource within the conflicting ResourceProfiles. Spark will create a new ResourceProfile with the max of each of the resources. + +# Push-based shuffle overview + +Push based shuffle helps improve the reliability and performance of spark shuffle. It takes a best-effort approach to push the shuffle blocks generated by the map tasks to remote shuffle services to be merged per shuffle partition. 
Reduce tasks fetch a combination of merged shuffle partitions and original shuffle blocks as their input data, resulting in converting small random disk reads by shuffle services into large sequential reads. Possibility of better data locality for reduce tasks additionally helps minimize network IO. + + Currently push-based shuffle is only supported for Spark on YARN with external shuffle service. + +### Shuffle server side configuration options + + +Property NameDefaultMeaningSince Version + + spark.shuffle.push.server.mergedShuffleFileManagerImpl + + org.apache.spark.network.shuffle.ExternalBlockHandler$NoOpMergedShuffleFileManager Review comment: Ok. I tried changing the CSS for `td` to use `word-wrap`, which works fine, but it also changes the other tables. We would still have the issue of `$` in the config value. Let me file another PR to handle that.
[GitHub] [spark] SparkQA commented on pull request #33681: [SPARK-36455][SS] Provide an example of complex session window via flatMapGroupsWithState
SparkQA commented on pull request #33681: URL: https://github.com/apache/spark/pull/33681#issuecomment-894941957 Kubernetes integration test status success URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46716/
[GitHub] [spark] AmplabJenkins commented on pull request #33681: [SPARK-36455][SS] Provide an example of complex session window via flatMapGroupsWithState
AmplabJenkins commented on pull request #33681: URL: https://github.com/apache/spark/pull/33681#issuecomment-894941971 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/46716/
[GitHub] [spark] SparkQA commented on pull request #33665: [SPARK-36428][SQL] the seconds parameter of make_timestamp should accept integer type
SparkQA commented on pull request #33665: URL: https://github.com/apache/spark/pull/33665#issuecomment-894936793 **[Test build #142206 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142206/testReport)** for PR 33665 at commit [`3496a4d`](https://github.com/apache/spark/commit/3496a4dfa53d3ab9cddf8c085c61d8b86757eda5).
[GitHub] [spark] SparkQA commented on pull request #33646: [SPARK-36388][SPARK-36386][PYTHON] Fix DataFrame groupby-rolling and groupby-expanding to follow pandas 1.3
SparkQA commented on pull request #33646: URL: https://github.com/apache/spark/pull/33646#issuecomment-894936794 **[Test build #142207 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142207/testReport)** for PR 33646 at commit [`a5c169a`](https://github.com/apache/spark/commit/a5c169a6dc3ca4ecdadae0beabac8565def7a4f8).
[GitHub] [spark] SparkQA commented on pull request #33682: [WIP][SPARK-36456][CORE][SQL] Clean up compilation warnings related to `method closeQuietly in class IOUtils is deprecated`
SparkQA commented on pull request #33682: URL: https://github.com/apache/spark/pull/33682#issuecomment-894936706 **[Test build #142205 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142205/testReport)** for PR 33682 at commit [`6bd69d0`](https://github.com/apache/spark/commit/6bd69d05a5d298ba664ef8a46fe98e4b6e6736f8).
[GitHub] [spark] AmplabJenkins commented on pull request #33672: [SPARK-35320][SQL] Align error message for unsupported key types in MapType in Json reader
AmplabJenkins commented on pull request #33672: URL: https://github.com/apache/spark/pull/33672#issuecomment-894935085 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/46715/
[GitHub] [spark] LuciferYang commented on pull request #30483: [SPARK-33449][SQL] Add File Metadata cache support for Parquet and Orc
LuciferYang commented on pull request #30483: URL: https://github.com/apache/spark/pull/30483#issuecomment-894933284 > Hi, @LuciferYang . Are you still interested in this? Yes, I'm still interested in it. I'll try to update it to master first. However, since `ParquetFileReader` no longer has a non-deprecated constructor that supports passing in a footer, we need to make a decision together.
[GitHub] [spark] SparkQA removed a comment on pull request #33681: [SPARK-36455][SS] Provide an example of complex session window via flatMapGroupsWithState
SparkQA removed a comment on pull request #33681: URL: https://github.com/apache/spark/pull/33681#issuecomment-894922082 **[Test build #142204 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142204/testReport)** for PR 33681 at commit [`8e7db98`](https://github.com/apache/spark/commit/8e7db98d43e4921293211c3cb13718b41712e4b2).
[GitHub] [spark] AngersZhuuuu commented on a change in pull request #33661: [SPARK-36431][SQL] Support comparison of ANSI intervals with different fields
AngersZhuuuu commented on a change in pull request #33661: URL: https://github.com/apache/spark/pull/33661#discussion_r684888789 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala ## @@ -840,10 +840,17 @@ abstract class BinaryComparison extends BinaryOperator with Predicate { final override val nodePatterns: Seq[TreePattern] = Seq(BINARY_COMPARISON) - override def checkInputDataTypes(): TypeCheckResult = super.checkInputDataTypes() match { -case TypeCheckResult.TypeCheckSuccess => - TypeUtils.checkForOrderingExpr(left.dataType, this.getClass.getSimpleName) -case failure => failure + override def checkInputDataTypes(): TypeCheckResult = { +val matched = (left.dataType, right.dataType) match { + case (l: DayTimeIntervalType, r: DayTimeIntervalType) => TypeCheckResult.TypeCheckSuccess Review comment: > It's a bit weird that we allow different types in binary comparison. Can we fix the type coercion instead? e.g. `TypeCoercion.findTightestCommonType`. This is also more general, we can also support `coalesce(interval1, interval2)` If this is not changed here, `checkInputDataTypes` will fail and the expression will not be resolved for the UT below. ``` checkEvaluation(EqualTo( Literal.create(10, YearMonthIntervalType(YearMonthIntervalType.YEAR, YearMonthIntervalType.MONTH)), Literal.create(10, YearMonthIntervalType(YearMonthIntervalType.MONTH, YearMonthIntervalType.MONTH))), true) ``` Is this reasonable, or should I just remove the UT?
[GitHub] [spark] SparkQA commented on pull request #33681: [SPARK-36455][SS] Provide an example of complex session window via flatMapGroupsWithState
SparkQA commented on pull request #33681: URL: https://github.com/apache/spark/pull/33681#issuecomment-894931813 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46716/
[GitHub] [spark] SparkQA commented on pull request #33672: [SPARK-35320][SQL] Align error message for unsupported key types in MapType in Json reader
SparkQA commented on pull request #33672: URL: https://github.com/apache/spark/pull/33672#issuecomment-894931586 Kubernetes integration test status success URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46715/
[GitHub] [spark] LuciferYang opened a new pull request #33682: [SPARK-36456][CORE][SQL] Clean up compilation warnings related to `method closeQuietly in class IOUtils is deprecated`
LuciferYang opened a new pull request #33682: URL: https://github.com/apache/spark/pull/33682 ### What changes were proposed in this pull request? There are some compilation warnings related to `method closeQuietly in class IOUtils is deprecated`. This PR introduces a new method named `closeQuietly` in `org.apache.spark.util.Utils`, modeled on the one in `org.apache.commons.io.IOUtils`, and uses it to replace the deprecated usages of `IOUtils.closeQuietly`. ### Why are the changes needed? Clean up compilation warnings related to `method closeQuietly in class IOUtils is deprecated` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass the Jenkins or GitHub Action
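The PR description above refers to a `closeQuietly` helper. A minimal sketch of what such a helper might look like follows, assuming it closes a `java.io.Closeable` and swallows any `IOException`. The object name `CloseQuietlySketch` is hypothetical, and this is not the code from the PR, whose actual signature in `org.apache.spark.util.Utils` is not shown in this message.

```scala
import java.io.{Closeable, IOException}

// Hypothetical container object; not Spark's real Utils.
object CloseQuietlySketch {
  /** Closes the resource if it is non-null, ignoring any IOException thrown by close(). */
  def closeQuietly(closeable: Closeable): Unit = {
    if (closeable != null) {
      try {
        closeable.close()
      } catch {
        case _: IOException => // intentionally ignored: best-effort cleanup
      }
    }
  }
}
```

A helper like this is typically called from a `finally` block, where a failure to close should not mask an earlier exception.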
[GitHub] [spark] viirya commented on a change in pull request #33680: [SPARK-36454][SQL] Not push down partition filter to ORCScan for DSv2
viirya commented on a change in pull request #33680: URL: https://github.com/apache/spark/pull/33680#discussion_r684885485 ## File path: sql/core/src/test/scala/org/apache/spark/sql/ExplainSuite.scala ## @@ -460,7 +460,7 @@ class ExplainSuite extends ExplainSuiteHelper with DisableAdaptiveExecutionSuite "parquet" -> "|PushedFilters: \\[IsNotNull\\(value\\), GreaterThan\\(value,2\\)\\]", "orc" -> -"|PushedFilters: \\[.*\\(id\\), .*\\(value\\), .*\\(id,1\\), .*\\(value,2\\)\\]", +"|PushedFilters: \\[IsNotNull\\(value\\), GreaterThan\\(value,2\\)\\]", Review comment: Oh, I see. #30652 also only updated this.
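To make the effect of the changed expectation concrete, the two ORC patterns from the diff can be tried against sample lines. This is a sketch under assumptions: the sample explain lines are invented for illustration, the object and method names are hypothetical, and the leading `|` from the diff is omitted because its role in the suite is not shown in this excerpt.

```scala
import scala.util.matching.Regex

// Hypothetical helper for comparing the old and new ORC expectations.
object PushedFiltersRegexSketch {
  // Old expectation: four pushed filters, two on `id` and two on `value`.
  val oldOrc: Regex =
    "PushedFilters: \\[.*\\(id\\), .*\\(value\\), .*\\(id,1\\), .*\\(value,2\\)\\]".r

  // New expectation: only the filters on `value` are pushed.
  val newOrc: Regex =
    "PushedFilters: \\[IsNotNull\\(value\\), GreaterThan\\(value,2\\)\\]".r

  /** True when the pattern occurs anywhere in the given line. */
  def matches(pattern: Regex, line: String): Boolean =
    pattern.findFirstIn(line).isDefined

  // Invented sample lines, not captured from a real run.
  val withIdFilters: String =
    "PushedFilters: [IsNotNull(id), IsNotNull(value), GreaterThan(id,1), GreaterThan(value,2)]"
  val valueOnly: String =
    "PushedFilters: [IsNotNull(value), GreaterThan(value,2)]"
}
```

With these samples, `oldOrc` matches only `withIdFilters` and `newOrc` matches only `valueOnly`, which is consistent with the PR title: filters on `id` (presumably the partition column) no longer appear among the pushed filters.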
[GitHub] [spark] venkata91 commented on a change in pull request #33615: [SPARK-36374][SHUFFLE][DOC] Push-based shuffle high level user documentation
venkata91 commented on a change in pull request #33615: URL: https://github.com/apache/spark/pull/33615#discussion_r684885417 ## File path: docs/configuration.md ## @@ -3134,3 +3134,111 @@ The stage level scheduling feature allows users to specify task and executor res This is only available for the RDD API in Scala, Java, and Python. It is available on YARN and Kubernetes when dynamic allocation is enabled. See the [YARN](running-on-yarn.html#stage-level-scheduling-overview) page or [Kubernetes](running-on-kubernetes.html#stage-level-scheduling-overview) page for more implementation details. See the `RDD.withResources` and `ResourceProfileBuilder` API's for using this feature. The current implementation acquires new executors for each `ResourceProfile` created and currently has to be an exact match. Spark does not try to fit tasks into an executor that require a different ResourceProfile than the executor was created with. Executors that are not in use will idle timeout with the dynamic allocation logic. The default configuration for this feature is to only allow one ResourceProfile per stage. If the user associates more then 1 ResourceProfile to an RDD, Spark will throw an exception by default. See config `spark.scheduler.resource.profileMergeConflicts` to control that behavior. The current merge strategy Spark implements when `spark.scheduler.resource.profileMergeConflicts` is enabled is a simple max of each resource within the conflicting ResourceProfiles. Spark will create a new ResourceProfile with the max of each of the resources. + +# Push-based shuffle overview + +Push based shuffle helps improve the reliability and performance of spark shuffle. It takes a best-effort approach to push the shuffle blocks generated by the map tasks to remote shuffle services to be merged per shuffle partition. Reduce tasks fetch a combination of merged shuffle partitions and original shuffle blocks as their input data, resulting in converting small random disk reads by shuffle services into large sequential reads. Possibility of better data locality for reduce tasks additionally helps minimize network IO. + + Currently push-based shuffle is only supported for Spark on YARN with external shuffle service. + +### Shuffle server side configuration options + + +Property Name | Default | Meaning | Since Version + + spark.shuffle.push.server.mergedShuffleFileManagerImpl + + org.apache.spark.network.shuffle.ExternalBlockHandler$NoOpMergedShuffleFileManager Review comment: @mridulm Yeah, I haven't tried that yet. But the config key names are still quite long, so that would not make the readability issue go away. Shouldn't this be handled at the CSS layer? Thoughts?
[GitHub] [spark] WeichenXu123 commented on pull request #33652: [SPARK-36425] [PYSPARK][ML] Support CrossValidatorModel get standard deviation of metrics for each paramMap
WeichenXu123 commented on pull request #33652: URL: https://github.com/apache/spark/pull/33652#issuecomment-894927990 Thanks @HyukjinKwon !
[GitHub] [spark] AmplabJenkins commented on pull request #33681: [SPARK-36455][SS] Provide an example of complex session window via flatMapGroupsWithState
AmplabJenkins commented on pull request #33681: URL: https://github.com/apache/spark/pull/33681#issuecomment-894923739 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/142204/
[GitHub] [spark] SparkQA commented on pull request #33681: [SPARK-36455][SS] Provide an example of complex session window via flatMapGroupsWithState
SparkQA commented on pull request #33681: URL: https://github.com/apache/spark/pull/33681#issuecomment-894923723 **[Test build #142204 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142204/testReport)** for PR 33681 at commit [`8e7db98`](https://github.com/apache/spark/commit/8e7db98d43e4921293211c3cb13718b41712e4b2). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #33672: [SPARK-35320][SQL] Align error message for unsupported key types in MapType in Json reader
SparkQA commented on pull request #33672: URL: https://github.com/apache/spark/pull/33672#issuecomment-894922050 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46715/
[GitHub] [spark] SparkQA commented on pull request #33681: [SPARK-36455][SS] Provide an example of complex session window via flatMapGroupsWithState
SparkQA commented on pull request #33681: URL: https://github.com/apache/spark/pull/33681#issuecomment-894922082 **[Test build #142204 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142204/testReport)** for PR 33681 at commit [`8e7db98`](https://github.com/apache/spark/commit/8e7db98d43e4921293211c3cb13718b41712e4b2).
[GitHub] [spark] AmplabJenkins commented on pull request #33634: [SPARK-36369][PYTHON] Fix Index.union to follow pandas 1.3
AmplabJenkins commented on pull request #33634: URL: https://github.com/apache/spark/pull/33634#issuecomment-894920862 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/46714/
[GitHub] [spark] HeartSaVioR commented on pull request #33681: [SPARK-36455][SS] Provide an example of complex session window via flatMapGroupsWithState
HeartSaVioR commented on pull request #33681: URL: https://github.com/apache/spark/pull/33681#issuecomment-894919228 I probably need to convert the Scala example to a Java one as well. Marking this as draft for now.
[GitHub] [spark] HeartSaVioR opened a new pull request #33681: [SPARK-36455][SS] Provide an example of complex session window via flatMapGroupsWithState
HeartSaVioR opened a new pull request #33681: URL: https://github.com/apache/spark/pull/33681 ### What changes were proposed in this pull request? This PR proposes to add a new example of complex sessionization, which leverages flatMapGroupsWithState. ### Why are the changes needed? We have replaced the sessionization example based on flatMapGroupsWithState with one that uses native session window support. Given that there are still sessionization use cases which native session window support cannot cover, it would be nice to demonstrate such a case. It will also serve as an example of flatMapGroupsWithState. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually tested. Example data is given in class doc.
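As background on what "sessionization" means here, the basic gap-based idea can be sketched in plain Scala. This is not the example added by the PR and it uses no Spark API: `SessionizeSketch`, `Session`, and `sessionize` are hypothetical names, and the rule used (a new session starts when the gap to the previous event exceeds `gapMs`) is the simplest variant, chosen only for illustration.

```scala
// Hypothetical, Spark-free illustration of gap-based sessionization.
object SessionizeSketch {
  /** A session covering events from `start` to `end` (inclusive), with `count` events. */
  final case class Session(start: Long, end: Long, count: Int)

  /**
   * Groups event timestamps (in milliseconds) into sessions. An event joins the
   * current session when it is at most `gapMs` after that session's last event;
   * otherwise it opens a new session.
   */
  def sessionize(timestamps: Seq[Long], gapMs: Long): List[Session] =
    timestamps.sorted.foldLeft(List.empty[Session]) {
      case (current :: done, ts) if ts - current.end <= gapMs =>
        // Extend the most recent session (kept at the head of the list).
        current.copy(end = ts, count = current.count + 1) :: done
      case (sessions, ts) =>
        // Gap too large, or no session yet: start a new one.
        Session(ts, ts, 1) :: sessions
    }.reverse
}
```

For instance, `SessionizeSketch.sessionize(Seq(0L, 5L, 30L, 32L, 100L), 10L)` yields three sessions: `Session(0, 5, 2)`, `Session(30, 32, 2)`, and `Session(100, 100, 1)`. A stateful streaming version has to keep the open session per key between batches, which is the kind of state handling flatMapGroupsWithState is used for.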
[GitHub] [spark] SparkQA commented on pull request #33634: [SPARK-36369][PYTHON] Fix Index.union to follow pandas 1.3
SparkQA commented on pull request #33634: URL: https://github.com/apache/spark/pull/33634#issuecomment-894918438 Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46714/
[GitHub] [spark] wangyum commented on a change in pull request #33603: [SPARK-36376][SQL] Collapse repartitions if there is a project between them
wangyum commented on a change in pull request #33603: URL: https://github.com/apache/spark/pull/33603#discussion_r684877019 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ## @@ -913,10 +913,17 @@ object CollapseRepartition extends Rule[LogicalPlan] { case (false, true) => if (r.numPartitions >= child.numPartitions) child else r case _ => r.copy(child = child.child) } +case r @ Repartition(_, _, p @ Project(_, child: RepartitionOperation)) => Review comment: Not every `RepartitionOperation` can be removed. Sometimes a repartition is added before joining and filtering to increase parallelism, especially before `BroadcastNestedLoopJoin`.
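The shape of the new pattern can be illustrated with a toy plan model. Everything below is an assumption made for illustration: `Plan`, `Relation`, `Project`, `Filter`, and `Repartition` are simplified stand-ins rather than Catalyst classes, and the right-hand side of the real rule is not shown in the excerpt, so the rewrite here (drop the inner repartition) is only one plausible reading.

```scala
// Toy logical-plan model; names mirror Catalyst loosely but are hypothetical.
object CollapseSketch {
  sealed trait Plan
  final case class Relation(name: String) extends Plan
  final case class Project(columns: Seq[String], child: Plan) extends Plan
  final case class Filter(condition: String, child: Plan) extends Plan
  final case class Repartition(numPartitions: Int, child: Plan) extends Plan

  /** Removes an inner Repartition separated from an outer one only by a Project. */
  def collapse(plan: Plan): Plan = plan match {
    case Repartition(n, Project(cols, Repartition(_, grandChild))) =>
      // The outer repartition decides the final layout, so the inner one is dropped;
      // recurse on the rewritten node in case another repartition is now exposed.
      collapse(Repartition(n, Project(cols, grandChild)))
    case Repartition(n, child) => Repartition(n, collapse(child))
    case Project(cols, child)  => Project(cols, collapse(child))
    case Filter(cond, child)   => Filter(cond, collapse(child))
    case leaf: Relation        => leaf
  }
}
```

In this toy, `Repartition(10, Project(Seq("a"), Repartition(5, Relation("t"))))` collapses to `Repartition(10, Project(Seq("a"), Relation("t")))`, while a repartition sitting under a `Filter` is left alone. That second case is the kind of intermediate repartition the review comment warns about: it may have been added on purpose to raise parallelism for the operator above it.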
[GitHub] [spark] ekoifman commented on pull request #33641: [SPARK-36416][SQL] Add SQL metrics to AdaptiveSparkPlanExec for BHJs and Skew joins
ekoifman commented on pull request #33641: URL: https://github.com/apache/spark/pull/33641#issuecomment-894917212 @cloud-fan could you take a look when you have a chance?
[GitHub] [spark] HyukjinKwon commented on a change in pull request #33679: [SPARK-36452][SQL]: Add the support in Spark for having group by map datatype column for the scenario that works in Hive
HyukjinKwon commented on a change in pull request #33679: URL: https://github.com/apache/spark/pull/33679#discussion_r684872927 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ordering.scala ## @@ -97,13 +97,18 @@ object InterpretedOrdering { object RowOrdering extends CodeGeneratorWithInterpretedFallback[Seq[SortOrder], BaseOrdering] { /** - * Returns true iff the data type can be ordered (i.e. can be sorted). + * Returns true if the data type can be ordered (i.e. can be sorted). */ - def isOrderable(dataType: DataType): Boolean = dataType match { + def isOrderable(dataType: DataType, Review comment: Should we fix https://github.com/apache/spark/pull/31967 first?
[GitHub] [spark] HyukjinKwon commented on a change in pull request #33679: [SPARK-36452][SQL]: Add the support in Spark for having group by map datatype column for the scenario that works in Hive
HyukjinKwon commented on a change in pull request #33679: URL: https://github.com/apache/spark/pull/33679#discussion_r684872676 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ordering.scala ## @@ -97,13 +97,18 @@ object InterpretedOrdering { object RowOrdering extends CodeGeneratorWithInterpretedFallback[Seq[SortOrder], BaseOrdering] { /** - * Returns true iff the data type can be ordered (i.e. can be sorted). + * Returns true if the data type can be ordered (i.e. can be sorted). Review comment: iff is an abbreviation of if and only if
[GitHub] [spark] AmplabJenkins commented on pull request #33646: [SPARK-36388][SPARK-36386][PYTHON] Fix DataFrame groupby-rolling and groupby-expanding to follow pandas 1.3
AmplabJenkins commented on pull request #33646: URL: https://github.com/apache/spark/pull/33646#issuecomment-894914281 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/46713/
[GitHub] [spark] SparkQA commented on pull request #33646: [SPARK-36388][SPARK-36386][PYTHON] Fix DataFrame groupby-rolling and groupby-expanding to follow pandas 1.3
SparkQA commented on pull request #33646: URL: https://github.com/apache/spark/pull/33646#issuecomment-894914275 Kubernetes integration test status success URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46713/
[GitHub] [spark] HyukjinKwon closed pull request #33634: [SPARK-36369][PYTHON] Fix Index.union to follow pandas 1.3
HyukjinKwon closed pull request #33634: URL: https://github.com/apache/spark/pull/33634
[GitHub] [spark] HyukjinKwon edited a comment on pull request #33634: [SPARK-36369][PYTHON] Fix Index.union to follow pandas 1.3
HyukjinKwon edited a comment on pull request #33634: URL: https://github.com/apache/spark/pull/33634#issuecomment-894909826 Merged to master.
[GitHub] [spark] HyukjinKwon commented on pull request #33634: [SPARK-36369][PYTHON] Fix Index.union to follow pandas 1.3
HyukjinKwon commented on pull request #33634: URL: https://github.com/apache/spark/pull/33634#issuecomment-894909826 Merged to master and branch-3.2.
[GitHub] [spark] HyukjinKwon commented on pull request #33646: [SPARK-36388][SPARK-36386][PYTHON] Fix DataFrame groupby-rolling and groupby-expanding to follow pandas 1.3
HyukjinKwon commented on pull request #33646: URL: https://github.com/apache/spark/pull/33646#issuecomment-894909570 @itholic can you check the test failures? https://github.com/itholic/spark/runs/3276195042?check_suite_focus=true
[GitHub] [spark] SparkQA commented on pull request #33672: [SPARK-35320][SQL] Align error message for unsupported key types in MapType in Json reader
SparkQA commented on pull request #33672: URL: https://github.com/apache/spark/pull/33672#issuecomment-894907782 **[Test build #142203 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142203/testReport)** for PR 33672 at commit [`cc0e8c8`](https://github.com/apache/spark/commit/cc0e8c84ef657af188536fda6c8663f8abdb923b).
[GitHub] [spark] AmplabJenkins commented on pull request #33646: [SPARK-36388][SPARK-36386][PYTHON] Fix DataFrame groupby-rolling and groupby-expanding to follow pandas 1.3
AmplabJenkins commented on pull request #33646: URL: https://github.com/apache/spark/pull/33646#issuecomment-894907498 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/142201/
[GitHub] [spark] AmplabJenkins commented on pull request #33634: [SPARK-36369][PYTHON] Fix Index.union to follow pandas 1.3
AmplabJenkins commented on pull request #33634: URL: https://github.com/apache/spark/pull/33634#issuecomment-894907497 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/142202/
[GitHub] [spark] SparkQA commented on pull request #33634: [SPARK-36369][PYTHON] Fix Index.union to follow pandas 1.3
SparkQA commented on pull request #33634: URL: https://github.com/apache/spark/pull/33634#issuecomment-894906084 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46714/
[GitHub] [spark] SparkQA removed a comment on pull request #33646: [SPARK-36388][SPARK-36386][PYTHON] Fix DataFrame groupby-rolling and groupby-expanding to follow pandas 1.3
SparkQA removed a comment on pull request #33646: URL: https://github.com/apache/spark/pull/33646#issuecomment-894892998 **[Test build #142201 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142201/testReport)** for PR 33646 at commit [`11ea4a2`](https://github.com/apache/spark/commit/11ea4a241e5ab5c6f9ecdcbc4d6eb041712e6be7).
[GitHub] [spark] SparkQA removed a comment on pull request #33634: [SPARK-36369][PYTHON] Fix Index.union to follow pandas 1.3
SparkQA removed a comment on pull request #33634: URL: https://github.com/apache/spark/pull/33634#issuecomment-894894751 **[Test build #142202 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142202/testReport)** for PR 33634 at commit [`dc8f0e8`](https://github.com/apache/spark/commit/dc8f0e8719e1ae522cf0b6ecc03e913b105bfa20).
[GitHub] [spark] huaxingao commented on pull request #33680: [SPARK-36454][SQL] Not push down partition filter to ORCScan for DSv2
huaxingao commented on pull request #33680: URL: https://github.com/apache/spark/pull/33680#issuecomment-894905042 > is it possible to add a test? @viirya Thanks for taking a look. I didn't add a new test because we already have a partition pruning test with both partition filters and data filters here: https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala#L734 For the pushed-down filters displayed in explain, I modified the expected result in `ExplainSuite`. Any suggestions for new tests to add?
[GitHub] [spark] SparkQA commented on pull request #33646: [SPARK-36388][SPARK-36386][PYTHON] Fix DataFrame groupby-rolling and groupby-expanding to follow pandas 1.3
SparkQA commented on pull request #33646: URL: https://github.com/apache/spark/pull/33646#issuecomment-894904735 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46713/
[GitHub] [spark] HyukjinKwon commented on a change in pull request #33672: [SPARK-35320][SQL] Align error message for unsupported key types in MapType in Json reader
HyukjinKwon commented on a change in pull request #33672: URL: https://github.com/apache/spark/pull/33672#discussion_r684865386 ## File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ## @@ -2943,6 +2943,34 @@ class DataFrameSuite extends QueryTest .withSequenceColumn("default_index").collect().map(_.getLong(0)) assert(ids.toSet === Range(0, 10).toSet) } + + test("SPARK-35320 DataFrame read in Json format should fail if the schema provided " + +"by the user contains a MapType with a key type different of StringType") { + +Seq((MapType(IntegerType, StringType), """{"1": "test"}"""), + (StructType(Seq(StructField("test", MapType(IntegerType, StringType)))), +""""test": {"1": "test"}"""), + (ArrayType(MapType(IntegerType, StringType)), """[{"1": "test"}]"""), + (MapType(StringType, MapType(IntegerType, StringType)), """{"key": {"1" : "test"}}""") +).foreach { case (schema, jsonData) => + withTempDir { dir => +val colName = "col" +val msg = "can only contain StringType as a key type for a MapType" + +val thrown1 = intercept[AnalysisException] ( + spark.read.schema(StructType(Seq(StructField(colName, schema)))) +.json(Seq(jsonData).toDS()).collect()) +assert(thrown1.getMessage contains msg) + +val jsonDir = new File(dir, "json").getCanonicalPath +Seq(jsonData).toDF(colName).write.json(jsonDir) +val thrown2 = intercept[AnalysisException] ( + spark.read.schema(StructType(Seq(StructField(colName, schema)))) +.json(jsonDir).collect()) +assert(thrown2.getMessage contains msg) Review comment: Can we call it with explicit `.`? See also https://github.com/databricks/scala-style-guide#infix
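The difference the reviewer points at is purely syntactic, as the following small sketch shows. `InfixVsDotSketch` and its two method names are hypothetical and exist only to put the two call styles side by side.

```scala
// Hypothetical side-by-side of the two call styles for the same method.
object InfixVsDotSketch {
  /** Infix style, as written in the diff: `thrown1.getMessage contains msg`. */
  def containsInfix(message: String, fragment: String): Boolean =
    message contains fragment

  /** Explicit dot style, as requested in the review: `thrown1.getMessage.contains(msg)`. */
  def containsDot(message: String, fragment: String): Boolean =
    message.contains(fragment)
}
```

Both forms compile to the same call; the dot form simply makes the receiver and the argument easier to tell apart, which matters most inside longer expressions such as `assert(...)`.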
[GitHub] [spark] HyukjinKwon commented on a change in pull request #33672: [SPARK-35320][SQL] Align error message for unsupported key types in MapType in Json reader
HyukjinKwon commented on a change in pull request #33672: URL: https://github.com/apache/spark/pull/33672#discussion_r684865288 ## File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ## @@ -2943,6 +2943,34 @@ class DataFrameSuite extends QueryTest .withSequenceColumn("default_index").collect().map(_.getLong(0)) assert(ids.toSet === Range(0, 10).toSet) } + + test("SPARK-35320 DataFrame read in Json format should fail if the schema provided " + +"by the user contains a MapType with a key type different of StringType") { + +Seq((MapType(IntegerType, StringType), """{"1": "test"}"""), + (StructType(Seq(StructField("test", MapType(IntegerType, StringType)))), +""""test": {"1": "test"}"""), + (ArrayType(MapType(IntegerType, StringType)), """[{"1": "test"}]"""), + (MapType(StringType, MapType(IntegerType, StringType)), """{"key": {"1" : "test"}}""") +).foreach { case (schema, jsonData) => + withTempDir { dir => +val colName = "col" +val msg = "can only contain StringType as a key type for a MapType" + +val thrown1 = intercept[AnalysisException] ( Review comment: ```suggestion val thrown1 = intercept[AnalysisException]( ``` ## File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ## @@ -2943,6 +2943,34 @@ class DataFrameSuite extends QueryTest .withSequenceColumn("default_index").collect().map(_.getLong(0)) assert(ids.toSet === Range(0, 10).toSet) } + + test("SPARK-35320 DataFrame read in Json format should fail if the schema provided " + +"by the user contains a MapType with a key type different of StringType") { + +Seq((MapType(IntegerType, StringType), """{"1": "test"}"""), + (StructType(Seq(StructField("test", MapType(IntegerType, StringType)))), +""""test": {"1": "test"}"""), + (ArrayType(MapType(IntegerType, StringType)), """[{"1": "test"}]"""), + (MapType(StringType, MapType(IntegerType, StringType)), """{"key": {"1" : "test"}}""") +).foreach { case (schema, jsonData) => + withTempDir { dir => +val colName = "col" +val msg = "can only contain StringType as a key type for a MapType" + +val thrown1 = intercept[AnalysisException] ( + spark.read.schema(StructType(Seq(StructField(colName, schema)))) +.json(Seq(jsonData).toDS()).collect()) +assert(thrown1.getMessage contains msg) + +val jsonDir = new File(dir, "json").getCanonicalPath +Seq(jsonData).toDF(colName).write.json(jsonDir) +val thrown2 = intercept[AnalysisException] ( Review comment: ```suggestion val thrown2 = intercept[AnalysisException]( ```
[GitHub] [spark] HyukjinKwon commented on a change in pull request #33672: [SPARK-35320][SQL] Align error message for unsupported key types in MapType in Json reader
HyukjinKwon commented on a change in pull request #33672:
URL: https://github.com/apache/spark/pull/33672#discussion_r684865267

## File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ##
@@ -2943,6 +2943,34 @@ class DataFrameSuite extends QueryTest
 .withSequenceColumn("default_index").collect().map(_.getLong(0))
 assert(ids.toSet === Range(0, 10).toSet)
 }
+
+ test("SPARK-35320 DataFrame read in Json format should fail if the schema provided " +
+"by the user contains a MapType with a key type different of StringType") {
+
+Seq((MapType(IntegerType, StringType), """{"1": "test"}"""),
+ (StructType(Seq(StructField("test", MapType(IntegerType, StringType,
+test": {"1": "test"}"""),
+ (ArrayType(MapType(IntegerType, StringType)), """[{"1": "test"}]"""),
+ (MapType(StringType, MapType(IntegerType, StringType)), """{"key": {"1" : "test"}}""")

Review comment:
```suggestion
Seq(
  (MapType(IntegerType, StringType), """{"1": "test"}"""),
  (StructType(Seq(StructField("test", MapType(IntegerType, StringType,
test": {"1": "test"}"""),
  (ArrayType(MapType(IntegerType, StringType)), """[{"1": "test"}]"""),
  (MapType(StringType, MapType(IntegerType, StringType)), """{"key": {"1" : "test"}}""")
```
[GitHub] [spark] HyukjinKwon commented on a change in pull request #33672: [SPARK-35320][SQL] Align error message for unsupported key types in MapType in Json reader
HyukjinKwon commented on a change in pull request #33672:
URL: https://github.com/apache/spark/pull/33672#discussion_r684865162

## File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala ##
@@ -2943,6 +2943,34 @@ class DataFrameSuite extends QueryTest
 .withSequenceColumn("default_index").collect().map(_.getLong(0))
 assert(ids.toSet === Range(0, 10).toSet)
 }
+
+ test("SPARK-35320 DataFrame read in Json format should fail if the schema provided " +

Review comment:
Could we make the test title simpler? e.g., "Reading JSON with a non-string key in a map should fail"
[GitHub] [spark] HyukjinKwon commented on a change in pull request #33672: [SPARK-35320][SQL] Align error message for unsupported key types in MapType in Json reader
HyukjinKwon commented on a change in pull request #33672:
URL: https://github.com/apache/spark/pull/33672#discussion_r684865045

## File path: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ##
@@ -402,7 +402,11 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
 * @since 2.0.0
 */
 @scala.annotation.varargs
- def json(paths: String*): DataFrame = format("json").load(paths : _*)
+ def json(paths: String*): DataFrame = {
+userSpecifiedSchema.foreach(
+ ExprUtils.checkJsonSchema(_).foreach(e => throw new AnalysisException(e)))

Review comment:
I think we would have to throw an exception via `QueryCompilationErrors`, cc @karenfeng
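To illustrate the review comment above, here is a minimal, self-contained sketch of routing the exception through a central error factory instead of constructing it inline. Everything in it is a hypothetical stand-in: `AnalysisException` is a toy class rather than Spark's, `CompilationErrors.invalidJsonSchemaError` is an invented name rather than an actual `QueryCompilationErrors` method, and `JsonReaderSketch.json` only models the shape of the call site.

```scala
// Toy stand-in for Spark's AnalysisException (assumption: NOT the real class).
final class AnalysisException(message: String) extends Exception(message)

// Hypothetical error factory in the spirit of QueryCompilationErrors:
// call sites ask for a named error instead of building exceptions inline.
object CompilationErrors {
  def invalidJsonSchemaError(reason: String): Throwable =
    new AnalysisException(reason)
}

object JsonReaderSketch {
  // `schemaProblem` models the Option returned by a schema check:
  // Some(reason) when the schema is rejected, None when it is fine.
  def json(schemaProblem: Option[String], paths: String*): List[String] = {
    schemaProblem.foreach { reason =>
      throw CompilationErrors.invalidJsonSchemaError(reason)
    }
    paths.toList // stand-in for actually loading the files
  }
}
```

With this shape, the wording and exception type of the error live in one place, so the reader path and the expression path could share them.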
[GitHub] [spark] SparkQA commented on pull request #33634: [SPARK-36369][PYTHON] Fix Index.union to follow pandas 1.3
SparkQA commented on pull request #33634:
URL: https://github.com/apache/spark/pull/33634#issuecomment-894903536

**[Test build #142202 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142202/testReport)** for PR 33634 at commit [`dc8f0e8`](https://github.com/apache/spark/commit/dc8f0e8719e1ae522cf0b6ecc03e913b105bfa20).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `class BlockSavedOnDecommissionedBlockManagerException(blockId: BlockId)`
[GitHub] [spark] HyukjinKwon commented on a change in pull request #33672: [SPARK-35320][SQL] Align error message for unsupported key types in MapType in Json reader
HyukjinKwon commented on a change in pull request #33672:
URL: https://github.com/apache/spark/pull/33672#discussion_r684864831

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ##
@@ -561,15 +561,9 @@ case class JsonToStructs(
 override def checkInputDataTypes(): TypeCheckResult = nullableSchema match {
 case _: StructType | _: ArrayType | _: MapType =>
- val invalidMapType = nullableSchema.existsRecursively(dataType => dataType match {
-case MapType(keyType, _, _) if keyType != StringType => true
-case _ => false
- })
- if (invalidMapType) {
-TypeCheckResult.TypeCheckFailure(
- s"Input schema ${nullableSchema.catalogString} can only contain StringType " +
-"as a key type for a MapType.")
- } else {
+ ExprUtils.checkJsonSchema(nullableSchema).map{

Review comment:
```suggestion
ExprUtils.checkJsonSchema(nullableSchema).map {
```
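The hunk above moves the inline map-key check into a shared helper. As a rough, self-contained sketch of what such a helper could look like, the code below uses toy stand-ins for the type hierarchy, not Spark's real `org.apache.spark.sql.types` classes. The `Option[String]` return type and the message format are assumptions for illustration; the actual signature of `ExprUtils.checkJsonSchema` is not shown in this thread.

```scala
// Toy stand-ins for Spark's DataType hierarchy (assumption: simplified, no nullability flags).
sealed trait DataType
case object StringType extends DataType
case object IntegerType extends DataType
final case class ArrayType(elementType: DataType) extends DataType
final case class MapType(keyType: DataType, valueType: DataType) extends DataType
final case class StructField(name: String, dataType: DataType)
final case class StructType(fields: Seq[StructField]) extends DataType

object JsonSchemaCheck {
  // True if `f` holds for `dt` itself or for any type nested inside it.
  def existsRecursively(dt: DataType)(f: DataType => Boolean): Boolean =
    f(dt) || (dt match {
      case ArrayType(elem) => existsRecursively(elem)(f)
      case MapType(k, v) => existsRecursively(k)(f) || existsRecursively(v)(f)
      case StructType(fields) => fields.exists(fl => existsRecursively(fl.dataType)(f))
      case _ => false
    })

  // Hypothetical helper: Some(message) if any map in the schema has a
  // non-string key type, None if the schema is acceptable for JSON.
  def checkJsonSchema(schema: DataType): Option[String] = {
    val invalidMapType = existsRecursively(schema) {
      case MapType(keyType, _) => keyType != StringType
      case _ => false
    }
    if (invalidMapType) {
      Some(s"Input schema $schema can only contain StringType as a key type for a MapType.")
    } else {
      None
    }
  }
}
```

Returning an `Option` lets each caller decide how to surface the failure: an expression can `map` it to a type-check failure, while a reader can `foreach` over it and throw.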
[GitHub] [spark] HyukjinKwon commented on pull request #33672: [SPARK-35320][SQL] Align error message for unsupported key types in MapType in Json reader
HyukjinKwon commented on pull request #33672: URL: https://github.com/apache/spark/pull/33672#issuecomment-894903199 ok to test
[GitHub] [spark] HyukjinKwon commented on pull request #33673: [SPARK-36448][SQL] Exceptions in NoSuchItemException.scala have to be case classes
HyukjinKwon commented on pull request #33673: URL: https://github.com/apache/spark/pull/33673#issuecomment-894903074 @yeshengm mind fixing the test failures? BTW, why do we need to fix them if this doesn't cause any user-facing behaviour change?
[GitHub] [spark] SparkQA commented on pull request #33646: [SPARK-36388][SPARK-36386][PYTHON] Fix DataFrame groupby-rolling and groupby-expanding to follow pandas 1.3
SparkQA commented on pull request #33646:
URL: https://github.com/apache/spark/pull/33646#issuecomment-894901360

**[Test build #142201 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/142201/testReport)** for PR 33646 at commit [`11ea4a2`](https://github.com/apache/spark/commit/11ea4a241e5ab5c6f9ecdcbc4d6eb041712e6be7).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] [spark] viirya commented on pull request #33680: [SPARK-36454][SQL] Not push down partition filter to ORCScan for DSv2
viirya commented on pull request #33680: URL: https://github.com/apache/spark/pull/33680#issuecomment-894900342 Hmm, is it possible to add a test?