[jira] [Resolved] (SPARK-47234) Upgrade Scala to 2.13.13
[ https://issues.apache.org/jira/browse/SPARK-47234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-47234.
-----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 45342
[https://github.com/apache/spark/pull/45342]

> Upgrade Scala to 2.13.13
> ------------------------
>
>                 Key: SPARK-47234
>                 URL: https://issues.apache.org/jira/browse/SPARK-47234
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Build
>    Affects Versions: 4.0.0
>            Reporter: BingKun Pan
>            Assignee: BingKun Pan
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 4.0.0

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47406) Handle TIMESTAMP and DATETIME in MYSQLDialect
[ https://issues.apache.org/jira/browse/SPARK-47406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-47406:
-----------------------------------
    Labels: pull-request-available  (was: )

> Handle TIMESTAMP and DATETIME in MYSQLDialect
> ---------------------------------------------
>
>                 Key: SPARK-47406
>                 URL: https://issues.apache.org/jira/browse/SPARK-47406
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Kent Yao
>            Priority: Major
>              Labels: pull-request-available
[jira] [Created] (SPARK-47406) Handle TIMESTAMP and DATETIME in MYSQLDialect
Kent Yao created SPARK-47406:
--------------------------------

             Summary: Handle TIMESTAMP and DATETIME in MYSQLDialect
                 Key: SPARK-47406
                 URL: https://issues.apache.org/jira/browse/SPARK-47406
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Kent Yao
[jira] [Updated] (SPARK-47405) Remove `JLine 2` dependency
[ https://issues.apache.org/jira/browse/SPARK-47405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-47405:
-----------------------------------
    Labels: pull-request-available  (was: )

> Remove `JLine 2` dependency
> ---------------------------
>
>                 Key: SPARK-47405
>                 URL: https://issues.apache.org/jira/browse/SPARK-47405
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Build
>    Affects Versions: 4.0.0
>            Reporter: Dongjoon Hyun
>            Priority: Major
>              Labels: pull-request-available
[jira] [Created] (SPARK-47405) Remove `JLine 2` dependency
Dongjoon Hyun created SPARK-47405:
-------------------------------------

             Summary: Remove `JLine 2` dependency
                 Key: SPARK-47405
                 URL: https://issues.apache.org/jira/browse/SPARK-47405
             Project: Spark
          Issue Type: Sub-task
          Components: Build
    Affects Versions: 4.0.0
            Reporter: Dongjoon Hyun
[jira] [Resolved] (SPARK-47366) Implement parse_json
[ https://issues.apache.org/jira/browse/SPARK-47366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-47366.
---------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 45479
[https://github.com/apache/spark/pull/45479]

> Implement parse_json
> --------------------
>
>                 Key: SPARK-47366
>                 URL: https://issues.apache.org/jira/browse/SPARK-47366
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Chenhao Li
>            Assignee: Chenhao Li
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
[jira] [Assigned] (SPARK-47366) Implement parse_json
[ https://issues.apache.org/jira/browse/SPARK-47366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-47366:
-----------------------------------
    Assignee: Chenhao Li

> Implement parse_json
> --------------------
>
>                 Key: SPARK-47366
>                 URL: https://issues.apache.org/jira/browse/SPARK-47366
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Chenhao Li
>            Assignee: Chenhao Li
>            Priority: Major
>              Labels: pull-request-available
[jira] [Updated] (SPARK-47404) Add hooks to release the ANTLR DFA cache after parsing SQL
[ https://issues.apache.org/jira/browse/SPARK-47404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-47404:
-----------------------------------
    Labels: pull-request-available  (was: )

> Add hooks to release the ANTLR DFA cache after parsing SQL
> ----------------------------------------------------------
>
>                 Key: SPARK-47404
>                 URL: https://issues.apache.org/jira/browse/SPARK-47404
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Mark Jarvin
>            Priority: Major
>              Labels: pull-request-available
>
> ANTLR builds a DFA cache while parsing to speed up parsing of similar future
> inputs. However, this cache is never cleared and can only grow. Extremely
> large SQL inputs can lead to very large DFA caches (>20GiB in one extreme
> case I've seen).
> Spark's ANTLR SQL parser is derived from the Presto ANTLR SQL Parser, and
> Presto has added hooks to be able to clear this DFA cache. I think Spark
> should have similar hooks.
> References:
> * [https://github.com/antlr/antlr4/blob/f08a19bbb202b02a521f84d99e661e386bea8625/runtime/Java/src/org/antlr/v4/runtime/atn/ParserATNSimulator.java#L163-L171]
> * [https://stackoverflow.com/questions/28017135/why-antlr4-parsers-accumulates-atnconfig-objects?rq=2]
> * [https://github.com/antlr/antlr4/issues/499]
> * [https://github.com/trinodb/trino/pull/3186/files#diff-75b81ed5837578d1af42fcc91e4094a247138e5da6edb9d9e4b67d53247b8ca9]
[jira] [Created] (SPARK-47404) Add hooks to release the ANTLR DFA cache after parsing SQL
Mark Jarvin created SPARK-47404:
-----------------------------------

             Summary: Add hooks to release the ANTLR DFA cache after parsing SQL
                 Key: SPARK-47404
                 URL: https://issues.apache.org/jira/browse/SPARK-47404
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Mark Jarvin

ANTLR builds a DFA cache while parsing to speed up parsing of similar future inputs. However, this cache is never cleared and can only grow. Extremely large SQL inputs can lead to very large DFA caches (>20GiB in one extreme case I've seen).

Spark's ANTLR SQL parser is derived from the Presto ANTLR SQL Parser, and Presto has added hooks to be able to clear this DFA cache. I think Spark should have similar hooks.

References:
* [https://github.com/antlr/antlr4/blob/f08a19bbb202b02a521f84d99e661e386bea8625/runtime/Java/src/org/antlr/v4/runtime/atn/ParserATNSimulator.java#L163-L171]
* [https://stackoverflow.com/questions/28017135/why-antlr4-parsers-accumulates-atnconfig-objects?rq=2]
* [https://github.com/antlr/antlr4/issues/499]
* [https://github.com/trinodb/trino/pull/3186/files#diff-75b81ed5837578d1af42fcc91e4094a247138e5da6edb9d9e4b67d53247b8ca9]
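The "release hook" pattern the ticket proposes can be sketched as follows. This is a hedged toy illustration only: the object, method, and cache names below (`SketchParser`, `releaseCache`, `dfaCache`) are invented for the example and are not Spark's or ANTLR's actual API. It models a parser whose cache grows monotonically as a side effect of parsing, plus an explicit hook that lets the caller release the accumulated state after parsing a pathological input.

```scala
// Hypothetical sketch of the proposed hook (illustrative names, not Spark's API).
object SketchParser {
  // Stand-in for ANTLR's shared DFA cache: it grows with every distinct
  // input and is never trimmed automatically.
  private val dfaCache = scala.collection.mutable.Map.empty[String, Int]

  // "Parsing" warms the cache as a side effect (here the cached value is
  // just the input length, standing in for a DFA fragment).
  def parse(sql: String): Int = dfaCache.getOrElseUpdate(sql, sql.length)

  def cacheSize: Int = dfaCache.size

  // The proposed hook: release the accumulated cache on demand, e.g. after
  // parsing an extremely large SQL statement.
  def releaseCache(): Unit = dfaCache.clear()
}
```

A caller would invoke the hook after finishing a large parse, trading the warm-cache speedup of future parses for bounded memory, which mirrors the trade-off described in the references above.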
[jira] [Updated] (SPARK-45376) Add netty-tcnative-boringssl-static dependency
[ https://issues.apache.org/jira/browse/SPARK-45376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-45376:
----------------------------------
    Summary: Add netty-tcnative-boringssl-static dependency  (was: [CORE] Add netty-tcnative-boringssl-static dependency)

> Add netty-tcnative-boringssl-static dependency
> ----------------------------------------------
>
>                 Key: SPARK-45376
>                 URL: https://issues.apache.org/jira/browse/SPARK-45376
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core
>    Affects Versions: 4.0.0
>            Reporter: Hasnain Lakhani
>            Assignee: Hasnain Lakhani
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
> Add the boringssl dependency which is needed for SSL functionality to work,
> and provide the network common test helper to other test modules which need
> to test SSL functionality
[jira] [Closed] (SPARK-47342) Support TimestampNTZ for DB2 TIMESTAMP WITH TIME ZONE
[ https://issues.apache.org/jira/browse/SPARK-47342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun closed SPARK-47342.
---------------------------------

> Support TimestampNTZ for DB2 TIMESTAMP WITH TIME ZONE
> -----------------------------------------------------
>
>                 Key: SPARK-47342
>                 URL: https://issues.apache.org/jira/browse/SPARK-47342
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Kent Yao
>            Priority: Major
>              Labels: pull-request-available
[jira] (SPARK-47342) Support TimestampNTZ for DB2 TIMESTAMP WITH TIME ZONE
[ https://issues.apache.org/jira/browse/SPARK-47342 ]

Dongjoon Hyun deleted comment on SPARK-47342:
---------------------------------------------

was (Author: dongjoon):
Issue resolved by pull request 45471
[https://github.com/apache/spark/pull/45471]

> Support TimestampNTZ for DB2 TIMESTAMP WITH TIME ZONE
> -----------------------------------------------------
>
>                 Key: SPARK-47342
>                 URL: https://issues.apache.org/jira/browse/SPARK-47342
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Kent Yao
>            Priority: Major
>              Labels: pull-request-available
[jira] [Commented] (SPARK-47342) Support TimestampNTZ for DB2 TIMESTAMP WITH TIME ZONE
[ https://issues.apache.org/jira/browse/SPARK-47342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827258#comment-17827258 ]

Dongjoon Hyun commented on SPARK-47342:
---------------------------------------

Thank you for providing the context.

> Support TimestampNTZ for DB2 TIMESTAMP WITH TIME ZONE
> -----------------------------------------------------
>
>                 Key: SPARK-47342
>                 URL: https://issues.apache.org/jira/browse/SPARK-47342
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Kent Yao
>            Priority: Major
>              Labels: pull-request-available
[jira] [Resolved] (SPARK-47402) Upgrade `ZooKeeper` to 3.9.2
[ https://issues.apache.org/jira/browse/SPARK-47402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-47402.
-----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 45524
[https://github.com/apache/spark/pull/45524]

> Upgrade `ZooKeeper` to 3.9.2
> ----------------------------
>
>                 Key: SPARK-47402
>                 URL: https://issues.apache.org/jira/browse/SPARK-47402
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Build
>    Affects Versions: 4.0.0
>            Reporter: Dongjoon Hyun
>            Assignee: Dongjoon Hyun
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
[jira] [Updated] (SPARK-46305) Remove the special Zookeeper version in the `streaming-kafka-0-10` and `sql-kafka-0-10` modules
[ https://issues.apache.org/jira/browse/SPARK-46305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-46305:
----------------------------------
        Parent: SPARK-47046
    Issue Type: Sub-task  (was: Improvement)

> Remove the special Zookeeper version in the `streaming-kafka-0-10` and
> `sql-kafka-0-10` modules
> ----------------------------------------------------------------------
>
>                 Key: SPARK-46305
>                 URL: https://issues.apache.org/jira/browse/SPARK-46305
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Build
>    Affects Versions: 4.0.0
>            Reporter: Yang Jie
>            Assignee: Yang Jie
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 4.0.0
[jira] [Commented] (SPARK-39420) Support ANALYZE TABLE on v2 tables
[ https://issues.apache.org/jira/browse/SPARK-39420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827234#comment-17827234 ]

Felipe commented on SPARK-39420:
--------------------------------

Hi. The PR [https://github.com/apache/spark/pull/4] has been closed without being merged. Does anyone have updates about it? Is there any chance it will be implemented?

> Support ANALYZE TABLE on v2 tables
> ----------------------------------
>
>                 Key: SPARK-39420
>                 URL: https://issues.apache.org/jira/browse/SPARK-39420
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.4.0, 3.5.1, 3.3.4
>            Reporter: Felipe
>            Priority: Major
>              Labels: pull-request-available
>
> According to https://github.com/delta-io/delta/pull/840, to implement ANALYZE
> TABLE in Delta we need to add the missing APIs in Spark that allow a data
> source to report the file set used to calculate the stats.
[jira] [Updated] (SPARK-39420) Support ANALYZE TABLE on v2 tables
[ https://issues.apache.org/jira/browse/SPARK-39420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Felipe updated SPARK-39420:
---------------------------
    Affects Version/s: 3.3.4
                       3.5.1

> Support ANALYZE TABLE on v2 tables
> ----------------------------------
>
>                 Key: SPARK-39420
>                 URL: https://issues.apache.org/jira/browse/SPARK-39420
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.4.0, 3.5.1, 3.3.4
>            Reporter: Felipe
>            Priority: Major
>              Labels: pull-request-available
>
> According to https://github.com/delta-io/delta/pull/840 to implement ANALYZE
> TABLE in Delta, we need to add the missing APIs in Spark to allow a data
> source to report the file set to calculate the stats.
[jira] [Resolved] (SPARK-47396) Add a general mapping for TIME WITHOUT TIME ZONE to TimestampNTZType
[ https://issues.apache.org/jira/browse/SPARK-47396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-47396.
-----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 45519
[https://github.com/apache/spark/pull/45519]

> Add a general mapping for TIME WITHOUT TIME ZONE to TimestampNTZType
> --------------------------------------------------------------------
>
>                 Key: SPARK-47396
>                 URL: https://issues.apache.org/jira/browse/SPARK-47396
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Kent Yao
>            Assignee: Kent Yao
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
[jira] [Commented] (SPARK-47398) AQE doesn't allow for extension of InMemoryTableScanExec
[ https://issues.apache.org/jira/browse/SPARK-47398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827206#comment-17827206 ]

Raza Jafri commented on SPARK-47398:
------------------------------------

In `AdaptiveSparkPlanExec` we wrap `InMemoryTableScanExec` in `TableCacheQueryStageExec`. To accomplish this we currently match on the concrete Exec. I am proposing that we match on a trait instead, just like we do for `Exchange` by matching against `ShuffleExchangeLike` and `BroadcastExchangeLike`. In the RAPIDS Accelerator for Apache Spark, we replace `InMemoryTableScanExec` with our own version which does some optimizations. This could cause a problem: the benefits of SPARK-42101 might be lost, or in the worst case we look for the said Exec and throw an exception.

Looking at the current code, I propose the trait to be:

{code:java}
trait InMemoryTableScanLike extends LeafExecNode {

  /**
   * Returns whether the cache buffer is loaded.
   */
  def isMaterialized: Boolean

  /**
   * Returns the actual cached RDD without filters and serialization of row/columnar.
   */
  def baseCacheRDD(): RDD[CachedBatch]

  /**
   * Returns the runtime statistics after shuffle materialization.
   */
  def runtimeStatistics: Statistics
}
{code}

This is just based on what I know about how AQE is using it.

> AQE doesn't allow for extension of InMemoryTableScanExec
> --------------------------------------------------------
>
>                 Key: SPARK-47398
>                 URL: https://issues.apache.org/jira/browse/SPARK-47398
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.5.0, 3.5.1
>            Reporter: Raza Jafri
>            Priority: Major
>              Labels: pull-request-available
>
> As part of SPARK-42101, we added support to AQE for handling
> InMemoryTableScanExec.
> This change directly references `InMemoryTableScanExec` which limits users
> from extending the caching functionality that was added as part of
> SPARK-32274
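The trait-based matching proposed in the comment above can be sketched with a self-contained toy. All names below are illustrative stand-ins (a bare `InMemoryTableScanLike` trait, `DefaultScan`, `VendorScan`, `wrapsAsCacheStage`), not Spark's real classes or AQE's actual wrapping logic; the point is only that planner code matching on a trait picks up third-party replacements for free, while matching on a concrete class does not.

```scala
// Toy model of "match on a trait, not the concrete Exec" (illustrative names).
trait InMemoryTableScanLike {
  def isMaterialized: Boolean
}

// The built-in scan and a hypothetical vendor replacement (e.g. an
// accelerated one) both extend the trait.
case class DefaultScan(isMaterialized: Boolean) extends InMemoryTableScanLike
case class VendorScan(isMaterialized: Boolean) extends InMemoryTableScanLike

// AQE-style check: because the pattern matches the trait, both the default
// scan and any replacement qualify for cache-stage wrapping.
def wrapsAsCacheStage(plan: Any): Boolean = plan match {
  case _: InMemoryTableScanLike => true
  case _                        => false
}
```

This mirrors how `ShuffleExchangeLike` and `BroadcastExchangeLike` already let plugins substitute their own exchange implementations without breaking the matching in `AdaptiveSparkPlanExec`.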
[jira] [Updated] (SPARK-47398) AQE doesn't allow for extension of InMemoryTableScanExec
[ https://issues.apache.org/jira/browse/SPARK-47398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raza Jafri updated SPARK-47398:
-------------------------------
    Description: 
As part of SPARK-42101, we added support to AQE for handling InMemoryTableScanExec.

This change directly references `InMemoryTableScanExec` which limits users from extending the caching functionality that was added as part of SPARK-32274

  was:
As part of SPARK-42101, we added support to AQE for handling InMemoryTableScanExec.

This change directly references `InMemoryTableScanExec` which limits users from extending the caching functionality that was added as part of SPARK-32274

In `AdaptiveSparkPlanExec` we are wrapping `InMemoryTableScanExec` in `TableCacheQueryStageExec`. To accomplish this we are currently matching on the Exec, I am proposing that we should match on a trait instead just like how we do it for `Exchange` by matching against `ShuffleExchangeLike` and `BroadcastExchangeLike`. In the RAPIDS Accelerator for Apache Spark, we replace the `InMemoryTableScanExec` with our version which does some optimizations. This could cause a problem as the benefits of SPARK-42101 might be lost or the worst case could be that we try to look for the said Exec and throw an exception

Looking at the current code, I propose the trait to be as
{code:java}
trait InMemoryTableScanLike extends LeafExecNode {

  /**
   * Returns whether the cache buffer is loaded
   */
  def isMaterialized: Boolean

  /**
   * Returns the actual cached RDD without filters and serialization of row/columnar.
   */
  def baseCacheRDD(): RDD[CachedBatch]

  /**
   * Returns the runtime statistics after shuffle materialization.
   */
  def runtimeStatistics: Statistics
}
{code}
This is just based on what I know about how AQE is using it.

> AQE doesn't allow for extension of InMemoryTableScanExec
> --------------------------------------------------------
>
>                 Key: SPARK-47398
>                 URL: https://issues.apache.org/jira/browse/SPARK-47398
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.5.0, 3.5.1
>            Reporter: Raza Jafri
>            Priority: Major
>              Labels: pull-request-available
>
> As part of SPARK-42101, we added support to AQE for handling
> InMemoryTableScanExec.
> This change directly references `InMemoryTableScanExec` which limits users
> from extending the caching functionality that was added as part of
> SPARK-32274
[jira] [Updated] (SPARK-47398) AQE doesn't allow for extension of InMemoryTableScanExec
[ https://issues.apache.org/jira/browse/SPARK-47398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-47398:
-----------------------------------
    Labels: pull-request-available  (was: )

> AQE doesn't allow for extension of InMemoryTableScanExec
> --------------------------------------------------------
>
>                 Key: SPARK-47398
>                 URL: https://issues.apache.org/jira/browse/SPARK-47398
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.5.0, 3.5.1
>            Reporter: Raza Jafri
>            Priority: Major
>              Labels: pull-request-available
>
> As part of SPARK-42101, we added support to AQE for handling
> InMemoryTableScanExec.
> This change directly references `InMemoryTableScanExec` which limits users
> from extending the caching functionality that was added as part of
> SPARK-32274
> In `AdaptiveSparkPlanExec` we are wrapping `InMemoryTableScanExec` in
> `TableCacheQueryStageExec`. To accomplish this we are currently matching on
> the Exec, I am proposing that we should match on a trait instead just like
> how we do it for `Exchange` by matching against `ShuffleExchangeLike` and
> `BroadcastExchangeLike`. In the RAPIDS Accelerator for Apache Spark, we
> replace the `InMemoryTableScanExec` with our version which does some
> optimizations. This could cause a problem as the benefits of SPARK-42101
> might be lost or the worst case could be that we try to look for the said
> Exec and throw an exception
>
> Looking at the current code, I propose the trait to be as
> {code:java}
> trait InMemoryTableScanLike extends LeafExecNode {
>
>   /**
>    * Returns whether the cache buffer is loaded
>    */
>   def isMaterialized: Boolean
>
>   /**
>    * Returns the actual cached RDD without filters and serialization of
>    * row/columnar.
>    */
>   def baseCacheRDD(): RDD[CachedBatch]
>
>   /**
>    * Returns the runtime statistics after shuffle materialization.
>    */
>   def runtimeStatistics: Statistics
> } {code}
> This is just based on what I know about how AQE is using it.
[jira] [Updated] (SPARK-47398) AQE doesn't allow for extension of InMemoryTableScanExec
[ https://issues.apache.org/jira/browse/SPARK-47398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raza Jafri updated SPARK-47398:
-------------------------------
    Description: 
As part of SPARK-42101, we added support to AQE for handling InMemoryTableScanExec.

This change directly references `InMemoryTableScanExec` which limits users from extending the caching functionality that was added as part of SPARK-32274

In `AdaptiveSparkPlanExec` we are wrapping `InMemoryTableScanExec` in `TableCacheQueryStageExec`. To accomplish this we are currently matching on the Exec, I am proposing that we should match on a trait instead just like how we do it for `Exchange` by matching against `ShuffleExchangeLike` and `BroadcastExchangeLike`. In the RAPIDS Accelerator for Apache Spark, we replace the `InMemoryTableScanExec` with our version which does some optimizations. This could cause a problem as the benefits of SPARK-42101 might be lost or the worst case could be that we try to look for the said Exec and throw an exception

Looking at the current code, I propose the trait to be as
{code:java}
trait InMemoryTableScanLike extends LeafExecNode {

  /**
   * Returns whether the cache buffer is loaded
   */
  def isMaterialized: Boolean

  /**
   * Returns the actual cached RDD without filters and serialization of row/columnar.
   */
  def baseCacheRDD(): RDD[CachedBatch]

  /**
   * Returns the runtime statistics after shuffle materialization.
   */
  def runtimeStatistics: Statistics
}
{code}
This is just based on what I know about how AQE is using it.

  was:
As part of SPARK-42101, we added support to AQE for handling InMemoryTableScanExec.

This change directly references `InMemoryTableScanExec` which limits users from extending the caching functionality that was added as part of SPARK-32274

In `AdaptiveSparkPlanExec` we are wrapping `InMemoryTableScanExec` in `TableCacheQueryStageExec`. To accomplish this we are currently matching on the Exec, I am proposing that we should match on a trait instead just like how we do it for `Exchange` by matching against `ShuffleExchangeLike` and `BroadcastExchangeLike`. In the RAPIDS Accelerator for Apache Spark, we replace the `InMemoryTableScanExec` with our version which does some optimizations. This could cause a problem as the benefits of SPARK-42101 might be lost or worst case could be that we try to look for the

Looking at the current code, I propose the trait to be as
{code:java}
trait InMemoryTableScanLike extends LeafExecNode {

  /**
   * Returns whether the cache buffer is loaded
   */
  def isMaterialized: Boolean

  /**
   * Returns the actual cached RDD without filters and serialization of row/columnar.
   */
  def baseCacheRDD(): RDD[CachedBatch]

  /**
   * Returns the runtime statistics after shuffle materialization.
   */
  def runtimeStatistics: Statistics
}
{code}
This is just based on what I know about how AQE is using it.

> AQE doesn't allow for extension of InMemoryTableScanExec
> --------------------------------------------------------
>
>                 Key: SPARK-47398
>                 URL: https://issues.apache.org/jira/browse/SPARK-47398
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.5.0, 3.5.1
>            Reporter: Raza Jafri
>            Priority: Major
>
> As part of SPARK-42101, we added support to AQE for handling
> InMemoryTableScanExec.
> This change directly references `InMemoryTableScanExec` which limits users
> from extending the caching functionality that was added as part of
> SPARK-32274
> In `AdaptiveSparkPlanExec` we are wrapping `InMemoryTableScanExec` in
> `TableCacheQueryStageExec`. To accomplish this we are currently matching on
> the Exec, I am proposing that we should match on a trait instead just like
> how we do it for `Exchange` by matching against `ShuffleExchangeLike` and
> `BroadcastExchangeLike`. In the RAPIDS Accelerator for Apache Spark, we
> replace the `InMemoryTableScanExec` with our version which does some
> optimizations. This could cause a problem as the benefits of SPARK-42101
> might be lost or the worst case could be that we try to look for the said
> Exec and throw an exception
>
> Looking at the current code, I propose the trait to be as
> {code:java}
> trait InMemoryTableScanLike extends LeafExecNode {
>
>   /**
>    * Returns whether the cache buffer is loaded
>    */
>   def isMaterialized: Boolean
>
>   /**
>    * Returns the actual cached RDD without filters and serialization of
>    * row/columnar.
>    */
>   def baseCacheRDD(): RDD[CachedBatch]
>
>   /**
>    * Returns the runtime statistics after shuffle materialization.
>    */
>   def runtimeStatistics: Statistics
> } {code}
> This is just based on what I know about how AQE is using it.
[jira] [Updated] (SPARK-47398) AQE doesn't allow for extension of InMemoryTableScanExec
[ https://issues.apache.org/jira/browse/SPARK-47398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raza Jafri updated SPARK-47398:
-------------------------------
    Description: 
As part of SPARK-42101, we added support to AQE for handling InMemoryTableScanExec.

This change directly references `InMemoryTableScanExec` which limits users from extending the caching functionality that was added as part of SPARK-32274

In `AdaptiveSparkPlanExec` we are wrapping `InMemoryTableScanExec` in `TableCacheQueryStageExec`. To accomplish this we are currently matching on the Exec, I am proposing that we should match on a trait instead just like how we do it for `Exchange` by matching against `ShuffleExchangeLike` and `BroadcastExchangeLike`. In the RAPIDS Accelerator for Apache Spark, we replace the `InMemoryTableScanExec` with our version which does some optimizations. This could cause a problem as the benefits of SPARK-42101 might be lost or worst case could be that we try to look for the

Looking at the current code, I propose the trait to be as
{code:java}
trait InMemoryTableScanLike extends LeafExecNode {

  /**
   * Returns whether the cache buffer is loaded
   */
  def isMaterialized: Boolean

  /**
   * Returns the actual cached RDD without filters and serialization of row/columnar.
   */
  def baseCacheRDD(): RDD[CachedBatch]

  /**
   * Returns the runtime statistics after shuffle materialization.
   */
  def runtimeStatistics: Statistics
}
{code}
This is just based on what I know about how AQE is using it.

  was:
As part of SPARK-42101, we added support to AQE for handling InMemoryTableScanExec.

This change directly references `InMemoryTableScanExec` which limits users from extending the caching functionality that was added as part of SPARK-32274

In `AdaptiveSparkPlanExec` we are wrapping `InMemoryTableScanExec` in `TableCacheQueryStageExec`. To accomplish this we are currently matching on the Exec, I am proposing that we should match on a trait instead just like how we do it for `Exchange` by matching against `ShuffleExchangeLike` and `BroadcastExchangeLike`. In the RAPIDS Accelerator for Apache Spark, we replace the `InMemoryTableScanExec` with our version which does some optimizations. This could cause a problem as the benefits of SPARK-42101, might be lost

Looking at the current code, I propose the trait to be as
{code:java}
trait InMemoryTableScanLike extends LeafExecNode {

  /**
   * Returns whether the cache buffer is loaded
   */
  def isMaterialized: Boolean

  /**
   * Returns the actual cached RDD without filters and serialization of row/columnar.
   */
  def baseCacheRDD(): RDD[CachedBatch]

  /**
   * Returns the runtime statistics after shuffle materialization.
   */
  def runtimeStatistics: Statistics
}
{code}
This is just based on what I know about how AQE is using it.

> AQE doesn't allow for extension of InMemoryTableScanExec
> --------------------------------------------------------
>
>                 Key: SPARK-47398
>                 URL: https://issues.apache.org/jira/browse/SPARK-47398
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.5.0, 3.5.1
>            Reporter: Raza Jafri
>            Priority: Major
>
> As part of SPARK-42101, we added support to AQE for handling
> InMemoryTableScanExec.
> This change directly references `InMemoryTableScanExec` which limits users
> from extending the caching functionality that was added as part of
> SPARK-32274
> In `AdaptiveSparkPlanExec` we are wrapping `InMemoryTableScanExec` in
> `TableCacheQueryStageExec`. To accomplish this we are currently matching on
> the Exec, I am proposing that we should match on a trait instead just like
> how we do it for `Exchange` by matching against `ShuffleExchangeLike` and
> `BroadcastExchangeLike`. In the RAPIDS Accelerator for Apache Spark, we
> replace the `InMemoryTableScanExec` with our version which does some
> optimizations. This could cause a problem as the benefits of SPARK-42101
> might be lost or worst case could be that we try to look for the
>
> Looking at the current code, I propose the trait to be as
> {code:java}
> trait InMemoryTableScanLike extends LeafExecNode {
>
>   /**
>    * Returns whether the cache buffer is loaded
>    */
>   def isMaterialized: Boolean
>
>   /**
>    * Returns the actual cached RDD without filters and serialization of
>    * row/columnar.
>    */
>   def baseCacheRDD(): RDD[CachedBatch]
>
>   /**
>    * Returns the runtime statistics after shuffle materialization.
>    */
>   def runtimeStatistics: Statistics
> } {code}
> This is just based on what I know about how AQE is using it.
[jira] [Updated] (SPARK-47398) AQE doesn't allow for extension of InMemoryTableScanExec
[ https://issues.apache.org/jira/browse/SPARK-47398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raza Jafri updated SPARK-47398: --- Description: As part of SPARK-42101, we added support to AQE for handling InMemoryTableScanExec. This change directly references `InMemoryTableScanExec` which limits users from extending the caching functionality that was added as part of SPARK-32274 In `AdaptiveSparkPlanExec` we are wrapping `InMemoryTableScanExec` in `TableCacheQueryStageExec`. To accomplish this we are currently matching on the Exec, I am proposing that we should match on a trait instead just like how we do it for `Exchange` by matching against `ShuffleExchangeLike` and `BroadcastExchangeLike`. In the RAPIDS Accelerator for Apache Spark, we replace the `InMemoryTableScanExec` with our version which does some optimizations. This could cause a problem as the benefits of SPARK-42101, might be lost Looking at the current code, I propose the trait to be as {code:java} trait InMemoryTableScanLike extends LeafExecNode { /** * Returns whether the cache buffer is loaded */ def isMaterialized: Boolean /** * Returns the actual cached RDD without filters and serialization of row/columnar. */ def baseCacheRDD(): RDD[CachedBatch] /** * Returns the runtime statistics after shuffle materialization. */ def runtimeStatistics: Statistics } {code} This is just based on what I know about how AQE is using it. was: As part of SPARK-42101 we added support to AQE for handling InMemoryTableScanExec. This change directly references `InMemoryTableScanExec` which limits users from extending the caching functionality that was added as part of SPARK-32274 In `AdaptiveSparkPlanExec` we are wrapping `InMemoryTableScanExec` in `TableCacheQueryStageExec`. To accomplish this we are currently matching on the Exec, I am proposing that we should match on a trait instead just like how we do it for `Exchange` by matching against `ShuffleExchangeLike` and `BroadcastExchangeLike`. 
Looking at the current code, I propose the trait to be as {code:java} trait InMemoryTableScanLike extends LeafExecNode { /** * Returns whether the cache buffer is loaded */ def isMaterialized: Boolean /** * Returns the actual cached RDD without filters and serialization of row/columnar. */ def baseCacheRDD(): RDD[CachedBatch] /** * Returns the runtime statistics after shuffle materialization. */ def runtimeStatistics: Statistics } {code} This is just based on what I know about how AQE is using it. > AQE doesn't allow for extension of InMemoryTableScanExec > > > Key: SPARK-47398 > URL: https://issues.apache.org/jira/browse/SPARK-47398 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0, 3.5.1 >Reporter: Raza Jafri >Priority: Major > > As part of SPARK-42101, we added support to AQE for handling > InMemoryTableScanExec. > This change directly references `InMemoryTableScanExec` which limits users > from extending the caching functionality that was added as part of > SPARK-32274 > In `AdaptiveSparkPlanExec` we are wrapping `InMemoryTableScanExec` in > `TableCacheQueryStageExec`. To accomplish this we are currently matching on > the Exec, I am proposing that we should match on a trait instead just like > how we do it for `Exchange` by matching against `ShuffleExchangeLike` and > `BroadcastExchangeLike`. In the RAPIDS Accelerator for Apache Spark, we > replace the `InMemoryTableScanExec` with our version which does some > optimizations. This could cause a problem as the benefits of SPARK-42101, > might be lost > > Looking at the current code, I propose the trait to be as > {code:java} > trait InMemoryTableScanLike extends LeafExecNode { > /** > * Returns whether the cache buffer is loaded > */ > def isMaterialized: Boolean > /** > * Returns the actual cached RDD without filters and serialization of > row/columnar. > */ > def baseCacheRDD(): RDD[CachedBatch] > /** > * Returns the runtime statistics after shuffle materialization. 
> */ > def runtimeStatistics: Statistics > } {code} > This is just based on what I know about how AQE is using it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47387) Remove some unused error classes
[ https://issues.apache.org/jira/browse/SPARK-47387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-47387. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45509 [https://github.com/apache/spark/pull/45509] > Remove some unused error classes > > > Key: SPARK-47387 > URL: https://issues.apache.org/jira/browse/SPARK-47387 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47387) Remove some unused error classes
[ https://issues.apache.org/jira/browse/SPARK-47387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-47387: Assignee: BingKun Pan > Remove some unused error classes > > > Key: SPARK-47387 > URL: https://issues.apache.org/jira/browse/SPARK-47387 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47398) AQE doesn't allow for extension of InMemoryTableScanExec
[ https://issues.apache.org/jira/browse/SPARK-47398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raza Jafri updated SPARK-47398: --- Description: As part of SPARK-42101 we added support to AQE for handling InMemoryTableScanExec. This change directly references `InMemoryTableScanExec` which limits users from extending the caching functionality that was added as part of SPARK-32274 In `AdaptiveSparkPlanExec` we are wrapping `InMemoryTableScanExec` in `TableCacheQueryStageExec`. To accomplish this we are currently matching on the Exec, I am proposing that we should match on a trait instead just like how we do it for `Exchange` by matching against `ShuffleExchangeLike` and `BroadcastExchangeLike`. Looking at the current code, I propose the trait to be as {code:java} trait InMemoryTableScanLike extends LeafExecNode { /** * Returns whether the cache buffer is loaded */ def isMaterialized: Boolean /** * Returns the actual cached RDD without filters and serialization of row/columnar. */ def baseCacheRDD(): RDD[CachedBatch] /** * Returns the runtime statistics after shuffle materialization. */ def runtimeStatistics: Statistics } {code} This is just based on what I know about how AQE is using it. was: As part of SPARK-42101 we added support to AQE for handling InMemoryTableScanExec. This change directly references `InMemoryTableScanExec` which limits users from extending the caching functionality that was added as part of SPARK-32274 In AdaptiveSparkPlanExec we are wrapping InMemoryTableScanExec in TableCacheQueryStageExec. To accomplish this we are currently matching on the Exec, I am proposing that we should match on a trait instead just like how we do it for Exchange by matching against ShuffleExchangeLike and BroadcastExchangeLike. 
Looking at the current code, I propose the trait to be as {code:java} trait InMemoryTableScanLike extends LeafExecNode { /** * Returns whether the cache buffer is loaded */ def isMaterialized: Boolean /** * Returns the actual cached RDD without filters and serialization of row/columnar. */ def baseCacheRDD(): RDD[CachedBatch] /** * Returns the runtime statistics after shuffle materialization. */ def runtimeStatistics: Statistics } {code} This is just based on what I know about how AQE is using it. > AQE doesn't allow for extension of InMemoryTableScanExec > > > Key: SPARK-47398 > URL: https://issues.apache.org/jira/browse/SPARK-47398 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0, 3.5.1 >Reporter: Raza Jafri >Priority: Major > > As part of SPARK-42101 we added support to AQE for handling > InMemoryTableScanExec. > This change directly references `InMemoryTableScanExec` which limits users > from extending the caching functionality that was added as part of > SPARK-32274 > In `AdaptiveSparkPlanExec` we are wrapping `InMemoryTableScanExec` in > `TableCacheQueryStageExec`. To accomplish this we are currently matching on > the Exec, I am proposing that we should match on a trait instead just like > how we do it for `Exchange` by matching against `ShuffleExchangeLike` and > `BroadcastExchangeLike`. > > Looking at the current code, I propose the trait to be as > {code:java} > trait InMemoryTableScanLike extends LeafExecNode { > /** > * Returns whether the cache buffer is loaded > */ > def isMaterialized: Boolean > /** > * Returns the actual cached RDD without filters and serialization of > row/columnar. > */ > def baseCacheRDD(): RDD[CachedBatch] > /** > * Returns the runtime statistics after shuffle materialization. > */ > def runtimeStatistics: Statistics > } {code} > This is just based on what I know about how AQE is using it. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47398) AQE doesn't allow for extension of InMemoryTableScanExec
[ https://issues.apache.org/jira/browse/SPARK-47398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raza Jafri updated SPARK-47398: --- Description: As part of SPARK-42101 we added support to AQE for handling InMemoryTableScanExec. This change directly references `InMemoryTableScanExec` which limits users from extending the caching functionality that was added as part of SPARK-32274 In AdaptiveSparkPlanExec we are wrapping InMemoryTableScanExec in TableCacheQueryStageExec. To accomplish this we are currently matching on the Exec, I am proposing that we should match on a trait instead just like how we do it for Exchange by matching against ShuffleExchangeLike and BroadcastExchangeLike. Looking at the current code, I propose the trait to be as {code:java} trait InMemoryTableScanLike extends LeafExecNode { /** * Returns whether the cache buffer is loaded */ def isMaterialized: Boolean /** * Returns the actual cached RDD without filters and serialization of row/columnar. */ def baseCacheRDD(): RDD[CachedBatch] /** * Returns the runtime statistics after shuffle materialization. */ def runtimeStatistics: Statistics } {code} This is just based on what I know about how AQE is using it. was: As part of SPARK-42101 we added support to AQE for handling InMemoryTableScanExec. This change directly references `InMemoryTableScanExec` which limits users from extending the caching functionality that was added as part of SPARK-32274 In `AdaptiveSparkPlanExec` we are wrapping `InMemoryTableScanExec` in `TableCacheQueryStageExec`. To accomplish this we are currently matching on the Exec, I am proposing that we should match on a trait instead just like how we do it for `Exchange` by matching against `ShuffleExchangeLike` and `BroadcastExchangeLike`. 
Looking at the current code, I propose the trait to be as ``` trait InMemoryTableScanLike extends LeafExecNode { /** * Returns whether the cache buffer is loaded */ def isMaterialized: Boolean /** * Returns the actual cached RDD without filters and serialization of row/columnar. */ def baseCacheRDD(): RDD[CachedBatch] /** * Returns the runtime statistics after shuffle materialization. */ def runtimeStatistics: Statistics } ``` This is just based on what I know about how AQE is using it. > AQE doesn't allow for extension of InMemoryTableScanExec > > > Key: SPARK-47398 > URL: https://issues.apache.org/jira/browse/SPARK-47398 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0, 3.5.1 >Reporter: Raza Jafri >Priority: Major > > As part of SPARK-42101 we added support to AQE for handling > InMemoryTableScanExec. > This change directly references `InMemoryTableScanExec` which limits users > from extending the caching functionality that was added as part of > SPARK-32274 > In AdaptiveSparkPlanExec we are wrapping InMemoryTableScanExec in > TableCacheQueryStageExec. To accomplish this we are currently matching on the > Exec, I am proposing that we should match on a trait instead just like how we > do it for Exchange by matching against ShuffleExchangeLike and > BroadcastExchangeLike. > > Looking at the current code, I propose the trait to be as > {code:java} > trait InMemoryTableScanLike extends LeafExecNode { > /** > * Returns whether the cache buffer is loaded > */ > def isMaterialized: Boolean > /** > * Returns the actual cached RDD without filters and serialization of > row/columnar. > */ > def baseCacheRDD(): RDD[CachedBatch] > /** > * Returns the runtime statistics after shuffle materialization. > */ > def runtimeStatistics: Statistics > } {code} > This is just based on what I know about how AQE is using it. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47398) AQE doesn't allow for extension of InMemoryTableScanExec
[ https://issues.apache.org/jira/browse/SPARK-47398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raza Jafri updated SPARK-47398: --- Description: As part of SPARK-42101 we added support to AQE for handling InMemoryTableScanExec. This change directly references `InMemoryTableScanExec` which limits users from extending the caching functionality that was added as part of SPARK-32274 In `AdaptiveSparkPlanExec` we are wrapping `InMemoryTableScanExec` in `TableCacheQueryStageExec`. To accomplish this we are currently matching on the Exec, I am proposing that we should match on a trait instead just like how we do it for `Exchange` by matching against `ShuffleExchangeLike` and `BroadcastExchangeLike`. Looking at the current code, I propose the trait to be as ``` trait InMemoryTableScanLike extends LeafExecNode { /** * Returns whether the cache buffer is loaded */ def isMaterialized: Boolean /** * Returns the actual cached RDD without filters and serialization of row/columnar. */ def baseCacheRDD(): RDD[CachedBatch] /** * Returns the runtime statistics after shuffle materialization. */ def runtimeStatistics: Statistics } ``` This is just based on what I know about how AQE is using it. was: As part of SPARK-42101 we added support to AQE for handling InMemoryTableScanExec. This change directly references `InMemoryTableScanExec` which limits users from extending the caching functionality that was added as part of SPARK-32274 > AQE doesn't allow for extension of InMemoryTableScanExec > > > Key: SPARK-47398 > URL: https://issues.apache.org/jira/browse/SPARK-47398 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0, 3.5.1 >Reporter: Raza Jafri >Priority: Major > > As part of SPARK-42101 we added support to AQE for handling > InMemoryTableScanExec. 
> This change directly references `InMemoryTableScanExec` which limits users > from extending the caching functionality that was added as part of > SPARK-32274 > In `AdaptiveSparkPlanExec` we are wrapping `InMemoryTableScanExec` in > `TableCacheQueryStageExec`. To accomplish this we are currently matching on > the Exec, I am proposing that we should match on a trait instead just like > how we do it for `Exchange` by matching against `ShuffleExchangeLike` and > `BroadcastExchangeLike`. > > Looking at the current code, I propose the trait to be as > > ``` > trait InMemoryTableScanLike extends LeafExecNode { > /** > * Returns whether the cache buffer is loaded > */ > def isMaterialized: Boolean > /** > * Returns the actual cached RDD without filters and serialization of > row/columnar. > */ > def baseCacheRDD(): RDD[CachedBatch] > /** > * Returns the runtime statistics after shuffle materialization. > */ > def runtimeStatistics: Statistics > } > ``` > This is just based on what I know about how AQE is using it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47398) AQE doesn't allow for extension of InMemoryTableScanExec
[ https://issues.apache.org/jira/browse/SPARK-47398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raza Jafri updated SPARK-47398: --- Target Version/s: (was: 4.0.0) > AQE doesn't allow for extension of InMemoryTableScanExec > > > Key: SPARK-47398 > URL: https://issues.apache.org/jira/browse/SPARK-47398 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0, 3.5.1 >Reporter: Raza Jafri >Priority: Major > > As part of SPARK-42101 we added support to AQE for handling > InMemoryTableScanExec. > This change directly references `InMemoryTableScanExec` which limits users > from extending the caching functionality that was added as part of > SPARK-32274 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47401) Update `YuniKorn` docs with v1.5
[ https://issues.apache.org/jira/browse/SPARK-47401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47401. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45523 [https://github.com/apache/spark/pull/45523] > Update `YuniKorn` docs with v1.5 > > > Key: SPARK-47401 > URL: https://issues.apache.org/jira/browse/SPARK-47401 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Kubernetes >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47401) Update `YuniKorn` docs with v1.5
[ https://issues.apache.org/jira/browse/SPARK-47401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-47401: - Assignee: Dongjoon Hyun > Update `YuniKorn` docs with v1.5 > > > Key: SPARK-47401 > URL: https://issues.apache.org/jira/browse/SPARK-47401 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Kubernetes >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46210) Update `YuniKorn` docs with v1.4
[ https://issues.apache.org/jira/browse/SPARK-46210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-46210: -- Summary: Update `YuniKorn` docs with v1.4 (was: Update YuniKorn docs with v1.4) > Update `YuniKorn` docs with v1.4 > > > Key: SPARK-46210 > URL: https://issues.apache.org/jira/browse/SPARK-46210 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47401) Update `YuniKorn` docs with v1.5
[ https://issues.apache.org/jira/browse/SPARK-47401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47401: --- Labels: pull-request-available (was: ) > Update `YuniKorn` docs with v1.5 > > > Key: SPARK-47401 > URL: https://issues.apache.org/jira/browse/SPARK-47401 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Kubernetes >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47401) Update `YuniKorn` docs with v1.5
Dongjoon Hyun created SPARK-47401: - Summary: Update `YuniKorn` docs with v1.5 Key: SPARK-47401 URL: https://issues.apache.org/jira/browse/SPARK-47401 Project: Spark Issue Type: Sub-task Components: Documentation, Kubernetes Affects Versions: 4.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47400) Upgrade `gcs-connector` to 2.2.20
[ https://issues.apache.org/jira/browse/SPARK-47400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47400. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45521 [https://github.com/apache/spark/pull/45521] > Upgrade `gcs-connector` to 2.2.20 > - > > Key: SPARK-47400 > URL: https://issues.apache.org/jira/browse/SPARK-47400 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44197) Upgrade Hadoop to 3.3.6
[ https://issues.apache.org/jira/browse/SPARK-44197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-44197: -- Fix Version/s: 4.0.0 (was: 3.5.0) > Upgrade Hadoop to 3.3.6 > --- > > Key: SPARK-44197 > URL: https://issues.apache.org/jira/browse/SPARK-44197 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47337) Bump DB2 Docker version to 11.5.8.0
[ https://issues.apache.org/jira/browse/SPARK-47337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-47337: -- Parent Issue: SPARK-47361 (was: SPARK-47046) > Bump DB2 Docker version to 11.5.8.0 > --- > > Key: SPARK-47337 > URL: https://issues.apache.org/jira/browse/SPARK-47337 > Project: Spark > Issue Type: Sub-task > Components: Spark Docker >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-47337) Bump DB2 Docker version to 11.5.8.0
[ https://issues.apache.org/jira/browse/SPARK-47337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827152#comment-17827152 ] Dongjoon Hyun commented on SPARK-47337: --- I changed this subtask's parent from SPARK-47046 to SPARK-47361. > Bump DB2 Docker version to 11.5.8.0 > --- > > Key: SPARK-47337 > URL: https://issues.apache.org/jira/browse/SPARK-47337 > Project: Spark > Issue Type: Sub-task > Components: Spark Docker >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45850) Upgrade oracle jdbc driver to 23.3.0.23.09
[ https://issues.apache.org/jira/browse/SPARK-45850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45850: -- Parent Issue: SPARK-47361 (was: SPARK-47046) > Upgrade oracle jdbc driver to 23.3.0.23.09 > --- > > Key: SPARK-45850 > URL: https://issues.apache.org/jira/browse/SPARK-45850 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > > Attempt to stabilize `OracleIntegrationSuite ` by upgrading the oracle jdbc > driver version. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47384) Upgrade RoaringBitmap to 1.0.5
[ https://issues.apache.org/jira/browse/SPARK-47384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47384. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45507 [https://github.com/apache/spark/pull/45507] > Upgrade RoaringBitmap to 1.0.5 > -- > > Key: SPARK-47384 > URL: https://issues.apache.org/jira/browse/SPARK-47384 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47395) Add collate and collation to non-sql APIs
[ https://issues.apache.org/jira/browse/SPARK-47395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47395: --- Labels: pull-request-available (was: ) > Add collate and collation to non-sql APIs > - > > Key: SPARK-47395 > URL: https://issues.apache.org/jira/browse/SPARK-47395 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Stefan Kandic >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47399) Disable generated columns on expressions with collations
[ https://issues.apache.org/jira/browse/SPARK-47399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47399: --- Labels: pull-request-available (was: ) > Disable generated columns on expressions with collations > > > Key: SPARK-47399 > URL: https://issues.apache.org/jira/browse/SPARK-47399 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Stefan Kandic >Priority: Major > Labels: pull-request-available > > Changing the collation of a column or even just changing the ICU version > could lead to differences in the resulting expression, so it would be best > to simply disable it for now. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47399) Disable generated columns on expressions with collations
Stefan Kandic created SPARK-47399: - Summary: Disable generated columns on expressions with collations Key: SPARK-47399 URL: https://issues.apache.org/jira/browse/SPARK-47399 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Stefan Kandic Changing the collation of a column or even just changing the ICU version could lead to differences in the resulting expression, so it would be best to simply disable it for now. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47398) AQE doesn't allow for extension of InMemoryTableScanExec
Raza Jafri created SPARK-47398: -- Summary: AQE doesn't allow for extension of InMemoryTableScanExec Key: SPARK-47398 URL: https://issues.apache.org/jira/browse/SPARK-47398 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.1, 3.5.0 Reporter: Raza Jafri As part of SPARK-42101, we added support to AQE for handling InMemoryTableScanExec. This change directly references `InMemoryTableScanExec`, which prevents users from extending the caching functionality that was added as part of SPARK-32274. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-46876) Data is silently lost in Tab separated CSV with empty (whitespace) rows
[ https://issues.apache.org/jira/browse/SPARK-46876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827135#comment-17827135 ] Martin Rueckl commented on SPARK-46876: --- [~doki] any chance to make progress on this? > Data is silently lost in Tab separated CSV with empty (whitespace) rows > --- > > Key: SPARK-46876 > URL: https://issues.apache.org/jira/browse/SPARK-46876 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.4.1 >Reporter: Martin Rueckl >Priority: Critical > Labels: pull-request-available > > When reading a tab separated file that contains lines that only contain tabs > (i.e. empty strings as values of the columns for that row), these rows > will silently be skipped (as empty lines) and the resulting dataframe will > have fewer rows than expected. > This behavior is inconsistent with the behavior for e.g. semicolon separated > files, where the resulting dataframe will have a row with only empty string > values. > A minimal reproducible example: a file containing > {code:java} > a\tb\tc\r\n > \t\t\r\n > 1\t2\t3{code} > will create a dataframe with one row (a=1,b=2,c=3), > whereas > {code:java} > a;b;c\r\n > ;;\r\n > 1;2;3{code} > will read as two rows (the first row contains empty strings). > I used the following PySpark commands to read the dataframes > {code:java} > spark.read.option("header","true").option("sep","\t").csv("<file>").collect() > spark.read.option("header","true").option("sep",";").csv("<file>").collect() > {code} > I ran into this particularly on Databricks (I assume they use the same > reader), but [this stack overflow > post|https://stackoverflow.com/questions/47823858/replacing-empty-lines-with-characters-when-reading-csv-using-spark#comment137288546_47823858] > indicates that this is an old issue that may have been taken over from > Databricks when their CSV reader was adopted in SPARK-12420. > I recommend at least adding a test case for this to the CSV reader. > > Why is this behavior a problem: > * It violates some core assumptions > ** a properly configured roundtrip via CSV write/read should result in the > same set of rows > ** changing the CSV separator (when everything is properly escaped) should > have no effect > Potential resolutions: > * When the configured delimiter consists of only whitespace > ** deactivate the "skip empty line" feature > ** or skip only lines that are completely empty (only a (carriage return) > newline) > * Change the skip empty line feature to only skip if the line is completely > empty (only contains a newline) > ** this may break some user code that relies on the current behavior -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
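The skip described above can be reasoned about without Spark: a row whose columns are all empty in a tab-separated file consists solely of whitespace, so any "skip blank lines" check that strips whitespace drops it, while the semicolon-separated equivalent survives. This Python sketch illustrates the suspected mechanism; it is not Spark's actual parser code.

```python
# A row of three empty columns in a tab-separated file is just "\t\t",
# which is whitespace-only. The semicolon equivalent ";;" is not.
tab_row = "\t\t"
semi_row = ";;"

def looks_blank(line: str) -> bool:
    # The kind of "skip empty lines" test that silently drops
    # whitespace-only rows.
    return line.strip() == ""

print(looks_blank(tab_row))   # True  -> row would be skipped
print(looks_blank(semi_row))  # False -> row parses as three empty strings
```

This is why the behavior only appears when the delimiter is itself whitespace, matching the "when the configured delimiter consists of only whitespace" resolution proposed in the ticket.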
[jira] [Updated] (SPARK-47397) count_distinct ignores null values
[ https://issues.apache.org/jira/browse/SPARK-47397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Rueckl updated SPARK-47397: -- Description: The documentation states that in group by and count statements, null values will not be ignored / form their own groups. !image-2024-03-14-16-13-03-107.png|width=491,height=373! However, the behavior of count_distinct does not account for nulls. Either the documentation or the implementation is wrong here... !image-2024-03-14-16-12-35-267.png! was: The documentation states that in group by and count statements, null values will not be ignored / form their own groups. !image-2024-03-14-16-09-20-045.png|width=441,height=327! However, the behavior of count_distinct does not account for nulls. Either the documentation or the implementation is wrong here... !image-2024-03-14-16-12-35-267.png! > count_distinct ignores null values > -- > > Key: SPARK-47397 > URL: https://issues.apache.org/jira/browse/SPARK-47397 > Project: Spark > Issue Type: Bug > Components: Documentation, Spark Core >Affects Versions: 3.4.1 >Reporter: Martin Rueckl >Priority: Critical > Attachments: image-2024-03-14-16-12-35-267.png, > image-2024-03-14-16-13-03-107.png > > > The documentation states that in group by and count statements, null values > will not be ignored / form their own groups. > !image-2024-03-14-16-13-03-107.png|width=491,height=373! > However, the behavior of count_distinct does not account for nulls. > Either the documentation or the implementation is wrong here... > !image-2024-03-14-16-12-35-267.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
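The two behaviors the report contrasts can be modeled in plain Python (this is an illustrative model, not Spark code): SQL-style `COUNT(DISTINCT col)` drops NULLs before counting, while a `DISTINCT` projection or `GROUP BY` keeps NULL as its own group.

```python
# None plays the role of SQL NULL in this model.
def count_distinct(values):
    # COUNT(DISTINCT ...) semantics: nulls are excluded before counting.
    return len({v for v in values if v is not None})

def distinct_groups(values):
    # SELECT DISTINCT ... / GROUP BY semantics: null forms its own group.
    return set(values)

data = ["a", "b", None, "a", None]
print(count_distinct(data))        # 2 -> the two None values are not counted
print(len(distinct_groups(data)))  # 3 -> None forms its own group
```

The mismatch between these two numbers is exactly the documentation/implementation discrepancy the ticket is about.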
[jira] [Updated] (SPARK-47397) count_distinct ignores null values
[ https://issues.apache.org/jira/browse/SPARK-47397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Rueckl updated SPARK-47397: -- Attachment: image-2024-03-14-16-13-03-107.png > count_distinct ignores null values > -- > > Key: SPARK-47397 > URL: https://issues.apache.org/jira/browse/SPARK-47397 > Project: Spark > Issue Type: Bug > Components: Documentation, Spark Core >Affects Versions: 3.4.1 >Reporter: Martin Rueckl >Priority: Critical > Attachments: image-2024-03-14-16-12-35-267.png, > image-2024-03-14-16-13-03-107.png > > > The documentation states that in group by and count statements, null values > will not be ignored / form their own groups. > !image-2024-03-14-16-09-20-045.png|width=441,height=327! > However, the behavior of count_distinct does not account for nulls. > Either the documentation or the implementation is wrong here... > !image-2024-03-14-16-12-35-267.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47397) count_distinct ignores null values
[ https://issues.apache.org/jira/browse/SPARK-47397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Rueckl updated SPARK-47397: -- Description: The documentation states that in group by and count statements, null values will not be ignored / form their own groups. !image-2024-03-14-16-09-20-045.png|width=441,height=327! However, the behavior of count_distinct does not account for nulls. Either the documentation or the implementation is wrong here... !image-2024-03-14-16-12-35-267.png! was: The documentation states that in group by and count statements, null values will not be ignored / form their own groups. !image-2024-03-14-16-09-13-065.png|width=757,height=138! !image-2024-03-14-16-09-20-045.png|width=441,height=327! However, the behavior of count_distinct does not account for nulls. Either the documentation or the implementation is wrong here... !image-2024-03-14-16-11-37-714.png! > count_distinct ignores null values > -- > > Key: SPARK-47397 > URL: https://issues.apache.org/jira/browse/SPARK-47397 > Project: Spark > Issue Type: Bug > Components: Documentation, Spark Core >Affects Versions: 3.4.1 >Reporter: Martin Rueckl >Priority: Critical > Attachments: image-2024-03-14-16-12-35-267.png > > > The documentation states that in group by and count statements, null values > will not be ignored / form their own groups. > !image-2024-03-14-16-09-20-045.png|width=441,height=327! > However, the behavior of count_distinct does not account for nulls. > Either the documentation or the implementation is wrong here... > !image-2024-03-14-16-12-35-267.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47397) count_distinct ignores null values
Martin Rueckl created SPARK-47397: - Summary: count_distinct ignores null values Key: SPARK-47397 URL: https://issues.apache.org/jira/browse/SPARK-47397 Project: Spark Issue Type: Bug Components: Documentation, Spark Core Affects Versions: 3.4.1 Reporter: Martin Rueckl Attachments: image-2024-03-14-16-12-35-267.png The documentation states that in group by and count statements, null values will not be ignored / form their own groups. !image-2024-03-14-16-09-13-065.png|width=757,height=138! !image-2024-03-14-16-09-20-045.png|width=441,height=327! However, the behavior of count_distinct does not account for nulls. Either the documentation or the implementation is wrong here... !image-2024-03-14-16-11-37-714.png! -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47397) count_distinct ignores null values
[ https://issues.apache.org/jira/browse/SPARK-47397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Rueckl updated SPARK-47397: -- Attachment: image-2024-03-14-16-12-35-267.png > count_distinct ignores null values > -- > > Key: SPARK-47397 > URL: https://issues.apache.org/jira/browse/SPARK-47397 > Project: Spark > Issue Type: Bug > Components: Documentation, Spark Core >Affects Versions: 3.4.1 >Reporter: Martin Rueckl >Priority: Critical > Attachments: image-2024-03-14-16-12-35-267.png > > > The documentation states that in group by and count statements, null values > will not be ignored / form their own groups. > !image-2024-03-14-16-09-13-065.png|width=757,height=138! > !image-2024-03-14-16-09-20-045.png|width=441,height=327! > However, the behavior of count_distinct does not account for nulls. > Either the documentation or the implementation is wrong here... > !image-2024-03-14-16-11-37-714.png! > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47391) Remove the test case workaround for JDK 8
[ https://issues.apache.org/jira/browse/SPARK-47391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47391. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45514 [https://github.com/apache/spark/pull/45514] > Remove the test case workaround for JDK 8 > - > > Key: SPARK-47391 > URL: https://issues.apache.org/jira/browse/SPARK-47391 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 4.0.0 >Reporter: Jiaan Geng >Assignee: Jiaan Geng >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Spark SQL test case in ExpressionEncoderSuite fails on the Windows operating > system. > {code:java} > Internal error (java.io.FileNotFoundException): > D:\Users\gja\git-forks\spark\sql\catalyst\target\scala-2.13\test-classes\org\apache\spark\sql\catalyst\encoders\ExpressionEncoderSuite$OuterLevelWithVeryVeryVeryLongClassName1$OuterLevelWithVeryVeryVeryLongClassName2$OuterLevelWithVeryVeryVeryLongClassName3$OuterLevelWithVeryVeryVeryLongClassName4$OuterLevelWithVeryVeryVeryLongClassName5$OuterLevelWithVeryVeryVeryLongClassName6$.class > (文件名、目录名或卷标语法不正确。) > java.io.FileNotFoundException: > D:\Users\gja\git-forks\spark\sql\catalyst\target\scala-2.13\test-classes\org\apache\spark\sql\catalyst\encoders\ExpressionEncoderSuite$OuterLevelWithVeryVeryVeryLongClassName1$OuterLevelWithVeryVeryVeryLongClassName2$OuterLevelWithVeryVeryVeryLongClassName3$OuterLevelWithVeryVeryVeryLongClassName4$OuterLevelWithVeryVeryVeryLongClassName5$OuterLevelWithVeryVeryVeryLongClassName6$.class > (文件名、目录名或卷标语法不正确。) > at java.base/java.io.FileInputStream.open0(Native Method) > at java.base/java.io.FileInputStream.open(FileInputStream.java:216) > at java.base/java.io.FileInputStream.<init>(FileInputStream.java:157) > at > com.intellij.openapi.util.io.FileUtil.loadFileBytes(FileUtil.java:211) > at > org.jetbrains.jps.incremental.scala.local.LazyCompiledClass.$anonfun$getContent$1(LazyCompiledClass.scala:18) > 
at scala.Option.getOrElse(Option.scala:201) > at > org.jetbrains.jps.incremental.scala.local.LazyCompiledClass.getContent(LazyCompiledClass.scala:17) > at > org.jetbrains.jps.incremental.instrumentation.BaseInstrumentingBuilder.performBuild(BaseInstrumentingBuilder.java:38) > at > org.jetbrains.jps.incremental.instrumentation.ClassProcessingBuilder.build(ClassProcessingBuilder.java:80) > at > org.jetbrains.jps.incremental.IncProjectBuilder.runModuleLevelBuilders(IncProjectBuilder.java:1569) > at > org.jetbrains.jps.incremental.IncProjectBuilder.runBuildersForChunk(IncProjectBuilder.java:1198) > at > org.jetbrains.jps.incremental.IncProjectBuilder.buildTargetsChunk(IncProjectBuilder.java:1349) > at > org.jetbrains.jps.incremental.IncProjectBuilder.buildChunkIfAffected(IncProjectBuilder.java:1163) > at > org.jetbrains.jps.incremental.IncProjectBuilder$BuildParallelizer$1.run(IncProjectBuilder.java:1129) > at > com.intellij.util.concurrency.BoundedTaskExecutor.doRun(BoundedTaskExecutor.java:244) > at > com.intellij.util.concurrency.BoundedTaskExecutor.access$200(BoundedTaskExecutor.java:30) > at > com.intellij.util.concurrency.BoundedTaskExecutor$1.executeFirstTaskAndHelpQueue(BoundedTaskExecutor.java:222) > at > com.intellij.util.ConcurrencyUtil.runUnderThreadName(ConcurrencyUtil.java:218) > at > com.intellij.util.concurrency.BoundedTaskExecutor$1.run(BoundedTaskExecutor.java:210) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) > at java.base/java.lang.Thread.run(Thread.java:842) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47394) Support TIMESTAMP WITH TIME ZONE for H2Dialect
[ https://issues.apache.org/jira/browse/SPARK-47394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-47394: - Assignee: Kent Yao > Support TIMESTAMP WITH TIME ZONE for H2Dialect > -- > > Key: SPARK-47394 > URL: https://issues.apache.org/jira/browse/SPARK-47394 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47394) Support TIMESTAMP WITH TIME ZONE for H2Dialect
[ https://issues.apache.org/jira/browse/SPARK-47394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47394. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45516 [https://github.com/apache/spark/pull/45516] > Support TIMESTAMP WITH TIME ZONE for H2Dialect > -- > > Key: SPARK-47394 > URL: https://issues.apache.org/jira/browse/SPARK-47394 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47208) Allow overriding base overhead memory
[ https://issues.apache.org/jira/browse/SPARK-47208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reassigned SPARK-47208: - Assignee: Joao Correia > Allow overriding base overhead memory > - > > Key: SPARK-47208 > URL: https://issues.apache.org/jira/browse/SPARK-47208 > Project: Spark > Issue Type: New Feature > Components: Kubernetes, Spark Core, YARN >Affects Versions: 3.5.1 >Reporter: Joao Correia >Assignee: Joao Correia >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > We can already select the desired overhead memory directly via the > _'spark.driver/executor.memoryOverhead'_ flags; however, if that flag is not > present, the overhead memory calculation goes as follows: > {code:java} > overhead_memory = Max(384, 'spark.driver/executor.memory' * > 'spark.driver/executor.memoryOverheadFactor') > where the 'memoryOverheadFactor' flag defaults to 0.1{code} > There are certain times when being able to override the 384 MiB minimum > directly can be beneficial. We may have a scenario where a lot of off-heap > operations are performed (e.g. using package managers/native > compression/decompression) where we don't need a large JVM heap > but may still need a significant amount of memory in the Spark node. > Using the '{_}memoryOverheadFactor{_}' flag may not prove appropriate, since > we may not want the overhead allocation to scale directly with JVM memory, > for cost-saving/resource-limitation reasons. > As such, I propose the addition of a > 'spark.driver/executor.minMemoryOverhead' flag, which can be used to override > the 384 MiB value used in the overhead calculation. 
> The memory overhead calculation will then be: > {code:java} > min_memory = > sparkConf.get('spark.driver/executor.minMemoryOverhead').getOrElse(384) > overhead_memory = Max(min_memory, 'spark.driver/executor.memory' * > 'spark.driver/executor.memoryOverheadFactor'){code} > PR: https://github.com/apache/spark/pull/45240 > > > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
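The proposed calculation can be sketched in plain Python (a sketch only, not Spark source): the dict stands in for SparkConf, the config names follow the proposal, and the 384 MiB default mirrors the currently hard-coded minimum.

```python
def memory_overhead_mib(conf, executor_memory_mib):
    # Proposed rule: max(configurable floor, heap * overhead factor).
    min_overhead = conf.get("spark.executor.minMemoryOverhead", 384)
    factor = conf.get("spark.executor.memoryOverheadFactor", 0.1)
    return max(min_overhead, int(executor_memory_mib * factor))

# Current behavior: a small heap still pays the fixed 384 MiB floor.
print(memory_overhead_mib({}, 1024))  # 384
# With the proposed flag, the floor itself becomes configurable.
print(memory_overhead_mib({"spark.executor.minMemoryOverhead": 1024}, 1024))  # 1024
```

This shows why the new flag is orthogonal to `memoryOverheadFactor`: it raises the floor without making overhead scale with heap size.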
[jira] [Resolved] (SPARK-47388) Pass messageParameters by name to require()
[ https://issues.apache.org/jira/browse/SPARK-47388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-47388. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45511 [https://github.com/apache/spark/pull/45511] > Pass messageParameters by name to require() > --- > > Key: SPARK-47388 > URL: https://issues.apache.org/jira/browse/SPARK-47388 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Passing *messageParameters* by value, independently of whether the requirement > holds, might introduce a perf regression. We need to pass *messageParameters* by > name to avoid eager instantiation. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47208) Allow overriding base overhead memory
[ https://issues.apache.org/jira/browse/SPARK-47208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-47208. --- Fix Version/s: 4.0.0 Resolution: Fixed > Allow overriding base overhead memory > - > > Key: SPARK-47208 > URL: https://issues.apache.org/jira/browse/SPARK-47208 > Project: Spark > Issue Type: New Feature > Components: Kubernetes, Spark Core, YARN >Affects Versions: 3.5.1 >Reporter: Joao Correia >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > We can already select the desired overhead memory directly via the > _'spark.driver/executor.memoryOverhead'_ flags; however, if that flag is not > present, the overhead memory calculation goes as follows: > {code:java} > overhead_memory = Max(384, 'spark.driver/executor.memory' * > 'spark.driver/executor.memoryOverheadFactor') > where the 'memoryOverheadFactor' flag defaults to 0.1{code} > There are certain times when being able to override the 384 MiB minimum > directly can be beneficial. We may have a scenario where a lot of off-heap > operations are performed (e.g. using package managers/native > compression/decompression) where we don't need a large JVM heap > but may still need a significant amount of memory in the Spark node. > Using the '{_}memoryOverheadFactor{_}' flag may not prove appropriate, since > we may not want the overhead allocation to scale directly with JVM memory, > for cost-saving/resource-limitation reasons. > As such, I propose the addition of a > 'spark.driver/executor.minMemoryOverhead' flag, which can be used to override > the 384 MiB value used in the overhead calculation. 
> The memory overhead calculation will then be: > {code:java} > min_memory = > sparkConf.get('spark.driver/executor.minMemoryOverhead').getOrElse(384) > overhead_memory = Max(min_memory, 'spark.driver/executor.memory' * > 'spark.driver/executor.memoryOverheadFactor'){code} > PR: https://github.com/apache/spark/pull/45240 > > > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47388) Pass messageParameters by name to require()
[ https://issues.apache.org/jira/browse/SPARK-47388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47388: --- Labels: pull-request-available (was: ) > Pass messageParameters by name to require() > --- > > Key: SPARK-47388 > URL: https://issues.apache.org/jira/browse/SPARK-47388 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Labels: pull-request-available > > Passing *messageParameters* by value, independently of whether the requirement > holds, might introduce a perf regression. We need to pass *messageParameters* by > name to avoid eager instantiation. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-47336) Provide to PySpark a functionality to get estimated size of DataFrame in bytes
[ https://issues.apache.org/jira/browse/SPARK-47336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827078#comment-17827078 ] Semyon Sinchenko commented on SPARK-47336: -- [~grundprinzip-db] what do you think about `DataFrame.approximate_size_in_bytes() -> float` (or `DataFrame.approximateSizeInBytes() -> float`)? Or, for example, `DataFrame.approx_size_bytes()` to avoid very long names? P.S. I would like to try to implement it; could you assign it to me? > Provide to PySpark a functionality to get estimated size of DataFrame in bytes > -- > > Key: SPARK-47336 > URL: https://issues.apache.org/jira/browse/SPARK-47336 > Project: Spark > Issue Type: New Feature > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Semyon Sinchenko >Priority: Minor > > Something equivalent to > sessionState().executePlan(...).optimizedPlan().stats().sizeInBytes() in > JVM-Spark. It may be done via a simple call of `_jsparkSession` in regular > PySpark and via a plugin for Spark Connect. > > This functionality is useful when one needs to check the possibility of a > broadcast join without modifying the global broadcast threshold. > > The function in the PySpark API may look like: > `DataFrame.estimate_size_in_bytes() -> float` or > `DataFrame.estimateSizeInBytes() -> float`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-47336) Provide to PySpark a functionality to get estimated size of DataFrame in bytes
[ https://issues.apache.org/jira/browse/SPARK-47336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827077#comment-17827077 ] Martin Grund commented on SPARK-47336: -- I think the general idea is great! I would like to propose changing the name to reflect that this is most likely a size estimation, though. > Provide to PySpark a functionality to get estimated size of DataFrame in bytes > -- > > Key: SPARK-47336 > URL: https://issues.apache.org/jira/browse/SPARK-47336 > Project: Spark > Issue Type: New Feature > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Semyon Sinchenko >Priority: Minor > > Something equivalent to > sessionState().executePlan(...).optimizedPlan().stats().sizeInBytes() in > JVM-Spark. It may be done via a simple call of `_jsparkSession` in regular > PySpark and via a plugin for Spark Connect. > > This functionality is useful when one needs to check the possibility of a > broadcast join without modifying the global broadcast threshold. > > The function in the PySpark API may look like: > `DataFrame.estimate_size_in_bytes() -> float` or > `DataFrame.estimateSizeInBytes() -> float`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
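The broadcast-join use case from the ticket amounts to comparing an estimated size against `spark.sql.autoBroadcastJoinThreshold` instead of changing the threshold. A minimal sketch of that decision, assuming the estimate comes from the proposed (hypothetical) API wrapping `sessionState().executePlan(...).optimizedPlan().stats().sizeInBytes()`; the 10 MiB default matches Spark's documented default threshold.

```python
def can_broadcast(estimated_size_bytes, threshold_bytes=10 * 1024 * 1024):
    # True when the estimated plan size fits under the (unchanged)
    # autoBroadcastJoinThreshold, so a broadcast hint is reasonable.
    return estimated_size_bytes <= threshold_bytes

print(can_broadcast(8 * 1024 * 1024))   # True  -> safe to add a broadcast hint
print(can_broadcast(64 * 1024 * 1024))  # False -> leave the join strategy alone
```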
[jira] [Updated] (SPARK-47379) Improve docker jdbc suite test reliability
[ https://issues.apache.org/jira/browse/SPARK-47379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47379: --- Labels: pull-request-available (was: ) > Improve docker jdbc suite test reliability > -- > > Key: SPARK-47379 > URL: https://issues.apache.org/jira/browse/SPARK-47379 > Project: Spark > Issue Type: Test > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Milan Stefanovic >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47390) PostgresDialect distinguishes TIMESTAMP from TIMESTAMP_TZ
[ https://issues.apache.org/jira/browse/SPARK-47390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-47390: Assignee: Kent Yao > PostgresDialect distinguishes TIMESTAMP from TIMESTAMP_TZ > - > > Key: SPARK-47390 > URL: https://issues.apache.org/jira/browse/SPARK-47390 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47390) PostgresDialect distinguishes TIMESTAMP from TIMESTAMP_TZ
[ https://issues.apache.org/jira/browse/SPARK-47390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-47390. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45513 [https://github.com/apache/spark/pull/45513] > PostgresDialect distinguishes TIMESTAMP from TIMESTAMP_TZ > - > > Key: SPARK-47390 > URL: https://issues.apache.org/jira/browse/SPARK-47390 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47395) Add collate and collation to non-sql APIs
Stefan Kandic created SPARK-47395: - Summary: Add collate and collation to non-sql APIs Key: SPARK-47395 URL: https://issues.apache.org/jira/browse/SPARK-47395 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Stefan Kandic -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47394) Support TIMESTAMP WITH TIME ZONE for H2Dialect
[ https://issues.apache.org/jira/browse/SPARK-47394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47394: --- Labels: pull-request-available (was: ) > Support TIMESTAMP WITH TIME ZONE for H2Dialect > -- > > Key: SPARK-47394 > URL: https://issues.apache.org/jira/browse/SPARK-47394 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47394) Support TIMESTAMP WITH TIME ZONE for H2Dialect
Kent Yao created SPARK-47394: Summary: Support TIMESTAMP WITH TIME ZONE for H2Dialect Key: SPARK-47394 URL: https://issues.apache.org/jira/browse/SPARK-47394 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47393) Collation info should be exposed through system view
Aleksandar Tomic created SPARK-47393: Summary: Collation info should be exposed through system view Key: SPARK-47393 URL: https://issues.apache.org/jira/browse/SPARK-47393 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 4.0.0 Reporter: Aleksandar Tomic -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47392) Compiler stats should respect collation
Aleksandar Tomic created SPARK-47392: Summary: Compiler stats should respect collation Key: SPARK-47392 URL: https://issues.apache.org/jira/browse/SPARK-47392 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 4.0.0 Reporter: Aleksandar Tomic -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47391) Remove the test case workaround for JDK 8
[ https://issues.apache.org/jira/browse/SPARK-47391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47391: --- Labels: pull-request-available (was: ) > Remove the test case workaround for JDK 8 > - > > Key: SPARK-47391 > URL: https://issues.apache.org/jira/browse/SPARK-47391 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 4.0.0 >Reporter: Jiaan Geng >Assignee: Jiaan Geng >Priority: Major > Labels: pull-request-available > > Spark SQL test case in ExpressionEncoderSuite fails on the Windows operating > system. > {code:java} > Internal error (java.io.FileNotFoundException): > D:\Users\gja\git-forks\spark\sql\catalyst\target\scala-2.13\test-classes\org\apache\spark\sql\catalyst\encoders\ExpressionEncoderSuite$OuterLevelWithVeryVeryVeryLongClassName1$OuterLevelWithVeryVeryVeryLongClassName2$OuterLevelWithVeryVeryVeryLongClassName3$OuterLevelWithVeryVeryVeryLongClassName4$OuterLevelWithVeryVeryVeryLongClassName5$OuterLevelWithVeryVeryVeryLongClassName6$.class > (文件名、目录名或卷标语法不正确。) > java.io.FileNotFoundException: > D:\Users\gja\git-forks\spark\sql\catalyst\target\scala-2.13\test-classes\org\apache\spark\sql\catalyst\encoders\ExpressionEncoderSuite$OuterLevelWithVeryVeryVeryLongClassName1$OuterLevelWithVeryVeryVeryLongClassName2$OuterLevelWithVeryVeryVeryLongClassName3$OuterLevelWithVeryVeryVeryLongClassName4$OuterLevelWithVeryVeryVeryLongClassName5$OuterLevelWithVeryVeryVeryLongClassName6$.class > (文件名、目录名或卷标语法不正确。) > at java.base/java.io.FileInputStream.open0(Native Method) > at java.base/java.io.FileInputStream.open(FileInputStream.java:216) > at java.base/java.io.FileInputStream.<init>(FileInputStream.java:157) > at > com.intellij.openapi.util.io.FileUtil.loadFileBytes(FileUtil.java:211) > at > org.jetbrains.jps.incremental.scala.local.LazyCompiledClass.$anonfun$getContent$1(LazyCompiledClass.scala:18) > at scala.Option.getOrElse(Option.scala:201) > at > 
org.jetbrains.jps.incremental.scala.local.LazyCompiledClass.getContent(LazyCompiledClass.scala:17) > at > org.jetbrains.jps.incremental.instrumentation.BaseInstrumentingBuilder.performBuild(BaseInstrumentingBuilder.java:38) > at > org.jetbrains.jps.incremental.instrumentation.ClassProcessingBuilder.build(ClassProcessingBuilder.java:80) > at > org.jetbrains.jps.incremental.IncProjectBuilder.runModuleLevelBuilders(IncProjectBuilder.java:1569) > at > org.jetbrains.jps.incremental.IncProjectBuilder.runBuildersForChunk(IncProjectBuilder.java:1198) > at > org.jetbrains.jps.incremental.IncProjectBuilder.buildTargetsChunk(IncProjectBuilder.java:1349) > at > org.jetbrains.jps.incremental.IncProjectBuilder.buildChunkIfAffected(IncProjectBuilder.java:1163) > at > org.jetbrains.jps.incremental.IncProjectBuilder$BuildParallelizer$1.run(IncProjectBuilder.java:1129) > at > com.intellij.util.concurrency.BoundedTaskExecutor.doRun(BoundedTaskExecutor.java:244) > at > com.intellij.util.concurrency.BoundedTaskExecutor.access$200(BoundedTaskExecutor.java:30) > at > com.intellij.util.concurrency.BoundedTaskExecutor$1.executeFirstTaskAndHelpQueue(BoundedTaskExecutor.java:222) > at > com.intellij.util.ConcurrencyUtil.runUnderThreadName(ConcurrencyUtil.java:218) > at > com.intellij.util.concurrency.BoundedTaskExecutor$1.run(BoundedTaskExecutor.java:210) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) > at java.base/java.lang.Thread.run(Thread.java:842) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47391) Remove the test case workaround for JDK 8
Jiaan Geng created SPARK-47391: -- Summary: Remove the test case workaround for JDK 8 Key: SPARK-47391 URL: https://issues.apache.org/jira/browse/SPARK-47391 Project: Spark Issue Type: Test Components: SQL Affects Versions: 4.0.0 Reporter: Jiaan Geng Assignee: Jiaan Geng Spark SQL test case in ExpressionEncoderSuite fails on the Windows operating system. {code:java} Internal error (java.io.FileNotFoundException): D:\Users\gja\git-forks\spark\sql\catalyst\target\scala-2.13\test-classes\org\apache\spark\sql\catalyst\encoders\ExpressionEncoderSuite$OuterLevelWithVeryVeryVeryLongClassName1$OuterLevelWithVeryVeryVeryLongClassName2$OuterLevelWithVeryVeryVeryLongClassName3$OuterLevelWithVeryVeryVeryLongClassName4$OuterLevelWithVeryVeryVeryLongClassName5$OuterLevelWithVeryVeryVeryLongClassName6$.class (文件名、目录名或卷标语法不正确。) java.io.FileNotFoundException: D:\Users\gja\git-forks\spark\sql\catalyst\target\scala-2.13\test-classes\org\apache\spark\sql\catalyst\encoders\ExpressionEncoderSuite$OuterLevelWithVeryVeryVeryLongClassName1$OuterLevelWithVeryVeryVeryLongClassName2$OuterLevelWithVeryVeryVeryLongClassName3$OuterLevelWithVeryVeryVeryLongClassName4$OuterLevelWithVeryVeryVeryLongClassName5$OuterLevelWithVeryVeryVeryLongClassName6$.class (文件名、目录名或卷标语法不正确。) at java.base/java.io.FileInputStream.open0(Native Method) at java.base/java.io.FileInputStream.open(FileInputStream.java:216) at java.base/java.io.FileInputStream.<init>(FileInputStream.java:157) at com.intellij.openapi.util.io.FileUtil.loadFileBytes(FileUtil.java:211) at org.jetbrains.jps.incremental.scala.local.LazyCompiledClass.$anonfun$getContent$1(LazyCompiledClass.scala:18) at scala.Option.getOrElse(Option.scala:201) at org.jetbrains.jps.incremental.scala.local.LazyCompiledClass.getContent(LazyCompiledClass.scala:17) at org.jetbrains.jps.incremental.instrumentation.BaseInstrumentingBuilder.performBuild(BaseInstrumentingBuilder.java:38) at 
org.jetbrains.jps.incremental.instrumentation.ClassProcessingBuilder.build(ClassProcessingBuilder.java:80)
	at org.jetbrains.jps.incremental.IncProjectBuilder.runModuleLevelBuilders(IncProjectBuilder.java:1569)
	at org.jetbrains.jps.incremental.IncProjectBuilder.runBuildersForChunk(IncProjectBuilder.java:1198)
	at org.jetbrains.jps.incremental.IncProjectBuilder.buildTargetsChunk(IncProjectBuilder.java:1349)
	at org.jetbrains.jps.incremental.IncProjectBuilder.buildChunkIfAffected(IncProjectBuilder.java:1163)
	at org.jetbrains.jps.incremental.IncProjectBuilder$BuildParallelizer$1.run(IncProjectBuilder.java:1129)
	at com.intellij.util.concurrency.BoundedTaskExecutor.doRun(BoundedTaskExecutor.java:244)
	at com.intellij.util.concurrency.BoundedTaskExecutor.access$200(BoundedTaskExecutor.java:30)
	at com.intellij.util.concurrency.BoundedTaskExecutor$1.executeFirstTaskAndHelpQueue(BoundedTaskExecutor.java:222)
	at com.intellij.util.ConcurrencyUtil.runUnderThreadName(ConcurrencyUtil.java:218)
	at com.intellij.util.concurrency.BoundedTaskExecutor$1.run(BoundedTaskExecutor.java:210)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:842)
{code}
[jira] [Updated] (SPARK-47390) PostgresDialect distinguishes TIMESTAMP from TIMESTAMP_TZ
[ https://issues.apache.org/jira/browse/SPARK-47390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-47390:
-----------------------------------
    Labels: pull-request-available  (was: )

> PostgresDialect distinguishes TIMESTAMP from TIMESTAMP_TZ
>
>                 Key: SPARK-47390
>                 URL: https://issues.apache.org/jira/browse/SPARK-47390
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Kent Yao
>            Priority: Major
>              Labels: pull-request-available
[jira] [Created] (SPARK-47390) PostgresDialect distinguishes TIMESTAMP from TIMESTAMP_TZ
Kent Yao created SPARK-47390:
--------------------------------

             Summary: PostgresDialect distinguishes TIMESTAMP from TIMESTAMP_TZ
                 Key: SPARK-47390
                 URL: https://issues.apache.org/jira/browse/SPARK-47390
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Kent Yao
[jira] [Created] (SPARK-47389) spark jdbc one insert with multiple values
melin created SPARK-47389:
-----------------------------

             Summary: spark jdbc one insert with multiple values
                 Key: SPARK-47389
                 URL: https://issues.apache.org/jira/browse/SPARK-47389
             Project: Spark
          Issue Type: New Feature
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: melin


Many databases support writing multiple rows of data with a single INSERT statement. This is more efficient than batch-executing many single-row INSERT statements.
https://github.com/apache/spark/blob/9986462811f160eacd766da8a4e14a9cbb4b8710/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L725

Example:
{code:java}
INSERT INTO Customers (Name, Age, Active) VALUES ('Name1', 21, 1);
INSERT INTO Customers (Name, Age, Active) VALUES ('Name2', 21, 1);

-- vs.

INSERT INTO Customers (Name, Age, Active) VALUES ('Name1', 21, 1), ('Name2', 21, 1);
{code}
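To make the request concrete, a multi-row INSERT can be built as a single parameterized statement whose VALUES clause repeats one placeholder tuple per row. The sketch below is a hypothetical helper, not Spark's actual JdbcUtils code; the method name `buildMultiRowInsert` is an illustration only:

```java
import java.util.Collections;

public class MultiRowInsert {
    // Hypothetical sketch: build one parameterized INSERT covering `rows` rows.
    // For 2 rows of 3 columns this yields:
    //   INSERT INTO Customers (Name,Age,Active) VALUES (?,?,?),(?,?,?)
    // which a driver can execute once instead of batching many single-row statements.
    static String buildMultiRowInsert(String table, String[] cols, int rows) {
        // One "(?,...,?)" tuple with a placeholder per column.
        String tuple = "(" + String.join(",", Collections.nCopies(cols.length, "?")) + ")";
        // Repeat the tuple once per row in the VALUES clause.
        return "INSERT INTO " + table + " (" + String.join(",", cols)
                + ") VALUES " + String.join(",", Collections.nCopies(rows, tuple));
    }

    public static void main(String[] args) {
        String sql = buildMultiRowInsert("Customers", new String[]{"Name", "Age", "Active"}, 2);
        System.out.println(sql);
        // -> INSERT INTO Customers (Name,Age,Active) VALUES (?,?,?),(?,?,?)
    }
}
```

The resulting string would be passed to a single `PreparedStatement`, with parameters bound at index `rowIndex * cols.length + colIndex + 1`.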
[jira] [Created] (SPARK-47388) Pass messageParameters by name to require()
Max Gekk created SPARK-47388:
--------------------------------

             Summary: Pass messageParameters by name to require()
                 Key: SPARK-47388
                 URL: https://issues.apache.org/jira/browse/SPARK-47388
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: Max Gekk
            Assignee: Max Gekk


Passing *messageParameters* by value evaluates them even when the requirement holds, which might introduce a performance regression. Passing *messageParameters* by name avoids that eager instantiation: the parameters are only constructed when the requirement fails.
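In Scala the fix is declaring the parameter as `messageParameters: => Map[String, String]`; the closest Java analogue is wrapping the map construction in a `Supplier`. The sketch below is a hypothetical `require` helper illustrating the lazy-evaluation idea, not Spark's actual API; the counter shows the message map is built only on the failure path:

```java
import java.util.Map;
import java.util.function.Supplier;

public class LazyRequire {
    static int messageBuilds = 0;  // counts how often the error map is actually constructed

    // Hypothetical helper: the Supplier is invoked only when the requirement fails,
    // mimicking a Scala by-name parameter (messageParameters: => Map[String, String]).
    static void require(boolean requirement, Supplier<Map<String, String>> messageParameters) {
        if (!requirement) {
            throw new IllegalArgumentException("requirement failed: " + messageParameters.get());
        }
    }

    public static void main(String[] args) {
        // Happy path: the map is never constructed, so no allocation cost.
        require(1 + 1 == 2, () -> { messageBuilds++; return Map.of("expected", "2"); });
        System.out.println("builds after success: " + messageBuilds);

        // Failure path: the map is built exactly once, for the exception message.
        try {
            require(false, () -> { messageBuilds++; return Map.of("expected", "true"); });
        } catch (IllegalArgumentException e) {
            System.out.println("builds after failure: " + messageBuilds);
        }
    }
}
```

With a by-value (eager) parameter, both calls would have paid the map-construction cost; here only the failing call does.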
[jira] [Resolved] (SPARK-47385) Tuple encoder produces wrong results with Option inputs
[ https://issues.apache.org/jira/browse/SPARK-47385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-47385.
---------------------------------
    Fix Version/s: 3.4.3
                   3.5.2
                   4.0.0
       Resolution: Fixed

Issue resolved by pull request 45508
[https://github.com/apache/spark/pull/45508]

> Tuple encoder produces wrong results with Option inputs
>
>                 Key: SPARK-47385
>                 URL: https://issues.apache.org/jira/browse/SPARK-47385
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.4
>            Reporter: Chenhao Li
>            Assignee: Chenhao Li
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.4.3, 3.5.2, 4.0.0
>
> The behavior of tupled encoders on the Option type was changed by
> https://github.com/apache/spark/pull/40755.
> {code:java}
> import org.apache.spark.sql.{Encoders, Encoder}
> case class Required(name: String)
> case class Optional(name: String)
> implicit val enc: Encoder[(Required, Option[Optional])] =
>   Encoders.tuple(Encoders.product[Required], Encoders.product[Option[Optional]])
>
> spark.createDataFrame(Seq(
>   (Required("1"), Some(Optional("1"))),
>   (Required("2"), None)
> )).as[(Required, Option[Optional])].collect(){code}
> Before the PR, the result is:
> {code:java}
> Array((Required(1),Some(Optional(1))), (Required(2),None)){code}
> After the PR, the result is:
> {code:java}
> Array((Required(1),Some(Optional(1))), (Required(2),null)){code}
> which is incorrect because the original input is None rather than null.
[jira] [Assigned] (SPARK-47385) Tuple encoder produces wrong results with Option inputs
[ https://issues.apache.org/jira/browse/SPARK-47385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-47385:
-----------------------------------
    Assignee: Chenhao Li

> Tuple encoder produces wrong results with Option inputs
>
>                 Key: SPARK-47385
>                 URL: https://issues.apache.org/jira/browse/SPARK-47385
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.4
>            Reporter: Chenhao Li
>            Assignee: Chenhao Li
>            Priority: Major
>              Labels: pull-request-available
[jira] [Updated] (SPARK-47387) Remove some unused error classes
[ https://issues.apache.org/jira/browse/SPARK-47387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-47387:
-----------------------------------
    Labels: pull-request-available  (was: )

> Remove some unused error classes
>
>                 Key: SPARK-47387
>                 URL: https://issues.apache.org/jira/browse/SPARK-47387
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: BingKun Pan
>            Priority: Minor
>              Labels: pull-request-available
[jira] [Resolved] (SPARK-47374) Fix connect-repl `usage prompt` & `docs link`
[ https://issues.apache.org/jira/browse/SPARK-47374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-47374.
----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 45494
[https://github.com/apache/spark/pull/45494]

> Fix connect-repl `usage prompt` & `docs link`
>
>                 Key: SPARK-47374
>                 URL: https://issues.apache.org/jira/browse/SPARK-47374
>             Project: Spark
>          Issue Type: Bug
>          Components: Connect, Documentation
>    Affects Versions: 4.0.0
>            Reporter: BingKun Pan
>            Assignee: BingKun Pan
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 4.0.0
[jira] [Assigned] (SPARK-47374) Fix connect-repl `usage prompt` & `docs link`
[ https://issues.apache.org/jira/browse/SPARK-47374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-47374:
------------------------------------
    Assignee: BingKun Pan

> Fix connect-repl `usage prompt` & `docs link`
>
>                 Key: SPARK-47374
>                 URL: https://issues.apache.org/jira/browse/SPARK-47374
>             Project: Spark
>          Issue Type: Bug
>          Components: Connect, Documentation
>    Affects Versions: 4.0.0
>            Reporter: BingKun Pan
>            Assignee: BingKun Pan
>            Priority: Minor
>              Labels: pull-request-available
[jira] [Assigned] (SPARK-47377) Factor out tests from `SparkConnectSQLTestCase`
[ https://issues.apache.org/jira/browse/SPARK-47377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-47377:
------------------------------------
    Assignee: Ruifeng Zheng

> Factor out tests from `SparkConnectSQLTestCase`
>
>                 Key: SPARK-47377
>                 URL: https://issues.apache.org/jira/browse/SPARK-47377
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect, Tests
>    Affects Versions: 4.0.0
>            Reporter: Ruifeng Zheng
>            Assignee: Ruifeng Zheng
>            Priority: Major
>              Labels: pull-request-available
[jira] [Resolved] (SPARK-47377) Factor out tests from `SparkConnectSQLTestCase`
[ https://issues.apache.org/jira/browse/SPARK-47377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-47377.
----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 45497
[https://github.com/apache/spark/pull/45497]

> Factor out tests from `SparkConnectSQLTestCase`
>
>                 Key: SPARK-47377
>                 URL: https://issues.apache.org/jira/browse/SPARK-47377
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect, Tests
>    Affects Versions: 4.0.0
>            Reporter: Ruifeng Zheng
>            Assignee: Ruifeng Zheng
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
[jira] [Created] (SPARK-47387) Remove some unused error classes
BingKun Pan created SPARK-47387:
-----------------------------------

             Summary: Remove some unused error classes
                 Key: SPARK-47387
                 URL: https://issues.apache.org/jira/browse/SPARK-47387
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 4.0.0
            Reporter: BingKun Pan