[jira] [Resolved] (SPARK-47234) Upgrade Scala to 2.13.13

2024-03-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47234.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45342
[https://github.com/apache/spark/pull/45342]

> Upgrade Scala to 2.13.13
> 
>
> Key: SPARK-47234
> URL: https://issues.apache.org/jira/browse/SPARK-47234
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Updated] (SPARK-47406) Handle TIMESTAMP and DATETIME in MySQLDialect

2024-03-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47406:
---
Labels: pull-request-available  (was: )

> Handle TIMESTAMP and DATETIME in MySQLDialect 
> --
>
> Key: SPARK-47406
> URL: https://issues.apache.org/jira/browse/SPARK-47406
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-47406) Handle TIMESTAMP and DATETIME in MySQLDialect

2024-03-14 Thread Kent Yao (Jira)
Kent Yao created SPARK-47406:


 Summary: Handle TIMESTAMP and DATETIME in MySQLDialect 
 Key: SPARK-47406
 URL: https://issues.apache.org/jira/browse/SPARK-47406
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Kent Yao
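
The issue has no description yet, but the distinction at stake is that MySQL's 
DATETIME stores wall-clock values with no time zone, while TIMESTAMP is 
adjusted through the session time zone, so DATETIME maps naturally to 
TimestampNTZType. A rough, hypothetical sketch (not the actual patch) of the 
kind of mapping a JDBC dialect can express:

{code:java}
import java.sql.Types

import org.apache.spark.sql.jdbc.JdbcDialect
import org.apache.spark.sql.types.{DataType, MetadataBuilder, TimestampNTZType}

// Hypothetical sketch only: map MySQL DATETIME (zone-less) to TimestampNTZType
// and let TIMESTAMP fall through to the default zoned TimestampType mapping.
object MySQLDialectSketch extends JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.toLowerCase.startsWith("jdbc:mysql")

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
    // MySQL's driver reports DATETIME with java.sql.Types.TIMESTAMP; the
    // type name is what distinguishes it from a true TIMESTAMP column.
    if (sqlType == Types.TIMESTAMP && "DATETIME".equalsIgnoreCase(typeName)) {
      Some(TimestampNTZType)
    } else {
      None // defer to the default mappings
    }
  }
}
{code}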









[jira] [Updated] (SPARK-47405) Remove `JLine 2` dependency

2024-03-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47405:
---
Labels: pull-request-available  (was: )

> Remove `JLine 2` dependency 
> 
>
> Key: SPARK-47405
> URL: https://issues.apache.org/jira/browse/SPARK-47405
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-47405) Remove `JLine 2` dependency

2024-03-14 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-47405:
-

 Summary: Remove `JLine 2` dependency 
 Key: SPARK-47405
 URL: https://issues.apache.org/jira/browse/SPARK-47405
 Project: Spark
  Issue Type: Sub-task
  Components: Build
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun









[jira] [Resolved] (SPARK-47366) Implement parse_json

2024-03-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-47366.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45479
[https://github.com/apache/spark/pull/45479]

> Implement parse_json
> 
>
> Key: SPARK-47366
> URL: https://issues.apache.org/jira/browse/SPARK-47366
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Chenhao Li
>Assignee: Chenhao Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-47366) Implement parse_json

2024-03-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-47366:
---

Assignee: Chenhao Li

> Implement parse_json
> 
>
> Key: SPARK-47366
> URL: https://issues.apache.org/jira/browse/SPARK-47366
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Chenhao Li
>Assignee: Chenhao Li
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Updated] (SPARK-47404) Add hooks to release the ANTLR DFA cache after parsing SQL

2024-03-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47404:
---
Labels: pull-request-available  (was: )

> Add hooks to release the ANTLR DFA cache after parsing SQL
> --
>
> Key: SPARK-47404
> URL: https://issues.apache.org/jira/browse/SPARK-47404
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Mark Jarvin
>Priority: Major
>  Labels: pull-request-available
>
> ANTLR builds a DFA cache while parsing to speed up parsing of similar future 
> inputs. However, this cache is never cleared and can only grow. Extremely 
> large SQL inputs can lead to very large DFA caches (>20GiB in one extreme 
> case I've seen).
> Spark’s ANTLR SQL parser is derived from the Presto ANTLR SQL Parser, and 
> Presto has added hooks to be able to clear this DFA cache. I think Spark 
> should have similar hooks.
> References:
>  * 
> [https://github.com/antlr/antlr4/blob/f08a19bbb202b02a521f84d99e661e386bea8625/runtime/Java/src/org/antlr/v4/runtime/atn/ParserATNSimulator.java#L163-L171]
>  * 
> [https://stackoverflow.com/questions/28017135/why-antlr4-parsers-accumulates-atnconfig-objects?rq=2]
>  * [https://github.com/antlr/antlr4/issues/499]
>  * 
> [https://github.com/trinodb/trino/pull/3186/files#diff-75b81ed5837578d1af42fcc91e4094a247138e5da6edb9d9e4b67d53247b8ca9]
>  






[jira] [Created] (SPARK-47404) Add hooks to release the ANTLR DFA cache after parsing SQL

2024-03-14 Thread Mark Jarvin (Jira)
Mark Jarvin created SPARK-47404:
---

 Summary: Add hooks to release the ANTLR DFA cache after parsing SQL
 Key: SPARK-47404
 URL: https://issues.apache.org/jira/browse/SPARK-47404
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 4.0.0
Reporter: Mark Jarvin


ANTLR builds a DFA cache while parsing to speed up parsing of similar future 
inputs. However, this cache is never cleared and can only grow. Extremely large 
SQL inputs can lead to very large DFA caches (>20GiB in one extreme case I've 
seen).

Spark’s ANTLR SQL parser is derived from the Presto ANTLR SQL Parser, and 
Presto has added hooks to be able to clear this DFA cache. I think Spark should 
have similar hooks.

References:
 * 
[https://github.com/antlr/antlr4/blob/f08a19bbb202b02a521f84d99e661e386bea8625/runtime/Java/src/org/antlr/v4/runtime/atn/ParserATNSimulator.java#L163-L171]

 * 
[https://stackoverflow.com/questions/28017135/why-antlr4-parsers-accumulates-atnconfig-objects?rq=2]

 * [https://github.com/antlr/antlr4/issues/499]

 * 
[https://github.com/trinodb/trino/pull/3186/files#diff-75b81ed5837578d1af42fcc91e4094a247138e5da6edb9d9e4b67d53247b8ca9]
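
A rough sketch of what such a hook could look like, using ANTLR's public 
ATNSimulator.clearDFA() API (the same mechanism the Trino change above relies 
on). It assumes Spark's generated SqlBaseLexer/SqlBaseParser classes are on 
the classpath; the helper name and entry rule are illustrative, not the 
proposed implementation:

{code:java}
import org.antlr.v4.runtime.{CharStreams, CommonTokenStream}

import org.apache.spark.sql.catalyst.parser.{SqlBaseLexer, SqlBaseParser}

// Illustrative hook: parse one statement, then drop the DFA states that
// ANTLR accumulated, so a pathological input cannot pin the cache forever.
object DfaCacheHook {
  def parseAndRelease(sql: String): Unit = {
    val lexer = new SqlBaseLexer(CharStreams.fromString(sql))
    val parser = new SqlBaseParser(new CommonTokenStream(lexer))
    try {
      parser.singleStatement() // any entry rule works for the illustration
    } finally {
      // clearDFA() resets the simulators' shared DFA cache.
      parser.getInterpreter.clearDFA()
      lexer.getInterpreter.clearDFA()
    }
  }
}
{code}
A production hook would presumably clear only above a size threshold or on an 
explicit call, since unconditional clearing gives up the warm-up benefit the 
cache exists for.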

 






[jira] [Updated] (SPARK-45376) Add netty-tcnative-boringssl-static dependency

2024-03-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45376:
--
Summary: Add netty-tcnative-boringssl-static dependency  (was: [CORE] Add 
netty-tcnative-boringssl-static dependency)

> Add netty-tcnative-boringssl-static dependency
> --
>
> Key: SPARK-45376
> URL: https://issues.apache.org/jira/browse/SPARK-45376
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Add the boringssl dependency, which is needed for SSL functionality to work, 
> and provide the network-common test helper to other test modules that need 
> to test SSL functionality.






[jira] [Closed] (SPARK-47342) Support TimestampNTZ for DB2 TIMESTAMP WITH TIME ZONE

2024-03-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-47342.
-

>  Support TimestampNTZ for DB2 TIMESTAMP WITH TIME ZONE
> --
>
> Key: SPARK-47342
> URL: https://issues.apache.org/jira/browse/SPARK-47342
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>







[jira] (SPARK-47342) Support TimestampNTZ for DB2 TIMESTAMP WITH TIME ZONE

2024-03-14 Thread Dongjoon Hyun (Jira)


[ https://issues.apache.org/jira/browse/SPARK-47342 ]


Dongjoon Hyun deleted comment on SPARK-47342:
---

was (Author: dongjoon):
Issue resolved by pull request 45471
[https://github.com/apache/spark/pull/45471]

>  Support TimestampNTZ for DB2 TIMESTAMP WITH TIME ZONE
> --
>
> Key: SPARK-47342
> URL: https://issues.apache.org/jira/browse/SPARK-47342
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Commented] (SPARK-47342) Support TimestampNTZ for DB2 TIMESTAMP WITH TIME ZONE

2024-03-14 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827258#comment-17827258
 ] 

Dongjoon Hyun commented on SPARK-47342:
---

Thank you for providing the context.

>  Support TimestampNTZ for DB2 TIMESTAMP WITH TIME ZONE
> --
>
> Key: SPARK-47342
> URL: https://issues.apache.org/jira/browse/SPARK-47342
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-47402) Upgrade `ZooKeeper` to 3.9.2

2024-03-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47402.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45524
[https://github.com/apache/spark/pull/45524]

> Upgrade `ZooKeeper` to 3.9.2
> 
>
> Key: SPARK-47402
> URL: https://issues.apache.org/jira/browse/SPARK-47402
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Updated] (SPARK-46305) Remove the special ZooKeeper version in the `streaming-kafka-0-10` and `sql-kafka-0-10` modules

2024-03-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-46305:
--
Parent: SPARK-47046
Issue Type: Sub-task  (was: Improvement)

>  Remove the special ZooKeeper version in the `streaming-kafka-0-10` and 
> `sql-kafka-0-10` modules
> 
>
> Key: SPARK-46305
> URL: https://issues.apache.org/jira/browse/SPARK-46305
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Commented] (SPARK-39420) Support ANALYZE TABLE on v2 tables

2024-03-14 Thread Felipe (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827234#comment-17827234
 ] 

Felipe commented on SPARK-39420:


Hi. The PR [https://github.com/apache/spark/pull/4] was closed without being 
merged.

Does anyone have an update on it? Is there any chance it will be implemented?

> Support ANALYZE TABLE on v2 tables
> --
>
> Key: SPARK-39420
> URL: https://issues.apache.org/jira/browse/SPARK-39420
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0, 3.5.1, 3.3.4
>Reporter: Felipe
>Priority: Major
>  Labels: pull-request-available
>
> According to https://github.com/delta-io/delta/pull/840, implementing ANALYZE 
> TABLE in Delta requires adding the missing APIs in Spark that allow a data 
> source to report its file set for computing the stats.
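
For context, the command in question and what it collects, as a minimal 
illustration (table and column names are hypothetical):

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("analyze-demo").master("local[*]").getOrCreate()

// Table-level statistics (row count, size in bytes) for the optimizer.
spark.sql("ANALYZE TABLE events COMPUTE STATISTICS")

// Column-level statistics (distinct count, min/max, null count) for CBO.
spark.sql("ANALYZE TABLE events COMPUTE STATISTICS FOR COLUMNS user_id")
{code}
On a v1 table this works today; for a v2 source such as Delta there is no API 
through which the source can report its files, which is the gap the issue asks 
Spark to fill.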






[jira] [Updated] (SPARK-39420) Support ANALYZE TABLE on v2 tables

2024-03-14 Thread Felipe (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felipe updated SPARK-39420:
---
Affects Version/s: 3.3.4
   3.5.1

> Support ANALYZE TABLE on v2 tables
> --
>
> Key: SPARK-39420
> URL: https://issues.apache.org/jira/browse/SPARK-39420
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0, 3.5.1, 3.3.4
>Reporter: Felipe
>Priority: Major
>  Labels: pull-request-available
>
> According to https://github.com/delta-io/delta/pull/840, implementing ANALYZE 
> TABLE in Delta requires adding the missing APIs in Spark that allow a data 
> source to report its file set for computing the stats.






[jira] [Resolved] (SPARK-47396) Add a general mapping for TIME WITHOUT TIME ZONE to TimestampNTZType

2024-03-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47396.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45519
[https://github.com/apache/spark/pull/45519]

> Add a general mapping for TIME WITHOUT TIME ZONE to TimestampNTZType
> 
>
> Key: SPARK-47396
> URL: https://issues.apache.org/jira/browse/SPARK-47396
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Commented] (SPARK-47398) AQE doesn't allow for extension of InMemoryTableScanExec

2024-03-14 Thread Raza Jafri (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827206#comment-17827206
 ] 

Raza Jafri commented on SPARK-47398:


In `AdaptiveSparkPlanExec` we wrap `InMemoryTableScanExec` in 
`TableCacheQueryStageExec`. To accomplish this we currently match on the 
concrete Exec; I am proposing that we match on a trait instead, just as we do 
for `Exchange` by matching against `ShuffleExchangeLike` and 
`BroadcastExchangeLike`. In the RAPIDS Accelerator for Apache Spark, we 
replace `InMemoryTableScanExec` with our own version, which does some 
optimizations. This could cause a problem: the benefits of SPARK-42101 might 
be lost, or in the worst case we try to look for the said Exec and throw an 
exception.

Looking at the current code, I propose the trait be as follows:
{code:java}
trait InMemoryTableScanLike extends LeafExecNode {  
  /**
   * Returns whether the cache buffer is loaded
   */
  def isMaterialized: Boolean  

  /**
   * Returns the actual cached RDD without filters and serialization of 
row/columnar.
   */
  def baseCacheRDD(): RDD[CachedBatch]  

  /**
   * Returns the runtime statistics after shuffle materialization.  
   */
  def runtimeStatistics: Statistics
} {code}
This is just based on what I know about how AQE is using it. 
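
To make the extensibility point concrete, here is a toy, self-contained 
illustration (none of it is real Spark code, and GpuInMemoryTableScanExec is a 
hypothetical stand-in for a plugin's replacement scan):

{code:java}
// Toy model: matching on a trait keeps working when a plugin swaps the scan.
object TraitMatchDemo extends App {
  trait InMemoryTableScanLike { def isMaterialized: Boolean }

  // Stand-in for Spark's built-in scan, mixing in the trait...
  class InMemoryTableScanExec extends InMemoryTableScanLike {
    def isMaterialized: Boolean = true
  }
  // ...and a hypothetical plugin replacement that also mixes it in.
  class GpuInMemoryTableScanExec extends InMemoryTableScanLike {
    def isMaterialized: Boolean = false
  }

  // AQE-style wrapping: a match on the trait catches both scans, whereas a
  // match on the concrete class would miss the plugin's replacement.
  def wrap(plan: AnyRef): String = plan match {
    case s: InMemoryTableScanLike => s"TableCacheQueryStage(materialized=${s.isMaterialized})"
    case other => s"unwrapped: $other"
  }

  println(wrap(new InMemoryTableScanExec))    // wrapped
  println(wrap(new GpuInMemoryTableScanExec)) // still wrapped after the swap
}
{code}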

> AQE doesn't allow for extension of InMemoryTableScanExec
> 
>
> Key: SPARK-47398
> URL: https://issues.apache.org/jira/browse/SPARK-47398
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0, 3.5.1
>Reporter: Raza Jafri
>Priority: Major
>  Labels: pull-request-available
>
> As part of SPARK-42101, we added support to AQE for handling 
> InMemoryTableScanExec. 
> This change directly references `InMemoryTableScanExec` which limits users 
> from extending the caching functionality that was added as part of 
> SPARK-32274 






[jira] [Updated] (SPARK-47398) AQE doesn't allow for extension of InMemoryTableScanExec

2024-03-14 Thread Raza Jafri (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raza Jafri updated SPARK-47398:
---
Description: 
As part of SPARK-42101, we added support to AQE for handling 
InMemoryTableScanExec. 

This change directly references `InMemoryTableScanExec` which limits users from 
extending the caching functionality that was added as part of SPARK-32274 

  was:
As part of SPARK-42101, we added support to AQE for handling 
InMemoryTableScanExec. 

This change directly references `InMemoryTableScanExec` which limits users from 
extending the caching functionality that was added as part of SPARK-32274 

In `AdaptiveSparkPlanExec` we are wrapping `InMemoryTableScanExec` in 
`TableCacheQueryStageExec`. To accomplish this we are currently matching on the 
Exec, I am proposing that we should match on a trait instead just like how we 
do it for `Exchange` by matching against `ShuffleExchangeLike` and 
`BroadcastExchangeLike`. In the RAPIDS Accelerator for Apache Spark, we replace 
the `InMemoryTableScanExec` with our version which does some optimizations. 
This could cause a problem as the benefits of SPARK-42101 might be lost or the 
worst case could be that we try to look for the said Exec and throw an 
exception 

 

Looking at the current code, I propose the trait to be as 
{code:java}
trait InMemoryTableScanLike extends LeafExecNode {  
  /**
   * Returns whether the cache buffer is loaded
   */
  def isMaterialized: Boolean  

  /**
   * Returns the actual cached RDD without filters and serialization of 
row/columnar.
   */
  def baseCacheRDD(): RDD[CachedBatch]  

  /**
   * Returns the runtime statistics after shuffle materialization.  
   */
  def runtimeStatistics: Statistics
} {code}
This is just based on what I know about how AQE is using it. 


> AQE doesn't allow for extension of InMemoryTableScanExec
> 
>
> Key: SPARK-47398
> URL: https://issues.apache.org/jira/browse/SPARK-47398
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0, 3.5.1
>Reporter: Raza Jafri
>Priority: Major
>  Labels: pull-request-available
>
> As part of SPARK-42101, we added support to AQE for handling 
> InMemoryTableScanExec. 
> This change directly references `InMemoryTableScanExec` which limits users 
> from extending the caching functionality that was added as part of 
> SPARK-32274 






[jira] [Updated] (SPARK-47398) AQE doesn't allow for extension of InMemoryTableScanExec

2024-03-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47398:
---
Labels: pull-request-available  (was: )

> AQE doesn't allow for extension of InMemoryTableScanExec
> 
>
> Key: SPARK-47398
> URL: https://issues.apache.org/jira/browse/SPARK-47398
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0, 3.5.1
>Reporter: Raza Jafri
>Priority: Major
>  Labels: pull-request-available
>
> As part of SPARK-42101, we added support to AQE for handling 
> InMemoryTableScanExec. 
> This change directly references `InMemoryTableScanExec` which limits users 
> from extending the caching functionality that was added as part of 
> SPARK-32274 
> In `AdaptiveSparkPlanExec` we wrap `InMemoryTableScanExec` in 
> `TableCacheQueryStageExec`. To accomplish this we currently match on the 
> concrete Exec; I am proposing that we match on a trait instead, just as we 
> do for `Exchange` by matching against `ShuffleExchangeLike` and 
> `BroadcastExchangeLike`. In the RAPIDS Accelerator for Apache Spark, we 
> replace `InMemoryTableScanExec` with our own version, which does some 
> optimizations. This could cause a problem: the benefits of SPARK-42101 
> might be lost, or in the worst case we try to look for the said Exec and 
> throw an exception.
>  
> Looking at the current code, I propose the trait be as follows:
> {code:java}
> trait InMemoryTableScanLike extends LeafExecNode {  
>   /**
>    * Returns whether the cache buffer is loaded
>    */
>   def isMaterialized: Boolean  
>   /**
>    * Returns the actual cached RDD without filters and serialization of 
> row/columnar.
>    */
>   def baseCacheRDD(): RDD[CachedBatch]  
>   /**
>    * Returns the runtime statistics after shuffle materialization.  
>    */
>   def runtimeStatistics: Statistics
> } {code}
> This is just based on what I know about how AQE is using it. 






[jira] [Updated] (SPARK-47398) AQE doesn't allow for extension of InMemoryTableScanExec

2024-03-14 Thread Raza Jafri (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raza Jafri updated SPARK-47398:
---
Description: 
As part of SPARK-42101, we added support to AQE for handling 
InMemoryTableScanExec. 

This change directly references `InMemoryTableScanExec` which limits users from 
extending the caching functionality that was added as part of SPARK-32274 

In `AdaptiveSparkPlanExec` we wrap `InMemoryTableScanExec` in 
`TableCacheQueryStageExec`. To accomplish this we currently match on the 
concrete Exec; I am proposing that we match on a trait instead, just as we do 
for `Exchange` by matching against `ShuffleExchangeLike` and 
`BroadcastExchangeLike`. In the RAPIDS Accelerator for Apache Spark, we 
replace `InMemoryTableScanExec` with our own version, which does some 
optimizations. This could cause a problem: the benefits of SPARK-42101 might 
be lost, or in the worst case we try to look for the said Exec and throw an 
exception.

Looking at the current code, I propose the trait be as follows:
{code:java}
trait InMemoryTableScanLike extends LeafExecNode {  
  /**
   * Returns whether the cache buffer is loaded
   */
  def isMaterialized: Boolean  

  /**
   * Returns the actual cached RDD without filters and serialization of 
row/columnar.
   */
  def baseCacheRDD(): RDD[CachedBatch]  

  /**
   * Returns the runtime statistics after shuffle materialization.  
   */
  def runtimeStatistics: Statistics
} {code}
This is just based on what I know about how AQE is using it. 

  was:
As part of SPARK-42101, we added support to AQE for handling 
InMemoryTableScanExec. 

This change directly references `InMemoryTableScanExec` which limits users from 
extending the caching functionality that was added as part of SPARK-32274 

In `AdaptiveSparkPlanExec` we are wrapping `InMemoryTableScanExec` in 
`TableCacheQueryStageExec`. To accomplish this we are currently matching on the 
Exec, I am proposing that we should match on a trait instead just like how we 
do it for `Exchange` by matching against `ShuffleExchangeLike` and 
`BroadcastExchangeLike`. In the RAPIDS Accelerator for Apache Spark, we replace 
the `InMemoryTableScanExec` with our version which does some optimizations. 
This could cause a problem as the benefits of SPARK-42101 might be lost or 
worst case could be that we try to look for the 

 

Looking at the current code, I propose the trait to be as 
{code:java}
trait InMemoryTableScanLike extends LeafExecNode {  
  /**
   * Returns whether the cache buffer is loaded
   */
  def isMaterialized: Boolean  

  /**
   * Returns the actual cached RDD without filters and serialization of 
row/columnar.
   */
  def baseCacheRDD(): RDD[CachedBatch]  

  /**
   * Returns the runtime statistics after shuffle materialization.  
   */
  def runtimeStatistics: Statistics
} {code}
This is just based on what I know about how AQE is using it. 


> AQE doesn't allow for extension of InMemoryTableScanExec
> 
>
> Key: SPARK-47398
> URL: https://issues.apache.org/jira/browse/SPARK-47398
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0, 3.5.1
>Reporter: Raza Jafri
>Priority: Major
>
> As part of SPARK-42101, we added support to AQE for handling 
> InMemoryTableScanExec. 
> This change directly references `InMemoryTableScanExec` which limits users 
> from extending the caching functionality that was added as part of 
> SPARK-32274 
> In `AdaptiveSparkPlanExec` we wrap `InMemoryTableScanExec` in 
> `TableCacheQueryStageExec`. To accomplish this we currently match on the 
> concrete Exec; I am proposing that we match on a trait instead, just as we 
> do for `Exchange` by matching against `ShuffleExchangeLike` and 
> `BroadcastExchangeLike`. In the RAPIDS Accelerator for Apache Spark, we 
> replace `InMemoryTableScanExec` with our own version, which does some 
> optimizations. This could cause a problem: the benefits of SPARK-42101 
> might be lost, or in the worst case we try to look for the said Exec and 
> throw an exception.
>  
> Looking at the current code, I propose the trait be as follows:
> {code:java}
> trait InMemoryTableScanLike extends LeafExecNode {  
>   /**
>    * Returns whether the cache buffer is loaded
>    */
>   def isMaterialized: Boolean  
>   /**
>    * Returns the actual cached RDD without filters and serialization of 
> row/columnar.
>    */
>   def baseCacheRDD(): RDD[CachedBatch]  
>   /**
>    * Returns the runtime statistics after shuffle materialization.  
>    */
>   def runtimeStatistics: Statistics
> } {code}
> This is just based on what I know about how AQE is using it. 





[jira] [Updated] (SPARK-47398) AQE doesn't allow for extension of InMemoryTableScanExec

2024-03-14 Thread Raza Jafri (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raza Jafri updated SPARK-47398:
---
Description: 
As part of SPARK-42101, we added support to AQE for handling 
InMemoryTableScanExec. 

This change directly references `InMemoryTableScanExec` which limits users from 
extending the caching functionality that was added as part of SPARK-32274 

In `AdaptiveSparkPlanExec` we are wrapping `InMemoryTableScanExec` in 
`TableCacheQueryStageExec`. To accomplish this we are currently matching on the 
Exec, I am proposing that we should match on a trait instead just like how we 
do it for `Exchange` by matching against `ShuffleExchangeLike` and 
`BroadcastExchangeLike`. In the RAPIDS Accelerator for Apache Spark, we replace 
the `InMemoryTableScanExec` with our version which does some optimizations. 
This could cause a problem as the benefits of SPARK-42101 might be lost or 
worst case could be that we try to look for the 

 

Looking at the current code, I propose the trait to be as 
{code:java}
trait InMemoryTableScanLike extends LeafExecNode {  
  /**
   * Returns whether the cache buffer is loaded
   */
  def isMaterialized: Boolean  

  /**
   * Returns the actual cached RDD without filters and serialization of 
row/columnar.
   */
  def baseCacheRDD(): RDD[CachedBatch]  

  /**
   * Returns the runtime statistics after shuffle materialization.  
   */
  def runtimeStatistics: Statistics
} {code}
This is just based on what I know about how AQE is using it. 

  was:
As part of SPARK-42101, we added support to AQE for handling 
InMemoryTableScanExec. 

This change directly references `InMemoryTableScanExec` which limits users from 
extending the caching functionality that was added as part of SPARK-32274 

In `AdaptiveSparkPlanExec` we are wrapping `InMemoryTableScanExec` in 
`TableCacheQueryStageExec`. To accomplish this we are currently matching on the 
Exec, I am proposing that we should match on a trait instead just like how we 
do it for `Exchange` by matching against `ShuffleExchangeLike` and 
`BroadcastExchangeLike`. In the RAPIDS Accelerator for Apache Spark, we replace 
the `InMemoryTableScanExec` with our version which does some optimizations. 
This could cause a problem as the benefits of SPARK-42101, might be lost 

 

Looking at the current code, I propose the trait to be as 
{code:java}
trait InMemoryTableScanLike extends LeafExecNode {  
  /**
   * Returns whether the cache buffer is loaded
   */
  def isMaterialized: Boolean  

  /**
   * Returns the actual cached RDD without filters and serialization of 
row/columnar.
   */
  def baseCacheRDD(): RDD[CachedBatch]  

  /**
   * Returns the runtime statistics after shuffle materialization.  
   */
  def runtimeStatistics: Statistics
} {code}
This is just based on what I know about how AQE is using it. 


> AQE doesn't allow for extension of InMemoryTableScanExec
> 
>
> Key: SPARK-47398
> URL: https://issues.apache.org/jira/browse/SPARK-47398
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0, 3.5.1
>Reporter: Raza Jafri
>Priority: Major
>
> As part of SPARK-42101, we added support to AQE for handling 
> InMemoryTableScanExec. 
> This change directly references `InMemoryTableScanExec` which limits users 
> from extending the caching functionality that was added as part of 
> SPARK-32274 
> In `AdaptiveSparkPlanExec` we are wrapping `InMemoryTableScanExec` in 
> `TableCacheQueryStageExec`. To accomplish this we are currently matching on 
> the Exec, I am proposing that we should match on a trait instead just like 
> how we do it for `Exchange` by matching against `ShuffleExchangeLike` and 
> `BroadcastExchangeLike`. In the RAPIDS Accelerator for Apache Spark, we 
> replace the `InMemoryTableScanExec` with our version which does some 
> optimizations. This could cause a problem as the benefits of SPARK-42101 
> might be lost or worst case could be that we try to look for the 
>  
> Looking at the current code, I propose the trait to be as 
> {code:java}
> trait InMemoryTableScanLike extends LeafExecNode {  
>   /**
>    * Returns whether the cache buffer is loaded
>    */
>   def isMaterialized: Boolean  
>   /**
>    * Returns the actual cached RDD without filters and serialization of 
> row/columnar.
>    */
>   def baseCacheRDD(): RDD[CachedBatch]  
>   /**
>    * Returns the runtime statistics after shuffle materialization.  
>    */
>   def runtimeStatistics: Statistics
> } {code}
> This is just based on what I know about how AQE is using it. 




[jira] [Updated] (SPARK-47398) AQE doesn't allow for extension of InMemoryTableScanExec

2024-03-14 Thread Raza Jafri (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raza Jafri updated SPARK-47398:
---
Description: 
As part of SPARK-42101, we added support to AQE for handling 
InMemoryTableScanExec. 

This change directly references `InMemoryTableScanExec` which limits users from 
extending the caching functionality that was added as part of SPARK-32274 

In `AdaptiveSparkPlanExec` we wrap `InMemoryTableScanExec` in 
`TableCacheQueryStageExec`. To accomplish this we currently match on the 
concrete Exec; I am proposing that we match on a trait instead, just as we do 
for `Exchange` by matching against `ShuffleExchangeLike` and 
`BroadcastExchangeLike`. In the RAPIDS Accelerator for Apache Spark, we 
replace `InMemoryTableScanExec` with our own version, which does some 
optimizations. This could cause a problem, as the benefits of SPARK-42101 
might be lost.

Looking at the current code, I propose the trait be as follows:
{code:java}
trait InMemoryTableScanLike extends LeafExecNode {  
  /**
   * Returns whether the cache buffer is loaded
   */
  def isMaterialized: Boolean  

  /**
   * Returns the actual cached RDD without filters and serialization of 
row/columnar.
   */
  def baseCacheRDD(): RDD[CachedBatch]  

  /**
   * Returns the runtime statistics after shuffle materialization.  
   */
  def runtimeStatistics: Statistics
} {code}
This is just based on what I know about how AQE is using it. 

  was:
As part of SPARK-42101 we added support to AQE for handling 
InMemoryTableScanExec. 

This change directly references `InMemoryTableScanExec` which limits users from 
extending the caching functionality that was added as part of SPARK-32274 

In `AdaptiveSparkPlanExec` we are wrapping `InMemoryTableScanExec` in 
`TableCacheQueryStageExec`. To accomplish this we are currently matching on the 
Exec, I am proposing that we should match on a trait instead just like how we 
do it for `Exchange` by matching against `ShuffleExchangeLike` and 
`BroadcastExchangeLike`. 

 

Looking at the current code, I propose the trait to be as 
{code:java}
trait InMemoryTableScanLike extends LeafExecNode {  
  /**
   * Returns whether the cache buffer is loaded
   */
  def isMaterialized: Boolean  

  /**
   * Returns the actual cached RDD without filters and serialization of 
row/columnar.
   */
  def baseCacheRDD(): RDD[CachedBatch]  

  /**
   * Returns the runtime statistics after shuffle materialization.  
   */
  def runtimeStatistics: Statistics
} {code}
This is just based on what I know about how AQE is using it. 


> AQE doesn't allow for extension of InMemoryTableScanExec
> 
>
> Key: SPARK-47398
> URL: https://issues.apache.org/jira/browse/SPARK-47398
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0, 3.5.1
>Reporter: Raza Jafri
>Priority: Major
>
> As part of SPARK-42101, we added support to AQE for handling 
> InMemoryTableScanExec. 
> This change directly references `InMemoryTableScanExec` which limits users 
> from extending the caching functionality that was added as part of 
> SPARK-32274 
> In `AdaptiveSparkPlanExec` we wrap `InMemoryTableScanExec` in 
> `TableCacheQueryStageExec`. To accomplish this we currently match on the 
> concrete Exec; I am proposing that we match on a trait instead, just as we 
> do for `Exchange` by matching against `ShuffleExchangeLike` and 
> `BroadcastExchangeLike`. In the RAPIDS Accelerator for Apache Spark, we 
> replace `InMemoryTableScanExec` with our own version, which does some 
> optimizations. This could cause a problem, as the benefits of SPARK-42101 
> might be lost.
>  
> Looking at the current code, I propose the trait be as follows:
> {code:java}
> trait InMemoryTableScanLike extends LeafExecNode {  
>   /**
>    * Returns whether the cache buffer is loaded
>    */
>   def isMaterialized: Boolean  
>   /**
>    * Returns the actual cached RDD without filters and serialization of 
> row/columnar.
>    */
>   def baseCacheRDD(): RDD[CachedBatch]  
>   /**
>    * Returns the runtime statistics after shuffle materialization.  
>    */
>   def runtimeStatistics: Statistics
> } {code}
> This is just based on what I know about how AQE is using it. 






[jira] [Resolved] (SPARK-47387) Remove some unused error classes

2024-03-14 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-47387.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45509
[https://github.com/apache/spark/pull/45509]

> Remove some unused error classes
> 
>
> Key: SPARK-47387
> URL: https://issues.apache.org/jira/browse/SPARK-47387
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-47387) Remove some unused error classes

2024-03-14 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-47387:


Assignee: BingKun Pan

> Remove some unused error classes
> 
>
> Key: SPARK-47387
> URL: https://issues.apache.org/jira/browse/SPARK-47387
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>







[jira] [Updated] (SPARK-47398) AQE doesn't allow for extension of InMemoryTableScanExec

2024-03-14 Thread Raza Jafri (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raza Jafri updated SPARK-47398:
---
Description: 
As part of SPARK-42101 we added support to AQE for handling 
InMemoryTableScanExec. 

This change directly references `InMemoryTableScanExec` which limits users from 
extending the caching functionality that was added as part of SPARK-32274 

In `AdaptiveSparkPlanExec` we wrap `InMemoryTableScanExec` in 
`TableCacheQueryStageExec`. To accomplish this we currently match on the 
concrete Exec; I am proposing that we match on a trait instead, just as we do 
for `Exchange` by matching against `ShuffleExchangeLike` and 
`BroadcastExchangeLike`.

Looking at the current code, I propose the trait be as follows:
{code:java}
trait InMemoryTableScanLike extends LeafExecNode {  
  /**
   * Returns whether the cache buffer is loaded
   */
  def isMaterialized: Boolean  

  /**
   * Returns the actual cached RDD without filters and serialization of 
row/columnar.
   */
  def baseCacheRDD(): RDD[CachedBatch]  

  /**
   * Returns the runtime statistics after shuffle materialization.  
   */
  def runtimeStatistics: Statistics
} {code}
This is just based on what I know about how AQE is using it. 

  was:
As part of SPARK-42101 we added support to AQE for handling 
InMemoryTableScanExec. 

This change directly references `InMemoryTableScanExec` which limits users from 
extending the caching functionality that was added as part of SPARK-32274 

In AdaptiveSparkPlanExec we are wrapping InMemoryTableScanExec in 
TableCacheQueryStageExec. To accomplish this we are currently matching on the 
Exec, I am proposing that we should match on a trait instead just like how we 
do it for Exchange by matching against ShuffleExchangeLike and 
BroadcastExchangeLike. 

 

Looking at the current code, I propose the trait to be as 
{code:java}
trait InMemoryTableScanLike extends LeafExecNode {  
  /**
   * Returns whether the cache buffer is loaded
   */
  def isMaterialized: Boolean  

  /**
   * Returns the actual cached RDD without filters and serialization of 
row/columnar.
   */
  def baseCacheRDD(): RDD[CachedBatch]  

  /**
   * Returns the runtime statistics after shuffle materialization.  
   */
  def runtimeStatistics: Statistics
} {code}
This is just based on what I know about how AQE is using it. 


> AQE doesn't allow for extension of InMemoryTableScanExec
> 
>
> Key: SPARK-47398
> URL: https://issues.apache.org/jira/browse/SPARK-47398
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0, 3.5.1
>Reporter: Raza Jafri
>Priority: Major
>
> As part of SPARK-42101 we added support to AQE for handling 
> InMemoryTableScanExec. 
> This change directly references `InMemoryTableScanExec` which limits users 
> from extending the caching functionality that was added as part of 
> SPARK-32274 
> In `AdaptiveSparkPlanExec` we wrap `InMemoryTableScanExec` in 
> `TableCacheQueryStageExec`. To accomplish this we currently match on the 
> concrete Exec; I am proposing that we match on a trait instead, just as we 
> do for `Exchange` by matching against `ShuffleExchangeLike` and 
> `BroadcastExchangeLike`.
>  
> Looking at the current code, I propose the trait be as follows:
> {code:java}
> trait InMemoryTableScanLike extends LeafExecNode {  
>   /**
>    * Returns whether the cache buffer is loaded
>    */
>   def isMaterialized: Boolean  
>   /**
>    * Returns the actual cached RDD without filters and serialization of 
> row/columnar.
>    */
>   def baseCacheRDD(): RDD[CachedBatch]  
>   /**
>    * Returns the runtime statistics after shuffle materialization.  
>    */
>   def runtimeStatistics: Statistics
> } {code}
> This is just based on what I know about how AQE is using it. 






[jira] [Updated] (SPARK-47398) AQE doesn't allow for extension of InMemoryTableScanExec

2024-03-14 Thread Raza Jafri (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raza Jafri updated SPARK-47398:
---
Description: 
As part of SPARK-42101 we added support to AQE for handling 
InMemoryTableScanExec. 

This change directly references `InMemoryTableScanExec` which limits users from 
extending the caching functionality that was added as part of SPARK-32274 

In AdaptiveSparkPlanExec we wrap InMemoryTableScanExec in 
TableCacheQueryStageExec. To accomplish this we currently match on the 
concrete Exec; I am proposing that we match on a trait instead, just as we do 
for Exchange by matching against ShuffleExchangeLike and 
BroadcastExchangeLike.

Looking at the current code, I propose the trait be as follows:
{code:java}
trait InMemoryTableScanLike extends LeafExecNode {  
  /**
   * Returns whether the cache buffer is loaded
   */
  def isMaterialized: Boolean  

  /**
   * Returns the actual cached RDD without filters and serialization of 
row/columnar.
   */
  def baseCacheRDD(): RDD[CachedBatch]  

  /**
   * Returns the runtime statistics after shuffle materialization.  
   */
  def runtimeStatistics: Statistics
} {code}
This is just based on what I know about how AQE is using it. 

  was:
As part of SPARK-42101 we added support to AQE for handling 
InMemoryTableScanExec. 

This change directly references `InMemoryTableScanExec` which limits users from 
extending the caching functionality that was added as part of SPARK-32274 

In `AdaptiveSparkPlanExec` we are wrapping `InMemoryTableScanExec` in 
`TableCacheQueryStageExec`. To accomplish this we are currently matching on the 
Exec, I am proposing that we should match on a trait instead just like how we 
do it for `Exchange` by matching against `ShuffleExchangeLike` and 
`BroadcastExchangeLike`. 

 

Looking at the current code, I propose the trait to be as 

 

```

trait InMemoryTableScanLike extends LeafExecNode {

  /**
   * Returns whether the cache buffer is loaded
   */
  def isMaterialized: Boolean

  /**
   * Returns the actual cached RDD without filters and serialization of 
row/columnar.
   */
  def baseCacheRDD(): RDD[CachedBatch]

  /**
   * Returns the runtime statistics after shuffle materialization.
   */
  def runtimeStatistics: Statistics
}

```

This is just based on what I know about how AQE is using it. 


> AQE doesn't allow for extension of InMemoryTableScanExec
> 
>
> Key: SPARK-47398
> URL: https://issues.apache.org/jira/browse/SPARK-47398
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0, 3.5.1
>Reporter: Raza Jafri
>Priority: Major
>
> As part of SPARK-42101 we added support to AQE for handling 
> InMemoryTableScanExec. 
> This change directly references `InMemoryTableScanExec` which limits users 
> from extending the caching functionality that was added as part of 
> SPARK-32274 
> In AdaptiveSparkPlanExec we wrap InMemoryTableScanExec in 
> TableCacheQueryStageExec. To accomplish this we currently match on the 
> concrete Exec; I am proposing that we match on a trait instead, just as we 
> do for Exchange by matching against ShuffleExchangeLike and 
> BroadcastExchangeLike.
>  
> Looking at the current code, I propose the trait be as follows:
> {code:java}
> trait InMemoryTableScanLike extends LeafExecNode {  
>   /**
>    * Returns whether the cache buffer is loaded
>    */
>   def isMaterialized: Boolean  
>   /**
>    * Returns the actual cached RDD without filters and serialization of 
> row/columnar.
>    */
>   def baseCacheRDD(): RDD[CachedBatch]  
>   /**
>    * Returns the runtime statistics after shuffle materialization.  
>    */
>   def runtimeStatistics: Statistics
> } {code}
> This is just based on what I know about how AQE is using it. 






[jira] [Updated] (SPARK-47398) AQE doesn't allow for extension of InMemoryTableScanExec

2024-03-14 Thread Raza Jafri (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raza Jafri updated SPARK-47398:
---
Description: 
As part of SPARK-42101 we added support to AQE for handling 
InMemoryTableScanExec. 

This change directly references `InMemoryTableScanExec` which limits users from 
extending the caching functionality that was added as part of SPARK-32274 

In `AdaptiveSparkPlanExec` we wrap `InMemoryTableScanExec` in 
`TableCacheQueryStageExec`. To accomplish this we currently match on the 
concrete Exec; I am proposing that we match on a trait instead, just as we do 
for `Exchange` by matching against `ShuffleExchangeLike` and 
`BroadcastExchangeLike`.

Looking at the current code, I propose the trait be as follows:

 

```

trait InMemoryTableScanLike extends LeafExecNode {

  /**
   * Returns whether the cache buffer is loaded
   */
  def isMaterialized: Boolean

  /**
   * Returns the actual cached RDD without filters and serialization of 
row/columnar.
   */
  def baseCacheRDD(): RDD[CachedBatch]

  /**
   * Returns the runtime statistics after shuffle materialization.
   */
  def runtimeStatistics: Statistics
}

```

This is just based on what I know about how AQE is using it. 

  was:
As part of SPARK-42101 we added support to AQE for handling 
InMemoryTableScanExec. 


This change directly references `InMemoryTableScanExec` which limits users from 
extending the caching functionality that was added as part of SPARK-32274 

 


> AQE doesn't allow for extension of InMemoryTableScanExec
> 
>
> Key: SPARK-47398
> URL: https://issues.apache.org/jira/browse/SPARK-47398
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0, 3.5.1
>Reporter: Raza Jafri
>Priority: Major
>
> As part of SPARK-42101 we added support to AQE for handling 
> InMemoryTableScanExec. 
> This change directly references `InMemoryTableScanExec` which limits users 
> from extending the caching functionality that was added as part of 
> SPARK-32274 
> In `AdaptiveSparkPlanExec` we wrap `InMemoryTableScanExec` in 
> `TableCacheQueryStageExec`. To accomplish this we currently match on the 
> concrete Exec; I am proposing that we match on a trait instead, just as we 
> do for `Exchange` by matching against `ShuffleExchangeLike` and 
> `BroadcastExchangeLike`.
>  
> Looking at the current code, I propose the trait be as follows:
>  
> ```
> trait InMemoryTableScanLike extends LeafExecNode {
>   /**
>    * Returns whether the cache buffer is loaded
>    */
>   def isMaterialized: Boolean
>   /**
>    * Returns the actual cached RDD without filters and serialization of 
> row/columnar.
>    */
>   def baseCacheRDD(): RDD[CachedBatch]
>   /**
>    * Returns the runtime statistics after shuffle materialization.
>    */
>   def runtimeStatistics: Statistics
> }
> ```
> This is just based on what I know about how AQE is using it. 






[jira] [Updated] (SPARK-47398) AQE doesn't allow for extension of InMemoryTableScanExec

2024-03-14 Thread Raza Jafri (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raza Jafri updated SPARK-47398:
---
Target Version/s:   (was: 4.0.0)

> AQE doesn't allow for extension of InMemoryTableScanExec
> 
>
> Key: SPARK-47398
> URL: https://issues.apache.org/jira/browse/SPARK-47398
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0, 3.5.1
>Reporter: Raza Jafri
>Priority: Major
>
> As part of SPARK-42101 we added support to AQE for handling 
> InMemoryTableScanExec. 
> This change directly references `InMemoryTableScanExec` which limits users 
> from extending the caching functionality that was added as part of 
> SPARK-32274 
>  






[jira] [Resolved] (SPARK-47401) Update `YuniKorn` docs with v1.5

2024-03-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47401.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45523
[https://github.com/apache/spark/pull/45523]

> Update `YuniKorn` docs with v1.5
> 
>
> Key: SPARK-47401
> URL: https://issues.apache.org/jira/browse/SPARK-47401
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Kubernetes
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-47401) Update `YuniKorn` docs with v1.5

2024-03-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-47401:
-

Assignee: Dongjoon Hyun

> Update `YuniKorn` docs with v1.5
> 
>
> Key: SPARK-47401
> URL: https://issues.apache.org/jira/browse/SPARK-47401
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Kubernetes
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Updated] (SPARK-46210) Update `YuniKorn` docs with v1.4

2024-03-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-46210:
--
Summary: Update `YuniKorn` docs with v1.4  (was: Update YuniKorn docs with 
v1.4)

> Update `YuniKorn` docs with v1.4
> 
>
> Key: SPARK-46210
> URL: https://issues.apache.org/jira/browse/SPARK-46210
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Updated] (SPARK-47401) Update `YuniKorn` docs with v1.5

2024-03-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47401:
---
Labels: pull-request-available  (was: )

> Update `YuniKorn` docs with v1.5
> 
>
> Key: SPARK-47401
> URL: https://issues.apache.org/jira/browse/SPARK-47401
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Kubernetes
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-47401) Update `YuniKorn` docs with v1.5

2024-03-14 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-47401:
-

 Summary: Update `YuniKorn` docs with v1.5
 Key: SPARK-47401
 URL: https://issues.apache.org/jira/browse/SPARK-47401
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, Kubernetes
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun









[jira] [Resolved] (SPARK-47400) Upgrade `gcs-connector` to 2.2.20

2024-03-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47400.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45521
[https://github.com/apache/spark/pull/45521]

> Upgrade `gcs-connector` to 2.2.20
> -
>
> Key: SPARK-47400
> URL: https://issues.apache.org/jira/browse/SPARK-47400
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44197) Upgrade Hadoop to 3.3.6

2024-03-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-44197:
--
Fix Version/s: 4.0.0
   (was: 3.5.0)

> Upgrade Hadoop to 3.3.6
> ---
>
> Key: SPARK-44197
> URL: https://issues.apache.org/jira/browse/SPARK-44197
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47337) Bump DB2 Docker version to 11.5.8.0

2024-03-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-47337:
--
Parent Issue: SPARK-47361  (was: SPARK-47046)

> Bump DB2 Docker version to 11.5.8.0
> ---
>
> Key: SPARK-47337
> URL: https://issues.apache.org/jira/browse/SPARK-47337
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Docker
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-47337) Bump DB2 Docker version to 11.5.8.0

2024-03-14 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827152#comment-17827152
 ] 

Dongjoon Hyun commented on SPARK-47337:
---

I changed this subtask's parent from SPARK-47046 to SPARK-47361.

> Bump DB2 Docker version to 11.5.8.0
> ---
>
> Key: SPARK-47337
> URL: https://issues.apache.org/jira/browse/SPARK-47337
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Docker
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45850) Upgrade oracle jdbc driver to 23.3.0.23.09

2024-03-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45850:
--
Parent Issue: SPARK-47361  (was: SPARK-47046)

> Upgrade oracle jdbc driver to 23.3.0.23.09 
> ---
>
> Key: SPARK-45850
> URL: https://issues.apache.org/jira/browse/SPARK-45850
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Attempt to stabilize `OracleIntegrationSuite ` by upgrading the oracle jdbc 
> driver version.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47384) Upgrade RoaringBitmap to 1.0.5

2024-03-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47384.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45507
[https://github.com/apache/spark/pull/45507]

> Upgrade RoaringBitmap to 1.0.5
> --
>
> Key: SPARK-47384
> URL: https://issues.apache.org/jira/browse/SPARK-47384
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47395) Add collate and collation to non-sql APIs

2024-03-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47395:
---
Labels: pull-request-available  (was: )

> Add collate and collation to non-sql APIs
> -
>
> Key: SPARK-47395
> URL: https://issues.apache.org/jira/browse/SPARK-47395
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Stefan Kandic
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47399) Disable generated columns on expressions with collations

2024-03-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47399:
---
Labels: pull-request-available  (was: )

> Disable generated columns on expressions with collations
> 
>
> Key: SPARK-47399
> URL: https://issues.apache.org/jira/browse/SPARK-47399
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Stefan Kandic
>Priority: Major
>  Labels: pull-request-available
>
> Changing the collation of a column or even just changing the ICU version 
> could lead to differences in the resulting expression, so it would be best 
> if we simply disable it for now.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47399) Disable generated columns on expressions with collations

2024-03-14 Thread Stefan Kandic (Jira)
Stefan Kandic created SPARK-47399:
-

 Summary: Disable generated columns on expressions with collations
 Key: SPARK-47399
 URL: https://issues.apache.org/jira/browse/SPARK-47399
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Stefan Kandic


Changing the collation of a column or even just changing the ICU version could 
lead to differences in the resulting expression, so it would be best if we 
simply disable it for now.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47398) AQE doesn't allow for extension of InMemoryTableScanExec

2024-03-14 Thread Raza Jafri (Jira)
Raza Jafri created SPARK-47398:
--

 Summary: AQE doesn't allow for extension of InMemoryTableScanExec
 Key: SPARK-47398
 URL: https://issues.apache.org/jira/browse/SPARK-47398
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.1, 3.5.0
Reporter: Raza Jafri


As part of SPARK-42101, we added support to AQE for handling 
InMemoryTableScanExec.


This change directly references `InMemoryTableScanExec`, which prevents users 
from extending the caching functionality that was added as part of SPARK-32274.

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-46876) Data is silently lost in Tab separated CSV with empty (whitespace) rows

2024-03-14 Thread Martin Rueckl (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827135#comment-17827135
 ] 

Martin Rueckl commented on SPARK-46876:
---

[~doki] any chance to make progress on this?

> Data is silently lost in Tab separated CSV with empty (whitespace) rows
> ---
>
> Key: SPARK-46876
> URL: https://issues.apache.org/jira/browse/SPARK-46876
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.4.1
>Reporter: Martin Rueckl
>Priority: Critical
>  Labels: pull-request-available
>
> When reading a tab-separated file that contains lines consisting only of tabs 
> (i.e. empty strings as the values of all columns for that row), these rows 
> will silently be skipped (as empty lines) and the resulting dataframe will 
> have fewer rows than expected.
> This behavior is inconsistent with the behavior for e.g. semicolon-separated 
> files, where the resulting dataframe will have a row with only empty string 
> values.
> A minimal reproducible example: a file containing this
> {code:java}
> a\tb\tc\r\n
> \t\t\r\n
> 1\t2\t3{code}
> will create a dataframe with one row (a=1,b=2,c=3)
> whereas this
> {code:java}
> a;b;c\r\n
> ;;\r\n
> 1;2;3{code}
> will read as two rows (first row contains empty strings)
> I used the following pyspark commands to read the dataframes:
> {code:java}
> spark.read.option("header","true").option("sep","\t").csv("<path to file>").collect()
> spark.read.option("header","true").option("sep",";").csv("<path to file>").collect()
> {code}
> I ran into this particularly on databricks (I assume they use the same 
> reader), but [this stack overflow 
> post|https://stackoverflow.com/questions/47823858/replacing-empty-lines-with-characters-when-reading-csv-using-spark#comment137288546_47823858]
>  indicates that this is an old issue that may have been taken over from 
> databricks when their csv reader was adopted in SPARK-12420.
> I recommend at least adding a test case for this to the CSV reader.
>  
> Why is this behaviour a problem:
>  * It violates some of the core assumptions
>  ** a properly configured roundtrip via csv write/read should result in the 
> same set of rows
>  ** changing the csv separator (when everything is properly escaped) should 
> have no effect
> Potential resolutions:
>  * When the configured delimiter consists of only whitespace
>  ** deactivate the "skip empty line feature"
>  ** or skip only lines that are completely empty (only a (carriage return) 
> newline)
>  * Change the skip empty line feature to only skip if the line is completely 
> empty (only contains a newline)
>  ** this may break some user code that relies on the current behaviour



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47397) count_distinct ignores null values

2024-03-14 Thread Martin Rueckl (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Rueckl updated SPARK-47397:
--
Description: 
The documentation states that, in group-by and count statements, null values 
will not be ignored but will instead form their own groups.

!image-2024-03-14-16-13-03-107.png|width=491,height=373!
However, the behavior of count_distinct does not account for nulls. 
Either the documentation or the implementation is wrong here...

!image-2024-03-14-16-12-35-267.png!

  was:
The documentation states that, in group-by and count statements, null values 
will not be ignored but will instead form their own groups.

!image-2024-03-14-16-09-20-045.png|width=441,height=327!
However, the behavior of count_distinct does not account for nulls. 
Either the documentation or the implementation is wrong here...

!image-2024-03-14-16-12-35-267.png!


> count_distinct ignores null values
> --
>
> Key: SPARK-47397
> URL: https://issues.apache.org/jira/browse/SPARK-47397
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Spark Core
>Affects Versions: 3.4.1
>Reporter: Martin Rueckl
>Priority: Critical
> Attachments: image-2024-03-14-16-12-35-267.png, 
> image-2024-03-14-16-13-03-107.png
>
>
> The documentation states that, in group-by and count statements, null values 
> will not be ignored but will instead form their own groups.
> !image-2024-03-14-16-13-03-107.png|width=491,height=373!
> However, the behavior of count_distinct does not account for nulls. 
> Either the documentation or the implementation is wrong here...
> !image-2024-03-14-16-12-35-267.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47397) count_distinct ignores null values

2024-03-14 Thread Martin Rueckl (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Rueckl updated SPARK-47397:
--
Attachment: image-2024-03-14-16-13-03-107.png

> count_distinct ignores null values
> --
>
> Key: SPARK-47397
> URL: https://issues.apache.org/jira/browse/SPARK-47397
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Spark Core
>Affects Versions: 3.4.1
>Reporter: Martin Rueckl
>Priority: Critical
> Attachments: image-2024-03-14-16-12-35-267.png, 
> image-2024-03-14-16-13-03-107.png
>
>
> The documentation states that, in group-by and count statements, null values 
> will not be ignored but will instead form their own groups.
> !image-2024-03-14-16-09-20-045.png|width=441,height=327!
> However, the behavior of count_distinct does not account for nulls. 
> Either the documentation or the implementation is wrong here...
> !image-2024-03-14-16-12-35-267.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47397) count_distinct ignores null values

2024-03-14 Thread Martin Rueckl (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Rueckl updated SPARK-47397:
--
Description: 
The documentation states that, in group-by and count statements, null values 
will not be ignored but will instead form their own groups.

!image-2024-03-14-16-09-20-045.png|width=441,height=327!
However, the behavior of count_distinct does not account for nulls. 
Either the documentation or the implementation is wrong here...

!image-2024-03-14-16-12-35-267.png!

  was:
The documentation states that, in group-by and count statements, null values 
will not be ignored but will instead form their own groups.
!image-2024-03-14-16-09-13-065.png|width=757,height=138!
!image-2024-03-14-16-09-20-045.png|width=441,height=327!
However, the behavior of count_distinct does not account for nulls. 
Either the documentation or the implementation is wrong here...

!image-2024-03-14-16-11-37-714.png!

 


> count_distinct ignores null values
> --
>
> Key: SPARK-47397
> URL: https://issues.apache.org/jira/browse/SPARK-47397
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Spark Core
>Affects Versions: 3.4.1
>Reporter: Martin Rueckl
>Priority: Critical
> Attachments: image-2024-03-14-16-12-35-267.png
>
>
> The documentation states that, in group-by and count statements, null values 
> will not be ignored but will instead form their own groups.
> !image-2024-03-14-16-09-20-045.png|width=441,height=327!
> However, the behavior of count_distinct does not account for nulls. 
> Either the documentation or the implementation is wrong here...
> !image-2024-03-14-16-12-35-267.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47397) count_distinct ignores null values

2024-03-14 Thread Martin Rueckl (Jira)
Martin Rueckl created SPARK-47397:
-

 Summary: count_distinct ignores null values
 Key: SPARK-47397
 URL: https://issues.apache.org/jira/browse/SPARK-47397
 Project: Spark
  Issue Type: Bug
  Components: Documentation, Spark Core
Affects Versions: 3.4.1
Reporter: Martin Rueckl
 Attachments: image-2024-03-14-16-12-35-267.png

The documentation states that, in group-by and count statements, null values 
will not be ignored but will instead form their own groups.
!image-2024-03-14-16-09-13-065.png|width=757,height=138!
!image-2024-03-14-16-09-20-045.png|width=441,height=327!
However, the behavior of count_distinct does not account for nulls. 
Either the documentation or the implementation is wrong here...

!image-2024-03-14-16-11-37-714.png!
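
For reference, here is a minimal sketch reproducing the reported behavior 
(assuming a local SparkSession named `spark` with its implicits imported; this 
is an illustration, not taken from the attached screenshots):

{code:java}
import org.apache.spark.sql.functions.count_distinct
import spark.implicits._

// Three rows, one of which is null.
val df = Seq(Option("a"), Option("b"), None).toDF("value")

// count_distinct skips the null row and returns 2 ...
df.select(count_distinct($"value")).show()

// ... while GROUP BY treats null as its own group and yields 3 groups.
df.groupBy("value").count().show()
{code}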

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47397) count_distinct ignores null values

2024-03-14 Thread Martin Rueckl (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Rueckl updated SPARK-47397:
--
Attachment: image-2024-03-14-16-12-35-267.png

> count_distinct ignores null values
> --
>
> Key: SPARK-47397
> URL: https://issues.apache.org/jira/browse/SPARK-47397
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Spark Core
>Affects Versions: 3.4.1
>Reporter: Martin Rueckl
>Priority: Critical
> Attachments: image-2024-03-14-16-12-35-267.png
>
>
> The documentation states that, in group-by and count statements, null values 
> will not be ignored but will instead form their own groups.
> !image-2024-03-14-16-09-13-065.png|width=757,height=138!
> !image-2024-03-14-16-09-20-045.png|width=441,height=327!
> However, the behavior of count_distinct does not account for nulls. 
> Either the documentation or the implementation is wrong here...
> !image-2024-03-14-16-11-37-714.png!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47391) Remove the test case workaround for JDK 8

2024-03-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47391.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45514
[https://github.com/apache/spark/pull/45514]

> Remove the test case workaround for JDK 8
> -
>
> Key: SPARK-47391
> URL: https://issues.apache.org/jira/browse/SPARK-47391
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> A Spark SQL test case in ExpressionEncoderSuite fails on the Windows operating 
> system.
> {code:java}
> Internal error (java.io.FileNotFoundException): 
> D:\Users\gja\git-forks\spark\sql\catalyst\target\scala-2.13\test-classes\org\apache\spark\sql\catalyst\encoders\ExpressionEncoderSuite$OuterLevelWithVeryVeryVeryLongClassName1$OuterLevelWithVeryVeryVeryLongClassName2$OuterLevelWithVeryVeryVeryLongClassName3$OuterLevelWithVeryVeryVeryLongClassName4$OuterLevelWithVeryVeryVeryLongClassName5$OuterLevelWithVeryVeryVeryLongClassName6$.class
>  (文件名、目录名或卷标语法不正确。)
> java.io.FileNotFoundException: 
> D:\Users\gja\git-forks\spark\sql\catalyst\target\scala-2.13\test-classes\org\apache\spark\sql\catalyst\encoders\ExpressionEncoderSuite$OuterLevelWithVeryVeryVeryLongClassName1$OuterLevelWithVeryVeryVeryLongClassName2$OuterLevelWithVeryVeryVeryLongClassName3$OuterLevelWithVeryVeryVeryLongClassName4$OuterLevelWithVeryVeryVeryLongClassName5$OuterLevelWithVeryVeryVeryLongClassName6$.class
>  (文件名、目录名或卷标语法不正确。)
>   at java.base/java.io.FileInputStream.open0(Native Method)
>   at java.base/java.io.FileInputStream.open(FileInputStream.java:216)
>   at java.base/java.io.FileInputStream.<init>(FileInputStream.java:157)
>   at 
> com.intellij.openapi.util.io.FileUtil.loadFileBytes(FileUtil.java:211)
>   at 
> org.jetbrains.jps.incremental.scala.local.LazyCompiledClass.$anonfun$getContent$1(LazyCompiledClass.scala:18)
>   at scala.Option.getOrElse(Option.scala:201)
>   at 
> org.jetbrains.jps.incremental.scala.local.LazyCompiledClass.getContent(LazyCompiledClass.scala:17)
>   at 
> org.jetbrains.jps.incremental.instrumentation.BaseInstrumentingBuilder.performBuild(BaseInstrumentingBuilder.java:38)
>   at 
> org.jetbrains.jps.incremental.instrumentation.ClassProcessingBuilder.build(ClassProcessingBuilder.java:80)
>   at 
> org.jetbrains.jps.incremental.IncProjectBuilder.runModuleLevelBuilders(IncProjectBuilder.java:1569)
>   at 
> org.jetbrains.jps.incremental.IncProjectBuilder.runBuildersForChunk(IncProjectBuilder.java:1198)
>   at 
> org.jetbrains.jps.incremental.IncProjectBuilder.buildTargetsChunk(IncProjectBuilder.java:1349)
>   at 
> org.jetbrains.jps.incremental.IncProjectBuilder.buildChunkIfAffected(IncProjectBuilder.java:1163)
>   at 
> org.jetbrains.jps.incremental.IncProjectBuilder$BuildParallelizer$1.run(IncProjectBuilder.java:1129)
>   at 
> com.intellij.util.concurrency.BoundedTaskExecutor.doRun(BoundedTaskExecutor.java:244)
>   at 
> com.intellij.util.concurrency.BoundedTaskExecutor.access$200(BoundedTaskExecutor.java:30)
>   at 
> com.intellij.util.concurrency.BoundedTaskExecutor$1.executeFirstTaskAndHelpQueue(BoundedTaskExecutor.java:222)
>   at 
> com.intellij.util.ConcurrencyUtil.runUnderThreadName(ConcurrencyUtil.java:218)
>   at 
> com.intellij.util.concurrency.BoundedTaskExecutor$1.run(BoundedTaskExecutor.java:210)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>   at java.base/java.lang.Thread.run(Thread.java:842)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47394) Support TIMESTAMP WITH TIME ZONE for H2Dialect

2024-03-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-47394:
-

Assignee: Kent Yao

> Support TIMESTAMP WITH TIME ZONE for H2Dialect
> --
>
> Key: SPARK-47394
> URL: https://issues.apache.org/jira/browse/SPARK-47394
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47394) Support TIMESTAMP WITH TIME ZONE for H2Dialect

2024-03-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47394.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45516
[https://github.com/apache/spark/pull/45516]

> Support TIMESTAMP WITH TIME ZONE for H2Dialect
> --
>
> Key: SPARK-47394
> URL: https://issues.apache.org/jira/browse/SPARK-47394
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47208) Allow overriding base overhead memory

2024-03-14 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reassigned SPARK-47208:
-

Assignee: Joao Correia

> Allow overriding base overhead memory
> -
>
> Key: SPARK-47208
> URL: https://issues.apache.org/jira/browse/SPARK-47208
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes, Spark Core, YARN
>Affects Versions: 3.5.1
>Reporter: Joao Correia
>Assignee: Joao Correia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> We can already set the desired overhead memory directly via the 
> _'spark.driver/executor.memoryOverhead'_ flags; however, if that flag is not 
> present, the overhead memory calculation goes as follows:
> {code:java}
> overhead_memory = Max(384, 'spark.driver/executor.memory' * 
> 'spark.driver/executor.memoryOverheadFactor')
> where the 'memoryOverheadFactor' flag defaults to 0.1{code}
> There are times when being able to override the 384 MiB minimum directly can 
> be beneficial. We may have a scenario where a lot of off-heap operations are 
> performed (e.g. using package managers or native 
> compression/decompression) where we don't need a large JVM heap but may 
> still need a significant amount of memory on the Spark node. 
> Using the '{_}memoryOverheadFactor{_}' flag may not be appropriate, since we 
> may not want the overhead allocation to scale directly with JVM memory, for 
> cost-saving or resource-limitation reasons.
> As such, I propose the addition of a 
> 'spark.driver/executor.minMemoryOverhead' flag, which can be used to override 
> the 384 MiB value used in the overhead calculation.
> The memory overhead calculation will then be:
> {code:java}
> min_memory = 
> sparkConf.get('spark.driver/executor.minMemoryOverhead').getOrElse(384)
> overhead_memory = Max(min_memory, 'spark.driver/executor.memory' * 
> 'spark.driver/executor.memoryOverheadFactor'){code}
> PR: https://github.com/apache/spark/pull/45240  
>  
>  
>  
>  
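
To make the proposal concrete, here is a small sketch of the proposed 
calculation (illustrative values only; the config name is the one proposed 
above and may change before the PR lands):

{code:java}
// An 8 GiB executor with the default overhead factor.
val executorMemoryMiB = 8192L  // spark.executor.memory = 8g
val overheadFactor    = 0.1    // spark.executor.memoryOverheadFactor (default)
val minOverheadMiB    = 1024L  // proposed spark.executor.minMemoryOverhead

// The configurable minimum replaces the hardcoded 384 MiB floor.
val overheadMiB = math.max(minOverheadMiB, (executorMemoryMiB * overheadFactor).toLong)
// overheadMiB == 1024 here, whereas the current formula gives max(384, 819) == 819.
{code}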



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47388) Pass messageParameters by name to require()

2024-03-14 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-47388.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45511
[https://github.com/apache/spark/pull/45511]

> Pass messageParameters by name to require()
> ---
>
> Key: SPARK-47388
> URL: https://issues.apache.org/jira/browse/SPARK-47388
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Passing *messageParameters* by value, regardless of whether the requirement 
> holds, might introduce a perf regression. We need to pass *messageParameters* 
> by name to avoid eager instantiation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47208) Allow overriding base overhead memory

2024-03-14 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-47208.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

> Allow overriding base overhead memory
> -
>
> Key: SPARK-47208
> URL: https://issues.apache.org/jira/browse/SPARK-47208
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes, Spark Core, YARN
>Affects Versions: 3.5.1
>Reporter: Joao Correia
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> We can already set the desired overhead memory directly via the 
> _'spark.driver/executor.memoryOverhead'_ flags; however, if that flag is not 
> present, the overhead memory calculation goes as follows:
> {code:java}
> overhead_memory = Max(384, 'spark.driver/executor.memory' * 
> 'spark.driver/executor.memoryOverheadFactor')
> where the 'memoryOverheadFactor' flag defaults to 0.1{code}
> There are times when being able to override the 384 MiB minimum directly can 
> be beneficial. We may have a scenario where a lot of off-heap operations are 
> performed (e.g. using package managers or native 
> compression/decompression) where we don't need a large JVM heap but may 
> still need a significant amount of memory on the Spark node. 
> Using the '{_}memoryOverheadFactor{_}' flag may not be appropriate, since we 
> may not want the overhead allocation to scale directly with JVM memory, for 
> cost-saving or resource-limitation reasons.
> As such, I propose the addition of a 
> 'spark.driver/executor.minMemoryOverhead' flag, which can be used to override 
> the 384 MiB value used in the overhead calculation.
> The memory overhead calculation will then be:
> {code:java}
> min_memory = 
> sparkConf.get('spark.driver/executor.minMemoryOverhead').getOrElse(384)
> overhead_memory = Max(min_memory, 'spark.driver/executor.memory' * 
> 'spark.driver/executor.memoryOverheadFactor'){code}
> PR: https://github.com/apache/spark/pull/45240  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47388) Pass messageParameters by name to require()

2024-03-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47388:
---
Labels: pull-request-available  (was: )

> Pass messageParameters by name to require()
> ---
>
> Key: SPARK-47388
> URL: https://issues.apache.org/jira/browse/SPARK-47388
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>  Labels: pull-request-available
>
> Passing *messageParameters* by value, regardless of whether the requirement 
> holds, might introduce a perf regression. We need to pass *messageParameters* 
> by name to avoid eager instantiation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-47336) Provide to PySpark a functionality to get estimated size of DataFrame in bytes

2024-03-14 Thread Semyon Sinchenko (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827078#comment-17827078
 ] 

Semyon Sinchenko commented on SPARK-47336:
--

[~grundprinzip-db] what do you think about 
`DataFrame.approximate_size_in_bytes() -> float` (or 
`DataFrame.approximateSizeInBytes() -> float`)? Or, for example, 
`DataFrame.approx_size_bytes()` to avoid very long names?

P.S. I would like to try to implement it; could you assign it to me?

> Provide to PySpark a functionality to get estimated size of DataFrame in bytes
> --
>
> Key: SPARK-47336
> URL: https://issues.apache.org/jira/browse/SPARK-47336
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Semyon Sinchenko
>Priority: Minor
>
> Something equal to 
> sessionState().executePlan(...).optimizedPlan().stats().sizeInBytes() in 
> JVM Spark. It could be done via a simple call through `_jsparkSession` in 
> regular PySpark and via a plugin for Spark Connect.
>  
> This functionality is useful when one needs to check the feasibility of a 
> broadcast join without modifying the global broadcast threshold.
>  
> The function in the PySpark API may look like: 
> `DataFrame.estimate_size_in_bytes() -> float` or 
> `DataFrame.estimateSizeInBytes() -> float`.
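
As a rough illustration of the JVM-side call the description refers to (a 
sketch only, not a committed API; `estimatedSizeInBytes` is a hypothetical 
helper name):

{code:java}
import org.apache.spark.sql.DataFrame

// The optimizer's statistics carry the size estimate this issue asks to expose.
def estimatedSizeInBytes(df: DataFrame): BigInt =
  df.queryExecution.optimizedPlan.stats.sizeInBytes
{code}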



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-47336) Provide to PySpark a functionality to get estimated size of DataFrame in bytes

2024-03-14 Thread Martin Grund (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827077#comment-17827077
 ] 

Martin Grund commented on SPARK-47336:
--

I think the general idea is great! I would like to propose changing the name 
to reflect that this is most likely a size estimation, though.

> Provide to PySpark a functionality to get estimated size of DataFrame in bytes
> --
>
> Key: SPARK-47336
> URL: https://issues.apache.org/jira/browse/SPARK-47336
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Semyon Sinchenko
>Priority: Minor
>
> Something equal to 
> sessionState().executePlan(...).optimizedPlan().stats().sizeInBytes() in 
> JVM Spark. It could be done via a simple call through `_jsparkSession` in 
> regular PySpark and via a plugin for Spark Connect.
>  
> This functionality is useful when one needs to check the feasibility of a 
> broadcast join without modifying the global broadcast threshold.
>  
> The function in the PySpark API may look like: 
> `DataFrame.estimate_size_in_bytes() -> float` or 
> `DataFrame.estimateSizeInBytes() -> float`.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47379) Improve docker jdbc suite test reliability

2024-03-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47379:
---
Labels: pull-request-available  (was: )

> Improve docker jdbc suite test reliability
> --
>
> Key: SPARK-47379
> URL: https://issues.apache.org/jira/browse/SPARK-47379
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Milan Stefanovic
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47390) PostgresDialect distinguishes TIMESTAMP from TIMESTAMP_TZ

2024-03-14 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-47390:


Assignee: Kent Yao

> PostgresDialect distinguishes TIMESTAMP from TIMESTAMP_TZ
> -
>
> Key: SPARK-47390
> URL: https://issues.apache.org/jira/browse/SPARK-47390
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47390) PostgresDialect distinguishes TIMESTAMP from TIMESTAMP_TZ

2024-03-14 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-47390.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45513
[https://github.com/apache/spark/pull/45513]

> PostgresDialect distinguishes TIMESTAMP from TIMESTAMP_TZ
> -
>
> Key: SPARK-47390
> URL: https://issues.apache.org/jira/browse/SPARK-47390
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47395) Add collate and collation to non-sql APIs

2024-03-14 Thread Stefan Kandic (Jira)
Stefan Kandic created SPARK-47395:
-

 Summary: Add collate and collation to non-sql APIs
 Key: SPARK-47395
 URL: https://issues.apache.org/jira/browse/SPARK-47395
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Stefan Kandic






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47394) Support TIMESTAMP WITH TIME ZONE for H2Dialect

2024-03-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47394:
---
Labels: pull-request-available  (was: )

> Support TIMESTAMP WITH TIME ZONE for H2Dialect
> --
>
> Key: SPARK-47394
> URL: https://issues.apache.org/jira/browse/SPARK-47394
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47394) Support TIMESTAMP WITH TIME ZONE for H2Dialect

2024-03-14 Thread Kent Yao (Jira)
Kent Yao created SPARK-47394:


 Summary: Support TIMESTAMP WITH TIME ZONE for H2Dialect
 Key: SPARK-47394
 URL: https://issues.apache.org/jira/browse/SPARK-47394
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47393) Collation info should be exposed through system view

2024-03-14 Thread Aleksandar Tomic (Jira)
Aleksandar Tomic created SPARK-47393:


 Summary: Collation info should be exposed through system view
 Key: SPARK-47393
 URL: https://issues.apache.org/jira/browse/SPARK-47393
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Aleksandar Tomic






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47392) Compiler stats should respect collation

2024-03-14 Thread Aleksandar Tomic (Jira)
Aleksandar Tomic created SPARK-47392:


 Summary: Compiler stats should respect collation
 Key: SPARK-47392
 URL: https://issues.apache.org/jira/browse/SPARK-47392
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Aleksandar Tomic






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47391) Remove the test case workaround for JDK 8

2024-03-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47391:
---
Labels: pull-request-available  (was: )

> Remove the test case workaround for JDK 8
> -
>
> Key: SPARK-47391
> URL: https://issues.apache.org/jira/browse/SPARK-47391
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
>
> A Spark SQL test case in ExpressionEncoderSuite fails on the Windows operating 
> system.
> {code:java}
> Internal error (java.io.FileNotFoundException): 
> D:\Users\gja\git-forks\spark\sql\catalyst\target\scala-2.13\test-classes\org\apache\spark\sql\catalyst\encoders\ExpressionEncoderSuite$OuterLevelWithVeryVeryVeryLongClassName1$OuterLevelWithVeryVeryVeryLongClassName2$OuterLevelWithVeryVeryVeryLongClassName3$OuterLevelWithVeryVeryVeryLongClassName4$OuterLevelWithVeryVeryVeryLongClassName5$OuterLevelWithVeryVeryVeryLongClassName6$.class
>  (文件名、目录名或卷标语法不正确。)
> java.io.FileNotFoundException: 
> D:\Users\gja\git-forks\spark\sql\catalyst\target\scala-2.13\test-classes\org\apache\spark\sql\catalyst\encoders\ExpressionEncoderSuite$OuterLevelWithVeryVeryVeryLongClassName1$OuterLevelWithVeryVeryVeryLongClassName2$OuterLevelWithVeryVeryVeryLongClassName3$OuterLevelWithVeryVeryVeryLongClassName4$OuterLevelWithVeryVeryVeryLongClassName5$OuterLevelWithVeryVeryVeryLongClassName6$.class
>  (文件名、目录名或卷标语法不正确。)
>   at java.base/java.io.FileInputStream.open0(Native Method)
>   at java.base/java.io.FileInputStream.open(FileInputStream.java:216)
>   at java.base/java.io.FileInputStream.<init>(FileInputStream.java:157)
>   at 
> com.intellij.openapi.util.io.FileUtil.loadFileBytes(FileUtil.java:211)
>   at 
> org.jetbrains.jps.incremental.scala.local.LazyCompiledClass.$anonfun$getContent$1(LazyCompiledClass.scala:18)
>   at scala.Option.getOrElse(Option.scala:201)
>   at 
> org.jetbrains.jps.incremental.scala.local.LazyCompiledClass.getContent(LazyCompiledClass.scala:17)
>   at 
> org.jetbrains.jps.incremental.instrumentation.BaseInstrumentingBuilder.performBuild(BaseInstrumentingBuilder.java:38)
>   at 
> org.jetbrains.jps.incremental.instrumentation.ClassProcessingBuilder.build(ClassProcessingBuilder.java:80)
>   at 
> org.jetbrains.jps.incremental.IncProjectBuilder.runModuleLevelBuilders(IncProjectBuilder.java:1569)
>   at 
> org.jetbrains.jps.incremental.IncProjectBuilder.runBuildersForChunk(IncProjectBuilder.java:1198)
>   at 
> org.jetbrains.jps.incremental.IncProjectBuilder.buildTargetsChunk(IncProjectBuilder.java:1349)
>   at 
> org.jetbrains.jps.incremental.IncProjectBuilder.buildChunkIfAffected(IncProjectBuilder.java:1163)
>   at 
> org.jetbrains.jps.incremental.IncProjectBuilder$BuildParallelizer$1.run(IncProjectBuilder.java:1129)
>   at 
> com.intellij.util.concurrency.BoundedTaskExecutor.doRun(BoundedTaskExecutor.java:244)
>   at 
> com.intellij.util.concurrency.BoundedTaskExecutor.access$200(BoundedTaskExecutor.java:30)
>   at 
> com.intellij.util.concurrency.BoundedTaskExecutor$1.executeFirstTaskAndHelpQueue(BoundedTaskExecutor.java:222)
>   at 
> com.intellij.util.ConcurrencyUtil.runUnderThreadName(ConcurrencyUtil.java:218)
>   at 
> com.intellij.util.concurrency.BoundedTaskExecutor$1.run(BoundedTaskExecutor.java:210)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
>   at java.base/java.lang.Thread.run(Thread.java:842)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47391) Remove the test case workaround for JDK 8

2024-03-14 Thread Jiaan Geng (Jira)
Jiaan Geng created SPARK-47391:
--

 Summary: Remove the test case workaround for JDK 8
 Key: SPARK-47391
 URL: https://issues.apache.org/jira/browse/SPARK-47391
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 4.0.0
Reporter: Jiaan Geng
Assignee: Jiaan Geng


A Spark SQL test case in ExpressionEncoderSuite fails on the Windows operating system.

{code:java}
Internal error (java.io.FileNotFoundException): 
D:\Users\gja\git-forks\spark\sql\catalyst\target\scala-2.13\test-classes\org\apache\spark\sql\catalyst\encoders\ExpressionEncoderSuite$OuterLevelWithVeryVeryVeryLongClassName1$OuterLevelWithVeryVeryVeryLongClassName2$OuterLevelWithVeryVeryVeryLongClassName3$OuterLevelWithVeryVeryVeryLongClassName4$OuterLevelWithVeryVeryVeryLongClassName5$OuterLevelWithVeryVeryVeryLongClassName6$.class
 (文件名、目录名或卷标语法不正确。)
java.io.FileNotFoundException: 
D:\Users\gja\git-forks\spark\sql\catalyst\target\scala-2.13\test-classes\org\apache\spark\sql\catalyst\encoders\ExpressionEncoderSuite$OuterLevelWithVeryVeryVeryLongClassName1$OuterLevelWithVeryVeryVeryLongClassName2$OuterLevelWithVeryVeryVeryLongClassName3$OuterLevelWithVeryVeryVeryLongClassName4$OuterLevelWithVeryVeryVeryLongClassName5$OuterLevelWithVeryVeryVeryLongClassName6$.class
 (文件名、目录名或卷标语法不正确。)
at java.base/java.io.FileInputStream.open0(Native Method)
at java.base/java.io.FileInputStream.open(FileInputStream.java:216)
at java.base/java.io.FileInputStream.<init>(FileInputStream.java:157)
at 
com.intellij.openapi.util.io.FileUtil.loadFileBytes(FileUtil.java:211)
at 
org.jetbrains.jps.incremental.scala.local.LazyCompiledClass.$anonfun$getContent$1(LazyCompiledClass.scala:18)
at scala.Option.getOrElse(Option.scala:201)
at 
org.jetbrains.jps.incremental.scala.local.LazyCompiledClass.getContent(LazyCompiledClass.scala:17)
at 
org.jetbrains.jps.incremental.instrumentation.BaseInstrumentingBuilder.performBuild(BaseInstrumentingBuilder.java:38)
at 
org.jetbrains.jps.incremental.instrumentation.ClassProcessingBuilder.build(ClassProcessingBuilder.java:80)
at 
org.jetbrains.jps.incremental.IncProjectBuilder.runModuleLevelBuilders(IncProjectBuilder.java:1569)
at 
org.jetbrains.jps.incremental.IncProjectBuilder.runBuildersForChunk(IncProjectBuilder.java:1198)
at 
org.jetbrains.jps.incremental.IncProjectBuilder.buildTargetsChunk(IncProjectBuilder.java:1349)
at 
org.jetbrains.jps.incremental.IncProjectBuilder.buildChunkIfAffected(IncProjectBuilder.java:1163)
at 
org.jetbrains.jps.incremental.IncProjectBuilder$BuildParallelizer$1.run(IncProjectBuilder.java:1129)
at 
com.intellij.util.concurrency.BoundedTaskExecutor.doRun(BoundedTaskExecutor.java:244)
at 
com.intellij.util.concurrency.BoundedTaskExecutor.access$200(BoundedTaskExecutor.java:30)
at 
com.intellij.util.concurrency.BoundedTaskExecutor$1.executeFirstTaskAndHelpQueue(BoundedTaskExecutor.java:222)
at 
com.intellij.util.ConcurrencyUtil.runUnderThreadName(ConcurrencyUtil.java:218)
at 
com.intellij.util.concurrency.BoundedTaskExecutor$1.run(BoundedTaskExecutor.java:210)
at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:842)
{code}




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47390) PostgresDialect distinguishes TIMESTAMP from TIMESTAMP_TZ

2024-03-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47390:
---
Labels: pull-request-available  (was: )

> PostgresDialect distinguishes TIMESTAMP from TIMESTAMP_TZ
> -
>
> Key: SPARK-47390
> URL: https://issues.apache.org/jira/browse/SPARK-47390
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47390) PostgresDialect distinguishes TIMESTAMP from TIMESTAMP_TZ

2024-03-14 Thread Kent Yao (Jira)
Kent Yao created SPARK-47390:


 Summary: PostgresDialect distinguishes TIMESTAMP from TIMESTAMP_TZ
 Key: SPARK-47390
 URL: https://issues.apache.org/jira/browse/SPARK-47390
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47389) spark jdbc one insert with multiple values

2024-03-14 Thread melin (Jira)
melin created SPARK-47389:
-

 Summary: spark jdbc one insert with multiple values
 Key: SPARK-47389
 URL: https://issues.apache.org/jira/browse/SPARK-47389
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 4.0.0
Reporter: melin


Many databases support writing multiple rows of data with a single INSERT 
statement. Write performance is better than batch execution of multiple 
single-row statements.

 

https://github.com/apache/spark/blob/9986462811f160eacd766da8a4e14a9cbb4b8710/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L725

 

example:

 
{code:java}
INSERT INTO Customers (Name, Age, Active) VALUES ('Name1',21,1);
INSERT INTO Customers (Name, Age, Active) VALUES ('Name2',21,1);
-- vs
INSERT INTO Customers (Name, Age, Active) VALUES ('Name1',21,1), ('Name2',21,1);
{code}
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47388) Pass messageParameters by name to require()

2024-03-14 Thread Max Gekk (Jira)
Max Gekk created SPARK-47388:


 Summary: Pass messageParameters by name to require()
 Key: SPARK-47388
 URL: https://issues.apache.org/jira/browse/SPARK-47388
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Max Gekk
Assignee: Max Gekk


Passing *messageParameters* by value, regardless of whether the requirement 
holds, might introduce a perf regression. We need to pass *messageParameters* 
by name to avoid eager instantiation.
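
A minimal sketch of the by-name idea (simplified signature for illustration, 
not the actual Spark code):

{code:java}
// With `=> Map[...]`, the argument is evaluated only if the requirement fails,
// so the Map is never built on the hot path where the check passes.
def require(requirement: Boolean, messageParameters: => Map[String, String]): Unit = {
  if (!requirement) {
    throw new IllegalArgumentException(s"requirement failed: $messageParameters")
  }
}

// The Map construction below is skipped entirely because the condition holds.
require(1 + 1 == 2, Map("expected" -> "foo", "found" -> "bar"))
{code}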



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47385) Tuple encoder produces wrong results with Option inputs

2024-03-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-47385.
-
Fix Version/s: 3.4.3
   3.5.2
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 45508
[https://github.com/apache/spark/pull/45508]

> Tuple encoder produces wrong results with Option inputs
> ---
>
> Key: SPARK-47385
> URL: https://issues.apache.org/jira/browse/SPARK-47385
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.4
>Reporter: Chenhao Li
>Assignee: Chenhao Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.3, 3.5.2, 4.0.0
>
>
>  
> The behavior of tupled encoders on the Option type was changed by 
> https://github.com/apache/spark/pull/40755.
> {code:java}
> import org.apache.spark.sql.{Encoders, Encoder} 
> case class Required(name: String) 
> case class Optional(name: String) 
> implicit val enc: Encoder[(Required, Option[Optional])] = 
> Encoders.tuple(Encoders.product[Required], 
> Encoders.product[Option[Optional]]) 
>  
> spark.createDataFrame(Seq( 
> (Required("1"), Some(Optional("1"))), 
> (Required("2"), None) 
> )).as[(Required, Option[Optional])].collect(){code}
> Before the PR, the result is:
> {code:java}
> Array((Required(1),Some(Optional(1))), (Required(2),None)){code}
> After the PR, the result is:
> {code:java}
> Array((Required(1),Some(Optional(1))), (Required(2),null)) {code}
> which is incorrect because the original input is None rather than null.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47385) Tuple encoder produces wrong results with Option inputs

2024-03-14 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-47385:
---

Assignee: Chenhao Li

> Tuple encoder produces wrong results with Option inputs
> ---
>
> Key: SPARK-47385
> URL: https://issues.apache.org/jira/browse/SPARK-47385
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.4
>Reporter: Chenhao Li
>Assignee: Chenhao Li
>Priority: Major
>  Labels: pull-request-available
>
>  
> The behavior of tupled encoders on the Option type was changed by 
> https://github.com/apache/spark/pull/40755.
> {code:java}
> import org.apache.spark.sql.{Encoders, Encoder} 
> case class Required(name: String) 
> case class Optional(name: String) 
> implicit val enc: Encoder[(Required, Option[Optional])] = 
> Encoders.tuple(Encoders.product[Required], 
> Encoders.product[Option[Optional]]) 
>  
> spark.createDataFrame(Seq( 
> (Required("1"), Some(Optional("1"))), 
> (Required("2"), None) 
> )).as[(Required, Option[Optional])].collect(){code}
> Before the PR, the result is:
> {code:java}
> Array((Required(1),Some(Optional(1))), (Required(2),None)){code}
> After the PR, the result is:
> {code:java}
> Array((Required(1),Some(Optional(1))), (Required(2),null)) {code}
> which is incorrect because the original input is None rather than null.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47387) Remove some unused error classes

2024-03-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47387:
---
Labels: pull-request-available  (was: )

> Remove some unused error classes
> 
>
> Key: SPARK-47387
> URL: https://issues.apache.org/jira/browse/SPARK-47387
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47374) Fix connect-repl `usage prompt` & `docs link`

2024-03-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-47374.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45494
[https://github.com/apache/spark/pull/45494]

> Fix connect-repl `usage prompt` & `docs link`
> -
>
> Key: SPARK-47374
> URL: https://issues.apache.org/jira/browse/SPARK-47374
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, Documentation
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47374) Fix connect-repl `usage prompt` & `docs link`

2024-03-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-47374:


Assignee: BingKun Pan

> Fix connect-repl `usage prompt` & `docs link`
> -
>
> Key: SPARK-47374
> URL: https://issues.apache.org/jira/browse/SPARK-47374
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, Documentation
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47377) Factor out tests from `SparkConnectSQLTestCase`

2024-03-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-47377:


Assignee: Ruifeng Zheng

> Factor out tests from `SparkConnectSQLTestCase`
> ---
>
> Key: SPARK-47377
> URL: https://issues.apache.org/jira/browse/SPARK-47377
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Tests
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47377) Factor out tests from `SparkConnectSQLTestCase`

2024-03-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-47377.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45497
[https://github.com/apache/spark/pull/45497]

> Factor out tests from `SparkConnectSQLTestCase`
> ---
>
> Key: SPARK-47377
> URL: https://issues.apache.org/jira/browse/SPARK-47377
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Tests
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47387) Remove some unused error classes

2024-03-14 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-47387:
---

 Summary: Remove some unused error classes
 Key: SPARK-47387
 URL: https://issues.apache.org/jira/browse/SPARK-47387
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: BingKun Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org