[jira] [Updated] (SPARK-46453) SessionHolder doesn't throw exceptions from internalError()
[ https://issues.apache.org/jira/browse/SPARK-46453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46453: --- Labels: pull-request-available (was: ) > SessionHolder doesn't throw exceptions from internalError() > --- > > Key: SPARK-46453 > URL: https://issues.apache.org/jira/browse/SPARK-46453 > Project: Spark > Issue Type: Bug > Components: Connect, SQL >Affects Versions: 3.5.0, 4.0.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Labels: pull-request-available > > Need to throw the SparkException returned by internalError() in SessionHolder; otherwise users won't see the internal error. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46453) SessionHolder doesn't throw exceptions from internalError()
Max Gekk created SPARK-46453: Summary: SessionHolder doesn't throw exceptions from internalError() Key: SPARK-46453 URL: https://issues.apache.org/jira/browse/SPARK-46453 Project: Spark Issue Type: Bug Components: Connect, SQL Affects Versions: 3.5.0, 4.0.0 Reporter: Max Gekk Assignee: Max Gekk Need to throw the SparkException returned by internalError() in SessionHolder; otherwise users won't see the internal error.
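The bug class described here can be illustrated with a hypothetical plain-Python sketch (not Spark code; all names below are assumptions for illustration): an error factory that *returns* an exception object does nothing unless the caller actually raises the returned value.

```python
class SparkException(Exception):
    """Stand-in for org.apache.spark.SparkException (simplified for illustration)."""


def internal_error(msg: str) -> SparkException:
    # Like the internalError() helper described in the issue, this
    # *returns* an exception rather than raising it, so the caller
    # is responsible for raising the result.
    return SparkException(f"[INTERNAL_ERROR] {msg}")


def buggy_lookup(sessions: dict, key: str):
    if key not in sessions:
        internal_error(f"Session {key} not found")  # bug: return value dropped, falls through silently
    return sessions.get(key)


def fixed_lookup(sessions: dict, key: str):
    if key not in sessions:
        raise internal_error(f"Session {key} not found")  # fix: raise the returned exception
    return sessions[key]
```

With the buggy version the lookup silently returns `None`; with the fix the caller sees the internal error, which matches the behavior the issue asks for.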
[jira] [Updated] (SPARK-46452) Add a new API in DSv2 DataWriter to write an iterator of records
[ https://issues.apache.org/jira/browse/SPARK-46452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46452: --- Labels: pull-request-available (was: ) > Add a new API in DSv2 DataWriter to write an iterator of records > > > Key: SPARK-46452 > URL: https://issues.apache.org/jira/browse/SPARK-46452 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > Labels: pull-request-available > > Add a new API that takes an iterator of records.
[jira] [Created] (SPARK-46452) Add a new API in DSv2 DataWriter to write an iterator of records
Allison Wang created SPARK-46452: Summary: Add a new API in DSv2 DataWriter to write an iterator of records Key: SPARK-46452 URL: https://issues.apache.org/jira/browse/SPARK-46452 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Allison Wang Add a new API that takes an iterator of records.
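The shape of such an API can be sketched in plain Python (a sketch only; the real DSv2 `DataWriter` is a Java/Scala interface and the method names here are assumptions): an iterator-based entry point can default to delegating to the per-record method, so implementations only override it when they have a bulk-optimized path.

```python
from typing import Iterator


class DataWriter:
    """Hypothetical Python analogue of a DSv2 DataWriter (names are assumptions)."""

    def __init__(self):
        self.rows = []

    def write(self, record) -> None:
        # Existing per-record entry point.
        self.rows.append(record)

    def write_all(self, records: Iterator) -> None:
        # Sketch of the proposed iterator-based entry point: the default
        # simply loops over the per-record method, so existing writers
        # keep working while bulk writers can override this.
        for record in records:
            self.write(record)
```

A subclass could override `write_all` to, say, buffer and flush records in batches without changing callers.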
[jira] [Updated] (SPARK-46451) Reorganize `GroupbyStatTests`
[ https://issues.apache.org/jira/browse/SPARK-46451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46451: --- Labels: pull-request-available (was: ) > Reorganize `GroupbyStatTests` > - > > Key: SPARK-46451 > URL: https://issues.apache.org/jira/browse/SPARK-46451 > Project: Spark > Issue Type: Test > Components: PS, Tests >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Minor > Labels: pull-request-available >
[jira] [Created] (SPARK-46451) Reorganize `GroupbyStatTests`
Ruifeng Zheng created SPARK-46451: - Summary: Reorganize `GroupbyStatTests` Key: SPARK-46451 URL: https://issues.apache.org/jira/browse/SPARK-46451 Project: Spark Issue Type: Test Components: PS, Tests Affects Versions: 4.0.0 Reporter: Ruifeng Zheng
[jira] [Resolved] (SPARK-46446) Correctness bug in correlated subquery with OFFSET
[ https://issues.apache.org/jira/browse/SPARK-46446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-46446. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44401 [https://github.com/apache/spark/pull/44401]

> Correctness bug in correlated subquery with OFFSET
> --------------------------------------------------
>
> Key: SPARK-46446
> URL: https://issues.apache.org/jira/browse/SPARK-46446
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Jack Chen
> Assignee: Jack Chen
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Subqueries with correlation under LIMIT with OFFSET have a correctness bug, introduced recently when support for correlation under OFFSET was enabled but OFFSET was not handled correctly. (So we went from unsupported, query throws an error -> wrong results.)
> It's a bug in all types of correlated subqueries: scalar, lateral, IN, and EXISTS.
> It's easy to reproduce with a query like:
> {code:java}
> create table x(x1 int, x2 int);
> insert into x values (1, 1), (2, 2);
> create table y(y1 int, y2 int);
> insert into y values (1, 1), (1, 2), (2, 4);
> select * from x where exists (select * from y where x1 = y1 limit 1 offset 2){code}
> Correct result: empty set, see postgres: [https://www.db-fiddle.com/f/dtXNn7hwDnemiCTUhvwgYM/0]
> Spark result: Array([2,2])
> The [PR|https://github.com/apache/spark/pull/43111/files/324a106611e6d62c31535cfc43863fdaa16e5dda#diff-583171e935b2dc349378063a5841c5b98b30a2d57ac3743a9eccfe7bffcb8f2aR1403] where it was introduced added a test for it, but the golden file results for the test were actually incorrect and we didn't notice. (The bug was initially found by https://github.com/apache/spark/pull/44084)
> I'll work on both:
> * Adding support for offset in DecorrelateInnerQuery (the transformation is into a filter on a row_number window function, similar to limit).
> * Adding a feature flag to enable/disable offset-in-subquery support.
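The expected semantics of the repro can be checked with a plain-Python model (a sketch of the SQL semantics, not Spark code): the correlated EXISTS subquery filters `y` per outer row, then OFFSET skips rows before LIMIT takes any, so no outer row should survive.

```python
# Tables from the repro query.
x = [(1, 1), (2, 2)]
y = [(1, 1), (1, 2), (2, 4)]


def exists_with_limit_offset(x1: int, limit: int, offset: int) -> bool:
    # Correlated subquery: rows of y matching this outer row,
    # then OFFSET skips `offset` rows before LIMIT takes `limit` rows.
    matches = [row for row in y if row[0] == x1]
    return len(matches[offset:offset + limit]) > 0


# select * from x where exists (select * from y where x1 = y1 limit 1 offset 2)
result = [row for row in x if exists_with_limit_offset(row[0], limit=1, offset=2)]
# No outer row has a third matching y row, so the result is the empty set,
# matching the PostgreSQL behavior cited in the report (not Spark's Array([2,2])).
```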
[jira] [Assigned] (SPARK-46446) Correctness bug in correlated subquery with OFFSET
[ https://issues.apache.org/jira/browse/SPARK-46446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-46446: --- Assignee: Jack Chen

> Correctness bug in correlated subquery with OFFSET
> Key: SPARK-46446
> URL: https://issues.apache.org/jira/browse/SPARK-46446
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Jack Chen
> Assignee: Jack Chen
> Priority: Major
> Labels: pull-request-available
[jira] [Assigned] (SPARK-46308) Forbid recursive error handling by adding recursion guards
[ https://issues.apache.org/jira/browse/SPARK-46308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-46308: Assignee: Alice Sayutina > Forbid recursive error handling by adding recursion guards > -- > > Key: SPARK-46308 > URL: https://issues.apache.org/jira/browse/SPARK-46308 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 4.0.0 >Reporter: Alice Sayutina >Assignee: Alice Sayutina >Priority: Major > Labels: pull-request-available >
[jira] [Resolved] (SPARK-46308) Forbid recursive error handling by adding recursion guards
[ https://issues.apache.org/jira/browse/SPARK-46308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-46308. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44210 [https://github.com/apache/spark/pull/44210] > Forbid recursive error handling by adding recursion guards > -- > > Key: SPARK-46308 > URL: https://issues.apache.org/jira/browse/SPARK-46308 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 4.0.0 >Reporter: Alice Sayutina >Assignee: Alice Sayutina >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 >
[jira] [Commented] (SPARK-46450) session_window doesn't identify sessions with provided gap when used as a window function
[ https://issues.apache.org/jira/browse/SPARK-46450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17798381#comment-17798381 ] Jungtaek Lim commented on SPARK-46450: -- It's a missing piece, and maybe we will have to document it: session windows only work properly with batch/streaming aggregation. If you use session_window as a normal function without feeding the value into an aggregation, session merging is never triggered.

> session_window doesn't identify sessions with provided gap when used as a window function
> Key: SPARK-46450
> URL: https://issues.apache.org/jira/browse/SPARK-46450
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.4.1, 3.5.0
> Reporter: Juan Pumarino
> Priority: Minor
[jira] [Comment Edited] (SPARK-46450) session_window doesn't identify sessions with provided gap when used as a window function
[ https://issues.apache.org/jira/browse/SPARK-46450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17798381#comment-17798381 ] Jungtaek Lim edited comment on SPARK-46450 at 12/18/23 11:37 PM: - It's a missing piece, and maybe we will have to document it: session windows only work properly with batch/streaming aggregation. (Because merging sessions requires custom logic.) If you use session_window as a normal function without feeding the value into an aggregation, session merging is never triggered.

was (Author: kabhwan): It's a missing one and maybe we will have to document - session window is only working properly with batch/streaming aggregation. If you use it as normal function and not ingesting the value to aggregation, merging sessions is never triggered.

> session_window doesn't identify sessions with provided gap when used as a window function
> Key: SPARK-46450
> URL: https://issues.apache.org/jira/browse/SPARK-46450
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.4.1, 3.5.0
> Reporter: Juan Pumarino
> Priority: Minor
[jira] [Created] (SPARK-46450) session_window doesn't identify sessions with provided gap when used as a window function
Juan Pumarino created SPARK-46450: - Summary: session_window doesn't identify sessions with provided gap when used as a window function Key: SPARK-46450 URL: https://issues.apache.org/jira/browse/SPARK-46450 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.5.0, 3.4.1 Reporter: Juan Pumarino

{{PARTITION BY session_window}} doesn't produce the expected results. Here's an example:

{code:sql}
SELECT
  id,
  ts,
  collect_list(id) OVER (PARTITION BY session_window(ts, '1 hour')) as window_ids
FROM VALUES
  (1, "2023-12-11 01:10"),
  (2, "2023-12-11 01:15"),
  (3, "2023-12-11 01:40"),
  (4, "2023-12-11 02:05"),
  (5, "2023-12-11 03:15"),
  (6, "2023-12-11 03:20"),
  (7, "2023-12-11 04:10"),
  (8, "2023-12-11 05:05")
AS tab(id, ts)
{code}

Actual result:

{code:java}
+---+----------------+----------+
|id |ts              |window_ids|
+---+----------------+----------+
|1  |2023-12-11 01:10|[1]       |
|2  |2023-12-11 01:15|[2]       |
|3  |2023-12-11 01:40|[3]       |
|4  |2023-12-11 02:05|[4]       |
|5  |2023-12-11 03:15|[5]       |
|6  |2023-12-11 03:20|[6]       |
|7  |2023-12-11 04:10|[7]       |
|8  |2023-12-11 05:05|[8]       |
+---+----------------+----------+
{code}

Expected result, assigning rows to two sessions with a 1-hour gap:

{code:java}
+---+----------------+------------+
|id |ts              |window_ids  |
+---+----------------+------------+
|1  |2023-12-11 01:10|[1, 2, 3, 4]|
|2  |2023-12-11 01:15|[1, 2, 3, 4]|
|3  |2023-12-11 01:40|[1, 2, 3, 4]|
|4  |2023-12-11 02:05|[1, 2, 3, 4]|
|5  |2023-12-11 03:15|[5, 6, 7, 8]|
|6  |2023-12-11 03:20|[5, 6, 7, 8]|
|7  |2023-12-11 04:10|[5, 6, 7, 8]|
|8  |2023-12-11 05:05|[5, 6, 7, 8]|
+---+----------------+------------+
{code}

I compared its behavior with the results as a grouping function and with how {{window()}} behaves in both cases, which seems to confirm that the result is inconsistent. Here are the other examples:

*{{group by window()}}*

{code:sql}
SELECT
  collect_list(id) AS ids,
  collect_list(ts) AS tss,
  window
FROM VALUES
  (1, "2023-12-11 01:10"),
  (2, "2023-12-11 01:15"),
  (3, "2023-12-11 01:40"),
  (4, "2023-12-11 02:05"),
  (5, "2023-12-11 03:15"),
  (6, "2023-12-11 03:20"),
  (7, "2023-12-11 04:10"),
  (8, "2023-12-11 05:05")
AS tab(id, ts)
GROUP by window(ts, '1 hour')
{code}

Correctly assigns rows to 1-hour windows:

{code:java}
+---------+-------------------------------------------------------+-------------------------------------------+
|ids      |tss                                                    |window                                     |
+---------+-------------------------------------------------------+-------------------------------------------+
|[1, 2, 3]|[2023-12-11 01:10, 2023-12-11 01:15, 2023-12-11 01:40] |{2023-12-11 01:00:00, 2023-12-11 02:00:00} |
|[4]      |[2023-12-11 02:05]                                     |{2023-12-11 02:00:00, 2023-12-11 03:00:00} |
|[5, 6]   |[2023-12-11 03:15, 2023-12-11 03:20]                   |{2023-12-11 03:00:00, 2023-12-11 04:00:00} |
|[7]      |[2023-12-11 04:10]                                     |{2023-12-11 04:00:00, 2023-12-11 05:00:00} |
|[8]      |[2023-12-11 05:05]                                     |{2023-12-11 05:00:00, 2023-12-11 06:00:00} |
+---------+-------------------------------------------------------+-------------------------------------------+
{code}

*{{group by session_window()}}*

{code:sql}
SELECT
  collect_list(id) AS ids,
  collect_list(ts) AS tss,
  session_window
FROM VALUES
  (1, "2023-12-11 01:10"),
  (2, "2023-12-11 01:15"),
  (3, "2023-12-11 01:40"),
  (4, "2023-12-11 02:05"),
  (5, "2023-12-11 03:15"),
  (6, "2023-12-11 03:20"),
  (7, "2023-12-11 04:10"),
  (8, "2023-12-11 05:05")
AS tab(id, ts)
GROUP by session_window(ts, '1 hour')
{code}

Correctly assigns rows to two sessions with a 1-hour gap:

{code:java}
+------------+-------------------------------------------------------------------------+-------------------------------------------+
|ids         |tss                                                                      |session_window                             |
+------------+-------------------------------------------------------------------------+-------------------------------------------+
|[1, 2, 3, 4]|[2023-12-11 01:10, 2023-12-11 01:15, 2023-12-11 01:40, 2023-12-11 02:05] |{2023-12-11 01:10:00, 2023-12-11 03:05:00} |
|[5, 6, 7, 8]|[2023-12-11 03:15, 2023-12-11 03:20, 2023-12-11 04:10, 2023-12-11 05:05] |{2023-12-11 03:15:00, 2023-12-11 06:05:00} |
+------------+-------------------------------------------------------------------------+-------------------------------------------+
{code}

*{{partition by window()}}*

{code:sql}
SELECT
  id,
  ts,
  collect_list(id) OVER (PARTITION BY window(ts, '1 hour')) as window_ids
FROM VALUES
  (1, "2023-12-11 01:10"),
  (2, "2023-12-11 01:15"),
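The session-merging step that the comment above says only runs inside an aggregation can be modeled in plain Python (a sketch of gap-based session assignment, not Spark code): sort by timestamp and start a new session whenever the gap since the previous event reaches the configured gap, which yields the two expected sessions for this data.

```python
from datetime import datetime, timedelta

# (id, ts) rows from the repro query.
rows = [
    (1, "2023-12-11 01:10"), (2, "2023-12-11 01:15"),
    (3, "2023-12-11 01:40"), (4, "2023-12-11 02:05"),
    (5, "2023-12-11 03:15"), (6, "2023-12-11 03:20"),
    (7, "2023-12-11 04:10"), (8, "2023-12-11 05:05"),
]


def session_windows(rows, gap=timedelta(hours=1)):
    # Sort events by timestamp, then start a new session whenever the
    # time since the previous event is >= the gap. This is the merge
    # step that Spark only triggers inside a (batch/streaming) aggregation.
    events = sorted((datetime.strptime(ts, "%Y-%m-%d %H:%M"), rid) for rid, ts in rows)
    sessions, current = [], [events[0][1]]
    for (prev_ts, _), (cur_ts, rid) in zip(events, events[1:]):
        if cur_ts - prev_ts >= gap:
            sessions.append(current)
            current = []
        current.append(rid)
    sessions.append(current)
    return sessions
```

Running this over the repro data produces the two sessions `[1, 2, 3, 4]` and `[5, 6, 7, 8]` that the reporter expected from `PARTITION BY session_window`.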
[jira] [Resolved] (SPARK-46411) Change to use bcprov/bcpkix-jdk18on for test
[ https://issues.apache.org/jira/browse/SPARK-46411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46411. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44359 [https://github.com/apache/spark/pull/44359] > Change to use bcprov/bcpkix-jdk18on for test > > > Key: SPARK-46411 > URL: https://issues.apache.org/jira/browse/SPARK-46411 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 >
[jira] [Assigned] (SPARK-46411) Change to use bcprov/bcpkix-jdk18on for test
[ https://issues.apache.org/jira/browse/SPARK-46411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-46411: - Assignee: Yang Jie > Change to use bcprov/bcpkix-jdk18on for test > > > Key: SPARK-46411 > URL: https://issues.apache.org/jira/browse/SPARK-46411 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available >
[jira] [Updated] (SPARK-46447) Remove the legacy datetime rebasing SQL configs
[ https://issues.apache.org/jira/browse/SPARK-46447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46447: --- Labels: pull-request-available (was: ) > Remove the legacy datetime rebasing SQL configs > --- > > Key: SPARK-46447 > URL: https://issues.apache.org/jira/browse/SPARK-46447 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Minor > Labels: pull-request-available > > Remove already deprecated SQL configs (alternatives to other configs): > - spark.sql.legacy.parquet.int96RebaseModeInWrite > - spark.sql.legacy.parquet.datetimeRebaseModeInWrite > - spark.sql.legacy.parquet.int96RebaseModeInRead > - spark.sql.legacy.parquet.datetimeRebaseModeInRead > - spark.sql.legacy.avro.datetimeRebaseModeInWrite > - spark.sql.legacy.avro.datetimeRebaseModeInRead
[jira] [Resolved] (SPARK-46439) Move IO-related tests to `pyspark.pandas.tests.io.*`
[ https://issues.apache.org/jira/browse/SPARK-46439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46439. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44395 [https://github.com/apache/spark/pull/44395] > Move IO-related tests to `pyspark.pandas.tests.io.*` > > > Key: SPARK-46439 > URL: https://issues.apache.org/jira/browse/SPARK-46439 > Project: Spark > Issue Type: Test > Components: PS, Tests >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 >
[jira] [Resolved] (SPARK-46420) Remove unused transport from SparkSQLCLIDriver
[ https://issues.apache.org/jira/browse/SPARK-46420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46420. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44370 [https://github.com/apache/spark/pull/44370] > Remove unused transport from SparkSQLCLIDriver > -- > > Key: SPARK-46420 > URL: https://issues.apache.org/jira/browse/SPARK-46420 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 >
[jira] [Assigned] (SPARK-46289) Exception when ordering by UDT in interpreted mode
[ https://issues.apache.org/jira/browse/SPARK-46289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-46289: - Assignee: Bruce Robbins

> Exception when ordering by UDT in interpreted mode
> --------------------------------------------------
>
> Key: SPARK-46289
> URL: https://issues.apache.org/jira/browse/SPARK-46289
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.3.3, 3.4.2, 3.5.0
> Reporter: Bruce Robbins
> Assignee: Bruce Robbins
> Priority: Minor
> Labels: pull-request-available
>
> In interpreted mode, ordering by a UDT will result in an exception. For example:
> {noformat}
> import org.apache.spark.ml.linalg.{DenseVector, Vector}
> val df = Seq.tabulate(30) { x =>
>   (x, x + 1, x + 2, new DenseVector(Array((x/100.0).toDouble, ((x + 1)/100.0).toDouble, ((x + 3)/100.0).toDouble)))
> }.toDF("id", "c1", "c2", "c3")
> df.createOrReplaceTempView("df")
>
> // this works
> sql("select * from df order by c3").collect
>
> sql("set spark.sql.codegen.wholeStage=false")
> sql("set spark.sql.codegen.factoryMode=NO_CODEGEN")
>
> // this gets an error
> sql("select * from df order by c3").collect
> {noformat}
> The second {{collect}} action results in the following exception:
> {noformat}
> org.apache.spark.SparkIllegalArgumentException: Type UninitializedPhysicalType does not support ordered operations.
>   at org.apache.spark.sql.errors.QueryExecutionErrors$.orderedOperationUnsupportedByDataTypeError(QueryExecutionErrors.scala:348)
>   at org.apache.spark.sql.catalyst.types.UninitializedPhysicalType$.ordering(PhysicalDataType.scala:332)
>   at org.apache.spark.sql.catalyst.types.UninitializedPhysicalType$.ordering(PhysicalDataType.scala:329)
>   at org.apache.spark.sql.catalyst.expressions.InterpretedOrdering.compare(ordering.scala:60)
>   at org.apache.spark.sql.catalyst.expressions.InterpretedOrdering.compare(ordering.scala:39)
>   at org.apache.spark.sql.execution.UnsafeExternalRowSorter$RowComparator.compare(UnsafeExternalRowSorter.java:254)
> {noformat}
> Note: You don't get an error if you use {{show}} rather than {{collect}}. This is because {{show}} will implicitly add a {{limit}}, in which case the ordering is performed by {{TakeOrderedAndProject}} rather than {{UnsafeExternalRowSorter}}.
[jira] [Resolved] (SPARK-46289) Exception when ordering by UDT in interpreted mode
[ https://issues.apache.org/jira/browse/SPARK-46289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46289. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44361 [https://github.com/apache/spark/pull/44361]

> Exception when ordering by UDT in interpreted mode
> Key: SPARK-46289
> URL: https://issues.apache.org/jira/browse/SPARK-46289
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.3.3, 3.4.2, 3.5.0
> Reporter: Bruce Robbins
> Assignee: Bruce Robbins
> Priority: Minor
> Labels: pull-request-available
> Fix For: 4.0.0
[jira] [Resolved] (SPARK-46438) Fix the scope of toggleFlamegraph and improve test
[ https://issues.apache.org/jira/browse/SPARK-46438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46438. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44394 [https://github.com/apache/spark/pull/44394] > Fix the scope of toggleFlamegraph and improve test > -- > > Key: SPARK-46438 > URL: https://issues.apache.org/jira/browse/SPARK-46438 > Project: Spark > Issue Type: Sub-task > Components: UI >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 >
[jira] [Created] (SPARK-46449) Add ability to create databases via Catalog API
Nicholas Chammas created SPARK-46449: Summary: Add ability to create databases via Catalog API Key: SPARK-46449 URL: https://issues.apache.org/jira/browse/SPARK-46449 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: Nicholas Chammas As of Spark 3.5, the only way to create a database is via SQL. The Catalog API should offer an equivalent. Perhaps something like: {code:python} spark.catalog.createDatabase( name: str, existsOk: bool = False, comment: str = None, location: str = None, properties: dict = None, ) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
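To make the proposal above concrete, here is a hedged, pure-Python sketch of the DDL such a Catalog method might issue under the hood. The function name `create_database_ddl` and its parameters simply mirror the signature proposed in the ticket; they are not a real PySpark API, and the generated statement assumes Spark SQL's CREATE DATABASE syntax (COMMENT, LOCATION, WITH DBPROPERTIES).

```python
# Hypothetical helper mirroring the proposed spark.catalog.createDatabase()
# signature; it only builds the equivalent Spark SQL DDL string.
def create_database_ddl(name, exists_ok=False, comment=None,
                        location=None, properties=None):
    parts = ["CREATE DATABASE"]
    if exists_ok:
        parts.append("IF NOT EXISTS")
    parts.append(name)
    if comment is not None:
        parts.append(f"COMMENT '{comment}'")
    if location is not None:
        parts.append(f"LOCATION '{location}'")
    if properties:
        props = ", ".join(f"'{k}'='{v}'" for k, v in properties.items())
        parts.append(f"WITH DBPROPERTIES ({props})")
    return " ".join(parts)

# The proposed API call would then reduce to something like
# spark.sql(create_database_ddl("sales", exists_ok=True, comment="sales data"))
print(create_database_ddl("sales", exists_ok=True, comment="sales data"))
```

This also illustrates why the API would be a thin convenience layer: every proposed parameter already has a direct SQL counterpart, so the main win is discoverability and type checking on the Python side.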
[jira] [Updated] (SPARK-46448) InlineTable column should be nullable when converted to Decimal might cause null
[ https://issues.apache.org/jira/browse/SPARK-46448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Wang updated SPARK-46448: - Summary: InlineTable column should be nullable when converted to Decimal might cause null (was: InlineTable column should be nullable when converted to Decimal and overflow) > InlineTable column should be nullable when converted to Decimal might cause > null > > > Key: SPARK-46448 > URL: https://issues.apache.org/jira/browse/SPARK-46448 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46448) InlineTable column should be nullable when converted to Decimal and overflow
Rui Wang created SPARK-46448: Summary: InlineTable column should be nullable when converted to Decimal and overflow Key: SPARK-46448 URL: https://issues.apache.org/jira/browse/SPARK-46448 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 4.0.0 Reporter: Rui Wang Assignee: Rui Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46446) Correctness bug in correlated subquery with OFFSET
[ https://issues.apache.org/jira/browse/SPARK-46446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46446: --- Labels: pull-request-available (was: ) > Correctness bug in correlated subquery with OFFSET > -- > > Key: SPARK-46446 > URL: https://issues.apache.org/jira/browse/SPARK-46446 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Jack Chen >Priority: Major > Labels: pull-request-available > > Subqueries with correlation under LIMIT with OFFSET have a correctness bug, > introduced recently when support for correlation under OFFSET was enabled but > were not handled correctly. (So we went from unsupported, query throws error > -> wrong results.) > It’s a bug in all types of correlated subqueries: scalar, lateral, IN, EXISTS > > It's easy to repro with a query like > {code:java} > create table x(x1 int, x2 int); > insert into x values (1, 1), (2, 2); > create table y(y1 int, y2 int); > insert into y values (1, 1), (1, 2), (2, 4); > select * from x where exists (select * from y where x1 = y1 limit 1 offset > 2){code} > Correct result: empty set, see postgres: > [https://www.db-fiddle.com/f/dtXNn7hwDnemiCTUhvwgYM/0] > Spark result: Array([2,2]) > > The > [PR|https://github.com/apache/spark/pull/43111/files/324a106611e6d62c31535cfc43863fdaa16e5dda#diff-583171e935b2dc349378063a5841c5b98b30a2d57ac3743a9eccfe7bffcb8f2aR1403] > where it was introduced added a test for it, but the golden file results for > the test actually were incorrect and we didn't notice. (The bug was initially > found by https://github.com/apache/spark/pull/44084) > I'll work on both: > * Adding support for offset in DecorrelateInnerQuery (the transformation is > into a filter on row_number window function, similar to limit). 
> * Adding a feature flag to enable/disable offset in subquery support -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
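The expected semantics in the repro above can be checked without a database: evaluating the EXISTS subquery per outer row, with OFFSET applied before LIMIT (as Postgres does), yields the empty set the ticket identifies as correct. The tables and query come from the repro; the per-row evaluation below is a plain-Python illustration, not Spark's implementation.

```python
# Tables from the repro in the ticket.
x = [(1, 1), (2, 2)]  # (x1, x2)
y = [(1, 1), (1, 2), (2, 4)]  # (y1, y2)

def exists_subquery(x1, limit=1, offset=2):
    # select * from y where y1 = x1 limit 1 offset 2,
    # evaluated per outer row: OFFSET skips rows before LIMIT takes them.
    matches = [row for row in y if row[0] == x1]
    return len(matches[offset:offset + limit]) > 0

# select * from x where exists (...)
result = [row for row in x if exists_subquery(row[0])]
print(result)  # []  -- no x row has a 3rd matching y row, so EXISTS is false
```

For x1=1 there are only two matching y rows and for x1=2 only one, so skipping two rows always leaves nothing; Spark's Array([2,2]) result indicates the OFFSET was not applied per outer row after decorrelation.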
[jira] [Updated] (SPARK-46446) Correctness bug in correlated subquery with OFFSET
[ https://issues.apache.org/jira/browse/SPARK-46446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jack Chen updated SPARK-46446: -- Description: Subqueries with correlation under LIMIT with OFFSET have a correctness bug, introduced recently when support for correlation under OFFSET was enabled but were not handled correctly. (So we went from unsupported, query throws error -> wrong results.) It’s a bug in all types of correlated subqueries: scalar, lateral, IN, EXISTS It's easy to repro with a query like {code:java} create table x(x1 int, x2 int); insert into x values (1, 1), (2, 2); create table y(y1 int, y2 int); insert into y values (1, 1), (1, 2), (2, 4); select * from x where exists (select * from y where x1 = y1 limit 1 offset 2){code} Correct result: empty set, see postgres: [https://www.db-fiddle.com/f/dtXNn7hwDnemiCTUhvwgYM/0] Spark result: Array([2,2]) The [PR|https://github.com/apache/spark/pull/43111/files/324a106611e6d62c31535cfc43863fdaa16e5dda#diff-583171e935b2dc349378063a5841c5b98b30a2d57ac3743a9eccfe7bffcb8f2aR1403] where it was introduced added a test for it, but the golden file results for the test actually were incorrect and we didn't notice. (The bug was initially found by https://github.com/apache/spark/pull/44084) I'll work on both: * Adding support for offset in DecorrelateInnerQuery (the transformation is into a filter on row_number window function, similar to limit). * Adding a feature flag to enable/disable offset in subquery support was: Subqueries with correlation under LIMIT with OFFSET have a correctness bug, introduced recently when support for correlation under OFFSET was enabled but were not handled correctly. (So we went from unsupported, query throws error -> wrong results.) 
It’s a bug in all types of correlated subqueries: scalar, lateral, IN, EXISTS It's easy to repro with a query like {code:java} create table x(x1 int, x2 int); insert into x values (1, 1), (2, 2); create table y(y1 int, y2 int); insert into y values (1, 1), (1, 2), (2, 4); select * from x where exists (select * from y where x1 = y1 limit 1 offset 2){code} Correct result: empty set, see postgres: [https://www.db-fiddle.com/f/dtXNn7hwDnemiCTUhvwgYM/0] Spark result: Array([2,2]) The [PR|https://github.com/apache/spark/pull/43111/files/324a106611e6d62c31535cfc43863fdaa16e5dda#diff-583171e935b2dc349378063a5841c5b98b30a2d57ac3743a9eccfe7bffcb8f2aR1403] where it was introduced added a test for it, but the golden file results for the test actually were incorrect and we didn't notice. I'll work on both: * Adding support for offset in DecorrelateInnerQuery (the transformation is into a filter on row_number window function, similar to limit). * Adding a feature flag to enable/disable offset in subquery support > Correctness bug in correlated subquery with OFFSET > -- > > Key: SPARK-46446 > URL: https://issues.apache.org/jira/browse/SPARK-46446 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Jack Chen >Priority: Major > > Subqueries with correlation under LIMIT with OFFSET have a correctness bug, > introduced recently when support for correlation under OFFSET was enabled but > were not handled correctly. (So we went from unsupported, query throws error > -> wrong results.) 
> It’s a bug in all types of correlated subqueries: scalar, lateral, IN, EXISTS > > It's easy to repro with a query like > {code:java} > create table x(x1 int, x2 int); > insert into x values (1, 1), (2, 2); > create table y(y1 int, y2 int); > insert into y values (1, 1), (1, 2), (2, 4); > select * from x where exists (select * from y where x1 = y1 limit 1 offset > 2){code} > Correct result: empty set, see postgres: > [https://www.db-fiddle.com/f/dtXNn7hwDnemiCTUhvwgYM/0] > Spark result: Array([2,2]) > > The > [PR|https://github.com/apache/spark/pull/43111/files/324a106611e6d62c31535cfc43863fdaa16e5dda#diff-583171e935b2dc349378063a5841c5b98b30a2d57ac3743a9eccfe7bffcb8f2aR1403] > where it was introduced added a test for it, but the golden file results for > the test actually were incorrect and we didn't notice. (The bug was initially > found by https://github.com/apache/spark/pull/44084) > I'll work on both: > * Adding support for offset in DecorrelateInnerQuery (the transformation is > into a filter on row_number window function, similar to limit). > * Adding a feature flag to enable/disable offset in subquery support -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46446) Correctness bug in correlated subquery with OFFSET
[ https://issues.apache.org/jira/browse/SPARK-46446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jack Chen updated SPARK-46446: -- Description: Subqueries with correlation under LIMIT with OFFSET have a correctness bug, introduced recently when support for correlation under OFFSET was enabled but were not handled correctly. (So we went from unsupported, query throws error -> wrong results.) It’s a bug in all types of correlated subqueries: scalar, lateral, IN, EXISTS It's easy to repro with a query like {code:java} create table x(x1 int, x2 int); insert into x values (1, 1), (2, 2); create table y(y1 int, y2 int); insert into y values (1, 1), (1, 2), (2, 4); select * from x where exists (select * from y where x1 = y1 limit 1 offset 2){code} Correct result: empty set, see postgres: [https://www.db-fiddle.com/f/dtXNn7hwDnemiCTUhvwgYM/0] Spark result: Array([2,2]) The [PR|https://github.com/apache/spark/pull/43111/files/324a106611e6d62c31535cfc43863fdaa16e5dda#diff-583171e935b2dc349378063a5841c5b98b30a2d57ac3743a9eccfe7bffcb8f2aR1403] where it was introduced added a test for it, but the golden file results for the test actually were incorrect and we didn't notice. I'll work on both: * Adding support for offset in DecorrelateInnerQuery (the transformation is into a filter on row_number window function, similar to limit). * Adding a feature flag to enable/disable offset in subquery support was: Subqueries with correlation under LIMIT with OFFSET have a correctness bug, introduced recently when support for correlation under OFFSET was enabled but were not handled correctly. (So we went from unsupported, query throws error -> wrong results.) 
It’s a bug in all types of correlated subqueries: scalar, lateral, IN, EXISTS It's easy to repro with a query like {code:java} SELECT * FROM emp join lateral (SELECT dept.dept_name FROM dept WHERE emp.dept_id = dept.dept_id LIMIT 5 OFFSET 3); {code} The [PR|https://github.com/apache/spark/pull/43111/files/324a106611e6d62c31535cfc43863fdaa16e5dda#diff-583171e935b2dc349378063a5841c5b98b30a2d57ac3743a9eccfe7bffcb8f2aR1403] where it was introduced added a test for it, but the golden file results for the test actually were incorrect and we didn't notice. I'll work on both: * Adding support for offset in DecorrelateInnerQuery (the transformation is into a filter on row_number window function, similar to limit). * Adding a feature flag to enable/disable offset in subquery support > Correctness bug in correlated subquery with OFFSET > -- > > Key: SPARK-46446 > URL: https://issues.apache.org/jira/browse/SPARK-46446 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Jack Chen >Priority: Major > > Subqueries with correlation under LIMIT with OFFSET have a correctness bug, > introduced recently when support for correlation under OFFSET was enabled but > were not handled correctly. (So we went from unsupported, query throws error > -> wrong results.) 
> It’s a bug in all types of correlated subqueries: scalar, lateral, IN, EXISTS > > It's easy to repro with a query like > {code:java} > create table x(x1 int, x2 int); > insert into x values (1, 1), (2, 2); > create table y(y1 int, y2 int); > insert into y values (1, 1), (1, 2), (2, 4); > select * from x where exists (select * from y where x1 = y1 limit 1 offset > 2){code} > Correct result: empty set, see postgres: > [https://www.db-fiddle.com/f/dtXNn7hwDnemiCTUhvwgYM/0] > Spark result: Array([2,2]) > > The > [PR|https://github.com/apache/spark/pull/43111/files/324a106611e6d62c31535cfc43863fdaa16e5dda#diff-583171e935b2dc349378063a5841c5b98b30a2d57ac3743a9eccfe7bffcb8f2aR1403] > where it was introduced added a test for it, but the golden file results for > the test actually were incorrect and we didn't notice. > I'll work on both: > * Adding support for offset in DecorrelateInnerQuery (the transformation is > into a filter on row_number window function, similar to limit). > * Adding a feature flag to enable/disable offset in subquery support -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33144) Cannot insert overwrite multiple partitions, get exception "get partition: Value for key name is null or empty"
[ https://issues.apache.org/jira/browse/SPARK-33144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17798232#comment-17798232 ] Wan Kun commented on SPARK-33144: - Speculative task also can cause this issue: Spark Driver Log 23/12/17 11:28:10 INFO TaskSetManager: Starting task 39.1 in stage 5.0 (TID 154) (node-013.com, executor 18, partition 39, PROCESS_LOCAL, 7317 bytes) taskResourceAssignments Map() 23/12/17 11:28:11 INFO TaskSetManager: Killing attempt 1 for task 39.1 in stage 5.0 (TID 154) on node-013.com as the attempt 0 succeeded on hdc42-mcc10-01-0210-6008-022.com 23/12/17 11:28:11 INFO DAGScheduler: ResultStage 5 (main at NativeMethodAccessorImpl.java:0) finished in 2.973 s 23/12/17 11:28:11 INFO DAGScheduler: Job 2 is finished. Cancelling potential speculative or zombie tasks for this job 23/12/17 11:28:11 INFO YarnClusterScheduler: Killing all running tasks in stage 5: Stage finished 23/12/17 11:28:11 INFO DAGScheduler: Job 2 finished: main at NativeMethodAccessorImpl.java:0, took 2.992123 s 23/12/17 11:28:11 INFO FileFormatWriter: Start to commit write Job 9797dcfc-fa17-47ce-907c-97932130f3d3. 23/12/17 11:28:11 INFO FileFormatWriter: Write Job 9797dcfc-fa17-47ce-907c-97932130f3d3 committed. Elapsed time: 56 ms. // // delete ["", "user", "hive", "warehouse",".hive-staging_hive_2023-12-17_11-27-52_763_730376237774235462-1","-ext-1","_temporary"] // and create ["", "user", "hive", "warehouse",".hive-staging_hive_2023-12-17_11-27-52_763_730376237774235462-1","-ext-1","_SUCCESS"] // 23/12/17 11:28:11 INFO FileFormatWriter: Finished processing stats for write job 9797dcfc-fa17-47ce-907c-97932130f3d3. 
23/12/17 11:28:12 WARN TaskSetManager: Lost task 39.1 in stage 5.0 (TID 154) (node-013.com executor 18): TaskKilled (Stage finished) 23/12/17 11:28:12 INFO TaskSetManager: task 39.1 in stage 5.0 (TID 154) failed, but the task will not be re-executed (either because the task failed with a shuffle data fetch failure, so the previous stage needs to be re-run, or because a different copy of the task has already succeeded). 23/12/17 11:28:12 INFO YarnClusterScheduler: Removed TaskSet 5.0, whose tasks have all completed, from pool ... 23/12/17 11:28:14 INFO Hive: New loading path = viewfs://nn/user/hive/warehouse/.hive-staging_hive_2023-12-17_11-27-52_763_730376237774235462-1/-ext-1/_temporary/0 with partSpec \{week_beg_dt=, ind=} ... 23/12/17 11:28:14 ERROR Hive: Exception when loading partition with parameters partPath=viewfs://nn/user/hive/warehouse/.hive-staging_hive_2023-12-17_11-27-52_763_730376237774235462-1/-ext-1/_temporary/0, table=gross_chrn_byr_srw, partSpec=\{week_beg_dt=, ind=}, replace=true, listBucketingEnabled=false, isAcid=false, hasFollowingStatsTask=false org.apache.hadoop.hive.ql.metadata.HiveException: get partition: Value for key week_beg_dt is null or empty at org.apache.hadoop.hive.ql.metadata.Hive.getPartition(Hive.java:2240) at org.apache.hadoop.hive.ql.metadata.Hive.getPartition(Hive.java:2188) at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1618) at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1929) at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1920) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) task 39.1 in stage 5.0 (TID 154) log: Although the task is killed, the parquet file is still created. 
2023-12-17 11:28:13 /10.216.139.145 create ["", "user", "hive", "warehouse",".hive-staging_hive_2023-12-17_11-27-52_763_730376237774235462-1","-ext-1","_temporary","0","_temporary","attempt_202312171128088986155311074942191_0005_m_39_154","week_beg_dt=2023-09-24","ind=T4","part-00039-cd9174aa-8010-41fa-8fbe-46d01d2b1efb.c000"] Delete subdirectory $OUTPUT_DIR/_temporary/0/xxx and caused directory $OUTPUT_DIR/_temporary/0 leak: 2023-12-17 11:28:13 /10.216.139.145 delete ["", "user", "hive", "warehouse",".hive-staging_hive_2023-12-17_11-27-52_763_730376237774235462-1","-ext-1","_temporary","0","_temporary","attempt_202312171128088986155311074942191_0005_m_39_154"] 23/12/17 11:28:12 INFO ParquetRecordWriterWrapper: creating real writer to write at viewfs://nn/user/hive/warehouse/.hive-staging_hive_2023-12-17_11-27-52_763_730376237774235462-1/-ext-1/_temporary/0/_temporary/attempt_202312171128088986155311074942191_0005_m_39_154/week_beg_dt=2023-09-24/ind=T4/part-00039-cd9174aa-8010-41fa-8fbe-46d01d2b1efb.c000 23/12/17 11:28:12 ERROR Utils: Aborting task org.apache.hadoop.hive.ql.metadata.HiveException: java.io.InterruptedIOException: Interrupted waiting to send RPC request to server at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:274) at
[jira] [Created] (SPARK-46446) Correctness bug in correlated subquery with OFFSET
Jack Chen created SPARK-46446: - Summary: Correctness bug in correlated subquery with OFFSET Key: SPARK-46446 URL: https://issues.apache.org/jira/browse/SPARK-46446 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 4.0.0 Reporter: Jack Chen Subqueries with correlation under LIMIT with OFFSET have a correctness bug, introduced recently when support for correlation under OFFSET was enabled but were not handled correctly. (So we went from unsupported, query throws error -> wrong results.) It’s a bug in all types of correlated subqueries: scalar, lateral, IN, EXISTS It's easy to repro with a query like {code:java} SELECT * FROM emp join lateral (SELECT dept.dept_name FROM dept WHERE emp.dept_id = dept.dept_id LIMIT 5 OFFSET 3); {code} The [PR|https://github.com/apache/spark/pull/43111/files/324a106611e6d62c31535cfc43863fdaa16e5dda#diff-583171e935b2dc349378063a5841c5b98b30a2d57ac3743a9eccfe7bffcb8f2aR1403] where it was introduced added a test for it, but the golden file results for the test actually were incorrect and we didn't notice. I'll work on both: * Adding support for offset in DecorrelateInnerQuery (the transformation is into a filter on row_number window function, similar to limit). * Adding a feature flag to enable/disable offset in subquery support -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43474) Add support to create DataFrame Reference in Spark connect
[ https://issues.apache.org/jira/browse/SPARK-43474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-43474: --- Labels: pull-request-available (was: ) > Add support to create DataFrame Reference in Spark connect > -- > > Key: SPARK-43474 > URL: https://issues.apache.org/jira/browse/SPARK-43474 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.5.0 >Reporter: Peng Zhong >Assignee: Raghu Angadi >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > Add support in Spark Connect to cache a DataFrame on server side. From client > side, it can create a reference to that DataFrame given the cache key. > > This function will be used in streaming foreachBatch(). Server needs to call > user function for every batch which takes a DataFrame as argument. With the > new function, we can just cache the DataFrame on the server. Pass the id back > to client which can creates the DataFrame reference. The server will replace > the reference when transforming. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37377) SPJ: Initial implementation of Storage-Partitioned Join
[ https://issues.apache.org/jira/browse/SPARK-37377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-37377: --- Labels: pull-request-available (was: ) > SPJ: Initial implementation of Storage-Partitioned Join > --- > > Key: SPARK-37377 > URL: https://issues.apache.org/jira/browse/SPARK-37377 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Labels: pull-request-available > Fix For: 3.3.0 > > > This Jira tracks the initial implementation of storage-partitioned join. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40295) Allow v2 functions with literal args in write distribution and ordering
[ https://issues.apache.org/jira/browse/SPARK-40295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-40295: --- Labels: pull-request-available (was: ) > Allow v2 functions with literal args in write distribution and ordering > --- > > Key: SPARK-40295 > URL: https://issues.apache.org/jira/browse/SPARK-40295 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Anton Okolnychyi >Assignee: Anton Okolnychyi >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > Spark should allow v2 function with literal args in write distribution and > ordering. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46445) V2ExpressionUtils.toCatalystTransformOpt throws when resolving the bucket/sorted_bucket functions
Никита Соколов created SPARK-46445: -- Summary: V2ExpressionUtils.toCatalystTransformOpt throws when resolving the bucket/sorted_bucket functions Key: SPARK-46445 URL: https://issues.apache.org/jira/browse/SPARK-46445 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.5.0 Reporter: Никита Соколов I am trying to build a V2 data-source exposing its partitioning+bucketing properties so it would be possible to execute partition-wise joins. At the moment it is impossible to let the Spark engine transform a DataSourceV2Relation into a DataSourceV2ScanRelation because of this error: {code:java} org.apache.spark.sql.AnalysisException: [REQUIRES_SINGLE_PART_NAMESPACE] spark_catalog requires a single-part namespace, but got . at org.apache.spark.sql.errors.QueryCompilationErrors$.requiresSinglePartNamespaceError(QueryCompilationErrors.scala:1336) at org.apache.spark.sql.execution.datasources.v2.V2SessionCatalog$TableIdentifierHelper.asFunctionIdentifier(V2SessionCatalog.scala:254) at org.apache.spark.sql.execution.datasources.v2.V2SessionCatalog.loadFunction(V2SessionCatalog.scala:351) at org.apache.spark.sql.catalyst.expressions.V2ExpressionUtils$.loadV2FunctionOpt(V2ExpressionUtils.scala:128) at org.apache.spark.sql.catalyst.expressions.V2ExpressionUtils$.$anonfun$toCatalystTransformOpt$6(V2ExpressionUtils.scala:114) at scala.Option.flatMap(Option.scala:271) at org.apache.spark.sql.catalyst.expressions.V2ExpressionUtils$.toCatalystTransformOpt(V2ExpressionUtils.scala:113) at org.apache.spark.sql.catalyst.expressions.V2ExpressionUtils$.toCatalystOpt(V2ExpressionUtils.scala:82) at org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioningAndOrdering$$anonfun$partitioning$1.$anonfun$applyOrElse$1(V2ScanPartitioningAndOrdering.scala:47) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at 
scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at scala.collection.TraversableLike.map(TraversableLike.scala:286) at scala.collection.TraversableLike.map$(TraversableLike.scala:279) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198) at org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioningAndOrdering$$anonfun$partitioning$1.applyOrElse(V2ScanPartitioningAndOrdering.scala:47) at org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioningAndOrdering$$anonfun$partitioning$1.applyOrElse(V2ScanPartitioningAndOrdering.scala:42) {code} It looks like it is impossible for the library code to succeed: here is the place where it tries to load the bucket function –[https://github.com/apache/spark/blob/e359318c4493e16a7546d70c9340ffc5015aacff/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/V2ExpressionUtils.scala#L108] here is an Identifier with no namespace being constructed – [https://github.com/apache/spark/blob/bfafad4d47b4f60e93d17ccc3a8dcc8bae03cf9a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/V2ExpressionUtils.scala#L132] here is the code trying to transform it to a FunctionIdentifier – [https://github.com/apache/spark/blob/bfafad4d47b4f60e93d17ccc3a8dcc8bae03cf9a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalog.scala#L437] but this is impossible for an Identifier with no namespace – [https://github.com/apache/spark/blob/bfafad4d47b4f60e93d17ccc3a8dcc8bae03cf9a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalog.scala#L340] I have managed to somehow bypass this by constructing the DataSourceV2ScanRelation myself, but I guess this is not an intended use and this forces me to push the filter-predicates down myself. 
[jira] [Updated] (SPARK-46444) V2SessionCatalog#createTable should not load the table
[ https://issues.apache.org/jira/browse/SPARK-46444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46444: --- Labels: pull-request-available (was: ) > V2SessionCatalog#createTable should not load the table > -- > > Key: SPARK-46444 > URL: https://issues.apache.org/jira/browse/SPARK-46444 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Wenchen Fan >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46444) V2SessionCatalog#createTable should not load the table
Wenchen Fan created SPARK-46444: --- Summary: V2SessionCatalog#createTable should not load the table Key: SPARK-46444 URL: https://issues.apache.org/jira/browse/SPARK-46444 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Wenchen Fan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46443) Decimal precision and scale should be decided by JDBC dialect.
[ https://issues.apache.org/jira/browse/SPARK-46443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiaan Geng updated SPARK-46443: --- Summary: Decimal precision and scale should be decided by JDBC dialect. (was: Ensure Decimal precision and scale should be decided by JDBC dialect.) > Decimal precision and scale should be decided by JDBC dialect. > --- > > Key: SPARK-46443 > URL: https://issues.apache.org/jira/browse/SPARK-46443 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Jiaan Geng >Assignee: Jiaan Geng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45525) Initial support for Python data source write API
[ https://issues.apache.org/jira/browse/SPARK-45525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-45525: -- Assignee: (was: Apache Spark) > Initial support for Python data source write API > > > Key: SPARK-45525 > URL: https://issues.apache.org/jira/browse/SPARK-45525 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > Labels: pull-request-available > > Add a new command and logical rules (similar to V1Writes and V2Writes) to > support Python data source write. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45525) Initial support for Python data source write API
[ https://issues.apache.org/jira/browse/SPARK-45525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-45525: -- Assignee: Apache Spark > Initial support for Python data source write API > > > Key: SPARK-45525 > URL: https://issues.apache.org/jira/browse/SPARK-45525 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > > Add a new command and logical rules (similar to V1Writes and V2Writes) to > support Python data source write. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40876) Spark's Vectorized ParquetReader should support type promotions
[ https://issues.apache.org/jira/browse/SPARK-40876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot reassigned SPARK-40876:
--------------------------------------

    Assignee:     (was: Apache Spark)

> Spark's Vectorized ParquetReader should support type promotions
> ---------------------------------------------------------------
>
>                 Key: SPARK-40876
>                 URL: https://issues.apache.org/jira/browse/SPARK-40876
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>    Affects Versions: 3.3.0
>            Reporter: Alexey Kudinkin
>            Priority: Major
>              Labels: pull-request-available
>
> Currently, when reading a Parquet table using Spark's `VectorizedColumnReader`,
> we hit an issue when the requested (projection) schema widens one of the
> fields' types from int32 to long.
> The expectation is that, since this is a legitimate primitive type promotion,
> we should be able to read Ints into Longs without problems (Avro, for example,
> handles this fine).
> However, the `ParquetVectorUpdaterFactory.getUpdater` method fails with the
> exception listed below.
> Looking at the code, it actually seems to allow the opposite: it permits
> "down-sizing" Int32s persisted in Parquet to be read as Bytes or Shorts, for
> example. The rationale for this behavior is unclear, and it looks like a bug,
> since it can lead to data truncation:
> {code:java}
> case INT32:
>   if (sparkType == DataTypes.IntegerType || canReadAsIntDecimal(descriptor, sparkType)) {
>     return new IntegerUpdater();
>   } else if (sparkType == DataTypes.LongType && isUnsignedIntTypeMatched(32)) {
>     // In `ParquetToSparkSchemaConverter`, we map parquet UINT32 to our LongType.
>     // For unsigned int32, it stores as plain signed int32 in Parquet when dictionary
>     // fallbacks. We read them as long values.
>     return new UnsignedIntegerUpdater();
>   } else if (sparkType == DataTypes.ByteType) {
>     return new ByteUpdater();
>   } else if (sparkType == DataTypes.ShortType) {
>     return new ShortUpdater();
>   } else if (sparkType == DataTypes.DateType) {
>     if ("CORRECTED".equals(datetimeRebaseMode)) {
>       return new IntegerUpdater();
>     } else {
>       boolean failIfRebase = "EXCEPTION".equals(datetimeRebaseMode);
>       return new IntegerWithRebaseUpdater(failIfRebase);
>     }
>   }
>   break; {code}
> Exception:
> {code:java}
> at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2454)
> at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2403)
> at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2402)
> at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
> at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2402)
> at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1160)
> at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1160)
> at scala.Option.foreach(Option.scala:407)
> at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1160)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2642)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2584)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2573)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2214)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2235)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2254)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2279)
> at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
> at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
> at org.apache.spark.RangePartitioner$.sketch(Partitioner.scala:304)
> at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:171)
> at
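The promotion the reporter asks for is lossless, and the missing branch would amount to an int32-to-long widening updater. Below is a minimal stand-alone sketch of that idea (plain Java, no Spark dependencies); `IntReader`, `readAsLongs`, and the imagined "IntegerToLongUpdater" are illustrative names, not Spark's actual API:

```java
// Hypothetical sketch of the int32 -> long widening path, modeled on the
// ParquetVectorUpdater pattern quoted in the issue. Names are illustrative.
import java.util.Arrays;

class WideningSketch {
  // Stand-in for a decoder positioned on a Parquet page of int32 values.
  interface IntReader {
    int readInteger();
  }

  // What a hypothetical "IntegerToLongUpdater" would do: decode each int32
  // and store it in a long slot. The implicit int -> long promotion is
  // lossless, unlike the byte/short down-sizing branches the issue questions.
  static long[] readAsLongs(int total, IntReader reader) {
    long[] out = new long[total];
    for (int i = 0; i < total; i++) {
      out[i] = reader.readInteger();
    }
    return out;
  }

  public static void main(String[] args) {
    int[] page = {1, -2, Integer.MAX_VALUE};
    int[] pos = {0};
    IntReader reader = () -> page[pos[0]++];
    // prints [1, -2, 2147483647]
    System.out.println(Arrays.toString(readAsLongs(page.length, reader)));
  }
}
```

Note that even `Integer.MAX_VALUE` survives the widening intact, which is why this direction is safe while reading int32 into Byte or Short is not.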
[jira] [Resolved] (SPARK-46440) Set the rebase configs to the CORRECTED mode by default
[ https://issues.apache.org/jira/browse/SPARK-46440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Max Gekk resolved SPARK-46440.
------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 44388
[https://github.com/apache/spark/pull/44388]

> Set the rebase configs to the CORRECTED mode by default
> -------------------------------------------------------
>
>                 Key: SPARK-46440
>                 URL: https://issues.apache.org/jira/browse/SPARK-46440
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Max Gekk
>            Assignee: Max Gekk
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
> Set all rebase-related SQL configs to the `CORRECTED` mode by default. The
> affected configs are:
> - spark.sql.parquet.int96RebaseModeInWrite
> - spark.sql.parquet.datetimeRebaseModeInWrite
> - spark.sql.parquet.int96RebaseModeInRead
> - spark.sql.parquet.datetimeRebaseModeInRead
> - spark.sql.avro.datetimeRebaseModeInWrite
> - spark.sql.avro.datetimeRebaseModeInRead
> The configs were set to the `EXCEPTION` mode to let users select a proper mode
> for compatibility with old Spark versions <= 2.4.5, which cannot detect the
> rebase mode from the metadata in Parquet and Avro files. Since those versions
> are no longer in broad use, Spark will, starting from version 4.0.0, write and
> read ancient datetimes without rebasing and without raising exceptions. This
> should be more convenient for users.
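The practical difference between the three mode values can be sketched as a small dispatch. This stand-alone snippet only models the meaning of each setting; the `describe` helper is hypothetical, and only the mode names (`EXCEPTION`, `CORRECTED`, `LEGACY`) come from the SQL configs listed above:

```java
// Hypothetical sketch of what the rebase-mode config values mean for ancient
// datetimes. describe() is illustrative, not Spark code.
class RebaseModeSketch {
  static String describe(String mode) {
    switch (mode) {
      case "CORRECTED":
        // New 4.0.0 default: values are read/written as-is, no calendar rebase.
        return "no rebase";
      case "LEGACY":
        // Rebase between the hybrid Julian and Proleptic Gregorian calendars.
        return "rebase";
      case "EXCEPTION":
        // Previous default: fail when an ancient datetime is encountered.
        return "fail on ancient datetimes";
      default:
        throw new IllegalArgumentException("unknown rebase mode: " + mode);
    }
  }

  public static void main(String[] args) {
    // prints: no rebase
    System.out.println(describe("CORRECTED"));
  }
}
```

In this framing, the change in SPARK-46440 simply moves the default from the failing branch to the as-is branch.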