[jira] [Updated] (SPARK-46453) SessionHolder doesn't throw exceptions from internalError()
[ https://issues.apache.org/jira/browse/SPARK-46453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46453: --- Labels: pull-request-available (was: ) > SessionHolder doesn't throw exceptions from internalError() > --- > > Key: SPARK-46453 > URL: https://issues.apache.org/jira/browse/SPARK-46453 > Project: Spark > Issue Type: Bug > Components: Connect, SQL >Affects Versions: 3.5.0, 4.0.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Labels: pull-request-available > > Need to throw the SparkException returned by internalError() in SessionHolder; otherwise users won't see the internal error. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46453) SessionHolder doesn't throw exceptions from internalError()
Max Gekk created SPARK-46453: Summary: SessionHolder doesn't throw exceptions from internalError() Key: SPARK-46453 URL: https://issues.apache.org/jira/browse/SPARK-46453 Project: Spark Issue Type: Bug Components: Connect, SQL Affects Versions: 3.5.0, 4.0.0 Reporter: Max Gekk Assignee: Max Gekk Need to throw the SparkException returned by internalError() in SessionHolder; otherwise users won't see the internal error.
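The bug class described here can be illustrated with a hypothetical plain-Python sketch (not Spark code; all names below are assumptions for illustration): an error factory that *returns* an exception object does nothing unless the caller actually raises the returned value.

```python
class SparkException(Exception):
    """Stand-in for org.apache.spark.SparkException (simplified for illustration)."""


def internal_error(msg: str) -> SparkException:
    # Like the internalError() helper described in the issue, this
    # *returns* an exception rather than raising it, so the caller
    # is responsible for raising the result.
    return SparkException(f"[INTERNAL_ERROR] {msg}")


def buggy_lookup(sessions: dict, key: str):
    if key not in sessions:
        internal_error(f"Session {key} not found")  # bug: return value dropped, falls through silently
    return sessions.get(key)


def fixed_lookup(sessions: dict, key: str):
    if key not in sessions:
        raise internal_error(f"Session {key} not found")  # fix: raise the returned exception
    return sessions[key]
```

With the buggy version the lookup silently returns `None`; with the fix the caller sees the internal error, which matches the behavior the issue asks for.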
[jira] [Updated] (SPARK-46452) Add a new API in DSv2 DataWriter to write an iterator of records
[ https://issues.apache.org/jira/browse/SPARK-46452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46452: --- Labels: pull-request-available (was: ) > Add a new API in DSv2 DataWriter to write an iterator of records > > > Key: SPARK-46452 > URL: https://issues.apache.org/jira/browse/SPARK-46452 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > Labels: pull-request-available > > Add a new API that takes an iterator of records.
[jira] [Created] (SPARK-46452) Add a new API in DSv2 DataWriter to write an iterator of records
Allison Wang created SPARK-46452: Summary: Add a new API in DSv2 DataWriter to write an iterator of records Key: SPARK-46452 URL: https://issues.apache.org/jira/browse/SPARK-46452 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Allison Wang Add a new API that takes an iterator of records.
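The shape of such an API can be sketched in plain Python (a sketch only; the real DSv2 `DataWriter` is a Java/Scala interface and the method names here are assumptions): an iterator-based entry point can default to delegating to the per-record method, so implementations only override it when they have a bulk-optimized path.

```python
from typing import Iterator


class DataWriter:
    """Hypothetical Python analogue of a DSv2 DataWriter (names are assumptions)."""

    def __init__(self):
        self.rows = []

    def write(self, record) -> None:
        # Existing per-record entry point.
        self.rows.append(record)

    def write_all(self, records: Iterator) -> None:
        # Sketch of the proposed iterator-based entry point: the default
        # simply loops over the per-record method, so existing writers
        # keep working while bulk writers can override this.
        for record in records:
            self.write(record)
```

A subclass could override `write_all` to, say, buffer and flush records in batches without changing callers.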
[jira] [Updated] (SPARK-46451) Reorganize `GroupbyStatTests`
[ https://issues.apache.org/jira/browse/SPARK-46451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46451: --- Labels: pull-request-available (was: ) > Reorganize `GroupbyStatTests` > - > > Key: SPARK-46451 > URL: https://issues.apache.org/jira/browse/SPARK-46451 > Project: Spark > Issue Type: Test > Components: PS, Tests >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Minor > Labels: pull-request-available >
[jira] [Created] (SPARK-46451) Reorganize `GroupbyStatTests`
Ruifeng Zheng created SPARK-46451: - Summary: Reorganize `GroupbyStatTests` Key: SPARK-46451 URL: https://issues.apache.org/jira/browse/SPARK-46451 Project: Spark Issue Type: Test Components: PS, Tests Affects Versions: 4.0.0 Reporter: Ruifeng Zheng
[jira] [Resolved] (SPARK-46446) Correctness bug in correlated subquery with OFFSET
[ https://issues.apache.org/jira/browse/SPARK-46446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-46446. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44401 [https://github.com/apache/spark/pull/44401]

> Correctness bug in correlated subquery with OFFSET
> --------------------------------------------------
>
> Key: SPARK-46446
> URL: https://issues.apache.org/jira/browse/SPARK-46446
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Jack Chen
> Assignee: Jack Chen
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Subqueries with correlation under LIMIT with OFFSET have a correctness bug, introduced recently when support for correlation under OFFSET was enabled but OFFSET was not handled correctly. (So we went from unsupported, query throws an error -> wrong results.)
> It's a bug in all types of correlated subqueries: scalar, lateral, IN, and EXISTS.
> It's easy to reproduce with a query like:
> {code:java}
> create table x(x1 int, x2 int);
> insert into x values (1, 1), (2, 2);
> create table y(y1 int, y2 int);
> insert into y values (1, 1), (1, 2), (2, 4);
> select * from x where exists (select * from y where x1 = y1 limit 1 offset 2){code}
> Correct result: empty set, see postgres: [https://www.db-fiddle.com/f/dtXNn7hwDnemiCTUhvwgYM/0]
> Spark result: Array([2,2])
> The [PR|https://github.com/apache/spark/pull/43111/files/324a106611e6d62c31535cfc43863fdaa16e5dda#diff-583171e935b2dc349378063a5841c5b98b30a2d57ac3743a9eccfe7bffcb8f2aR1403] where it was introduced added a test for it, but the golden file results for the test were actually incorrect and we didn't notice. (The bug was initially found by https://github.com/apache/spark/pull/44084)
> I'll work on both:
> * Adding support for offset in DecorrelateInnerQuery (the transformation is into a filter on a row_number window function, similar to limit).
> * Adding a feature flag to enable/disable offset-in-subquery support.
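The expected semantics of the repro can be checked with a plain-Python model (a sketch of the SQL semantics, not Spark code): the correlated EXISTS subquery filters `y` per outer row, then OFFSET skips rows before LIMIT takes any, so no outer row should survive.

```python
# Tables from the repro query.
x = [(1, 1), (2, 2)]
y = [(1, 1), (1, 2), (2, 4)]


def exists_with_limit_offset(x1: int, limit: int, offset: int) -> bool:
    # Correlated subquery: rows of y matching this outer row,
    # then OFFSET skips `offset` rows before LIMIT takes `limit` rows.
    matches = [row for row in y if row[0] == x1]
    return len(matches[offset:offset + limit]) > 0


# select * from x where exists (select * from y where x1 = y1 limit 1 offset 2)
result = [row for row in x if exists_with_limit_offset(row[0], limit=1, offset=2)]
# No outer row has a third matching y row, so the result is the empty set,
# matching the PostgreSQL behavior cited in the report (not Spark's Array([2,2])).
```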
[jira] [Assigned] (SPARK-46446) Correctness bug in correlated subquery with OFFSET
[ https://issues.apache.org/jira/browse/SPARK-46446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-46446: --- Assignee: Jack Chen

> Correctness bug in correlated subquery with OFFSET
> Key: SPARK-46446
> URL: https://issues.apache.org/jira/browse/SPARK-46446
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 4.0.0
> Reporter: Jack Chen
> Assignee: Jack Chen
> Priority: Major
> Labels: pull-request-available
[jira] [Assigned] (SPARK-46308) Forbid recursive error handling by adding recursion guards
[ https://issues.apache.org/jira/browse/SPARK-46308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-46308: Assignee: Alice Sayutina > Forbid recursive error handling by adding recursion guards > -- > > Key: SPARK-46308 > URL: https://issues.apache.org/jira/browse/SPARK-46308 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 4.0.0 >Reporter: Alice Sayutina >Assignee: Alice Sayutina >Priority: Major > Labels: pull-request-available >
[jira] [Resolved] (SPARK-46308) Forbid recursive error handling by adding recursion guards
[ https://issues.apache.org/jira/browse/SPARK-46308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-46308. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44210 [https://github.com/apache/spark/pull/44210] > Forbid recursive error handling by adding recursion guards > -- > > Key: SPARK-46308 > URL: https://issues.apache.org/jira/browse/SPARK-46308 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 4.0.0 >Reporter: Alice Sayutina >Assignee: Alice Sayutina >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 >
[jira] [Commented] (SPARK-46450) session_window doesn't identify sessions with provided gap when used as a window function
[ https://issues.apache.org/jira/browse/SPARK-46450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17798381#comment-17798381 ] Jungtaek Lim commented on SPARK-46450: -- It's a missing piece, and maybe we will have to document it: session windows only work properly with batch/streaming aggregation. If you use session_window as a normal function without feeding the value into an aggregation, session merging is never triggered.

> session_window doesn't identify sessions with provided gap when used as a window function
> Key: SPARK-46450
> URL: https://issues.apache.org/jira/browse/SPARK-46450
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.4.1, 3.5.0
> Reporter: Juan Pumarino
> Priority: Minor
[jira] [Comment Edited] (SPARK-46450) session_window doesn't identify sessions with provided gap when used as a window function
[ https://issues.apache.org/jira/browse/SPARK-46450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17798381#comment-17798381 ] Jungtaek Lim edited comment on SPARK-46450 at 12/18/23 11:37 PM: - It's a missing piece, and maybe we will have to document it: session windows only work properly with batch/streaming aggregation. (Because merging sessions requires custom logic.) If you use session_window as a normal function without feeding the value into an aggregation, session merging is never triggered.

was (Author: kabhwan): It's a missing one and maybe we will have to document - session window is only working properly with batch/streaming aggregation. If you use it as normal function and not ingesting the value to aggregation, merging sessions is never triggered.

> session_window doesn't identify sessions with provided gap when used as a window function
> Key: SPARK-46450
> URL: https://issues.apache.org/jira/browse/SPARK-46450
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.4.1, 3.5.0
> Reporter: Juan Pumarino
> Priority: Minor
[jira] [Created] (SPARK-46450) session_window doesn't identify sessions with provided gap when used as a window function
Juan Pumarino created SPARK-46450: - Summary: session_window doesn't identify sessions with provided gap when used as a window function Key: SPARK-46450 URL: https://issues.apache.org/jira/browse/SPARK-46450 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.5.0, 3.4.1 Reporter: Juan Pumarino

{{PARTITION BY session_window}} doesn't produce the expected results. Here's an example:

{code:sql}
SELECT
  id,
  ts,
  collect_list(id) OVER (PARTITION BY session_window(ts, '1 hour')) as window_ids
FROM VALUES
  (1, "2023-12-11 01:10"),
  (2, "2023-12-11 01:15"),
  (3, "2023-12-11 01:40"),
  (4, "2023-12-11 02:05"),
  (5, "2023-12-11 03:15"),
  (6, "2023-12-11 03:20"),
  (7, "2023-12-11 04:10"),
  (8, "2023-12-11 05:05")
AS tab(id, ts)
{code}

Actual result:

{code:java}
+---+----------------+----------+
|id |ts              |window_ids|
+---+----------------+----------+
|1  |2023-12-11 01:10|[1]       |
|2  |2023-12-11 01:15|[2]       |
|3  |2023-12-11 01:40|[3]       |
|4  |2023-12-11 02:05|[4]       |
|5  |2023-12-11 03:15|[5]       |
|6  |2023-12-11 03:20|[6]       |
|7  |2023-12-11 04:10|[7]       |
|8  |2023-12-11 05:05|[8]       |
+---+----------------+----------+
{code}

Expected result, assigning rows to two sessions with a 1-hour gap:

{code:java}
+---+----------------+------------+
|id |ts              |window_ids  |
+---+----------------+------------+
|1  |2023-12-11 01:10|[1, 2, 3, 4]|
|2  |2023-12-11 01:15|[1, 2, 3, 4]|
|3  |2023-12-11 01:40|[1, 2, 3, 4]|
|4  |2023-12-11 02:05|[1, 2, 3, 4]|
|5  |2023-12-11 03:15|[5, 6, 7, 8]|
|6  |2023-12-11 03:20|[5, 6, 7, 8]|
|7  |2023-12-11 04:10|[5, 6, 7, 8]|
|8  |2023-12-11 05:05|[5, 6, 7, 8]|
+---+----------------+------------+
{code}

I compared its behavior with the results as a grouping function and with how {{window()}} behaves in both cases, which seems to confirm that the result is inconsistent. Here are the other examples:

*{{group by window()}}*

{code:sql}
SELECT
  collect_list(id) AS ids,
  collect_list(ts) AS tss,
  window
FROM VALUES
  (1, "2023-12-11 01:10"),
  (2, "2023-12-11 01:15"),
  (3, "2023-12-11 01:40"),
  (4, "2023-12-11 02:05"),
  (5, "2023-12-11 03:15"),
  (6, "2023-12-11 03:20"),
  (7, "2023-12-11 04:10"),
  (8, "2023-12-11 05:05")
AS tab(id, ts)
GROUP by window(ts, '1 hour')
{code}

Correctly assigns rows to 1-hour windows:

{code:java}
+---------+-------------------------------------------------------+-------------------------------------------+
|ids      |tss                                                    |window                                     |
+---------+-------------------------------------------------------+-------------------------------------------+
|[1, 2, 3]|[2023-12-11 01:10, 2023-12-11 01:15, 2023-12-11 01:40] |{2023-12-11 01:00:00, 2023-12-11 02:00:00} |
|[4]      |[2023-12-11 02:05]                                     |{2023-12-11 02:00:00, 2023-12-11 03:00:00} |
|[5, 6]   |[2023-12-11 03:15, 2023-12-11 03:20]                   |{2023-12-11 03:00:00, 2023-12-11 04:00:00} |
|[7]      |[2023-12-11 04:10]                                     |{2023-12-11 04:00:00, 2023-12-11 05:00:00} |
|[8]      |[2023-12-11 05:05]                                     |{2023-12-11 05:00:00, 2023-12-11 06:00:00} |
+---------+-------------------------------------------------------+-------------------------------------------+
{code}

*{{group by session_window()}}*

{code:sql}
SELECT
  collect_list(id) AS ids,
  collect_list(ts) AS tss,
  session_window
FROM VALUES
  (1, "2023-12-11 01:10"),
  (2, "2023-12-11 01:15"),
  (3, "2023-12-11 01:40"),
  (4, "2023-12-11 02:05"),
  (5, "2023-12-11 03:15"),
  (6, "2023-12-11 03:20"),
  (7, "2023-12-11 04:10"),
  (8, "2023-12-11 05:05")
AS tab(id, ts)
GROUP by session_window(ts, '1 hour')
{code}

Correctly assigns rows to two sessions with a 1-hour gap:

{code:java}
+------------+-------------------------------------------------------------------------+-------------------------------------------+
|ids         |tss                                                                      |session_window                             |
+------------+-------------------------------------------------------------------------+-------------------------------------------+
|[1, 2, 3, 4]|[2023-12-11 01:10, 2023-12-11 01:15, 2023-12-11 01:40, 2023-12-11 02:05] |{2023-12-11 01:10:00, 2023-12-11 03:05:00} |
|[5, 6, 7, 8]|[2023-12-11 03:15, 2023-12-11 03:20, 2023-12-11 04:10, 2023-12-11 05:05] |{2023-12-11 03:15:00, 2023-12-11 06:05:00} |
+------------+-------------------------------------------------------------------------+-------------------------------------------+
{code}

*{{partition by window()}}*

{code:sql}
SELECT
  id,
  ts,
  collect_list(id) OVER (PARTITION BY window(ts, '1 hour')) as window_ids
FROM VALUES
  (1, "2023-12-11 01:10"),
  (2, "2023-12-11 01:15"),
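The session-merging step that the comment above says only runs inside an aggregation can be modeled in plain Python (a sketch of gap-based session assignment, not Spark code): sort by timestamp and start a new session whenever the gap since the previous event reaches the configured gap, which yields the two expected sessions for this data.

```python
from datetime import datetime, timedelta

# (id, ts) rows from the repro query.
rows = [
    (1, "2023-12-11 01:10"), (2, "2023-12-11 01:15"),
    (3, "2023-12-11 01:40"), (4, "2023-12-11 02:05"),
    (5, "2023-12-11 03:15"), (6, "2023-12-11 03:20"),
    (7, "2023-12-11 04:10"), (8, "2023-12-11 05:05"),
]


def session_windows(rows, gap=timedelta(hours=1)):
    # Sort events by timestamp, then start a new session whenever the
    # time since the previous event is >= the gap. This is the merge
    # step that Spark only triggers inside a (batch/streaming) aggregation.
    events = sorted((datetime.strptime(ts, "%Y-%m-%d %H:%M"), rid) for rid, ts in rows)
    sessions, current = [], [events[0][1]]
    for (prev_ts, _), (cur_ts, rid) in zip(events, events[1:]):
        if cur_ts - prev_ts >= gap:
            sessions.append(current)
            current = []
        current.append(rid)
    sessions.append(current)
    return sessions
```

Running this over the repro data produces the two sessions `[1, 2, 3, 4]` and `[5, 6, 7, 8]` that the reporter expected from `PARTITION BY session_window`.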
[jira] [Resolved] (SPARK-46411) Change to use bcprov/bcpkix-jdk18on for test
[ https://issues.apache.org/jira/browse/SPARK-46411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46411. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44359 [https://github.com/apache/spark/pull/44359] > Change to use bcprov/bcpkix-jdk18on for test > > > Key: SPARK-46411 > URL: https://issues.apache.org/jira/browse/SPARK-46411 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 >
[jira] [Assigned] (SPARK-46411) Change to use bcprov/bcpkix-jdk18on for test
[ https://issues.apache.org/jira/browse/SPARK-46411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-46411: - Assignee: Yang Jie > Change to use bcprov/bcpkix-jdk18on for test > > > Key: SPARK-46411 > URL: https://issues.apache.org/jira/browse/SPARK-46411 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available >
[jira] [Updated] (SPARK-46447) Remove the legacy datetime rebasing SQL configs
[ https://issues.apache.org/jira/browse/SPARK-46447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46447: --- Labels: pull-request-available (was: ) > Remove the legacy datetime rebasing SQL configs > --- > > Key: SPARK-46447 > URL: https://issues.apache.org/jira/browse/SPARK-46447 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Minor > Labels: pull-request-available > > Remove already deprecated SQL configs (alternatives to other configs): > - spark.sql.legacy.parquet.int96RebaseModeInWrite > - spark.sql.legacy.parquet.datetimeRebaseModeInWrite > - spark.sql.legacy.parquet.int96RebaseModeInRead > - spark.sql.legacy.parquet.datetimeRebaseModeInRead > - spark.sql.legacy.avro.datetimeRebaseModeInWrite > - spark.sql.legacy.avro.datetimeRebaseModeInRead
[jira] [Resolved] (SPARK-46439) Move IO-related tests to `pyspark.pandas.tests.io.*`
[ https://issues.apache.org/jira/browse/SPARK-46439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46439. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44395 [https://github.com/apache/spark/pull/44395] > Move IO-related tests to `pyspark.pandas.tests.io.*` > > > Key: SPARK-46439 > URL: https://issues.apache.org/jira/browse/SPARK-46439 > Project: Spark > Issue Type: Test > Components: PS, Tests >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 >
[jira] [Resolved] (SPARK-46420) Remove unused transport from SparkSQLCLIDriver
[ https://issues.apache.org/jira/browse/SPARK-46420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46420. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44370 [https://github.com/apache/spark/pull/44370] > Remove unused transport from SparkSQLCLIDriver > -- > > Key: SPARK-46420 > URL: https://issues.apache.org/jira/browse/SPARK-46420 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Cheng Pan >Assignee: Cheng Pan >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 >
[jira] [Assigned] (SPARK-46289) Exception when ordering by UDT in interpreted mode
[ https://issues.apache.org/jira/browse/SPARK-46289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-46289: - Assignee: Bruce Robbins

> Exception when ordering by UDT in interpreted mode
> --------------------------------------------------
>
> Key: SPARK-46289
> URL: https://issues.apache.org/jira/browse/SPARK-46289
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.3.3, 3.4.2, 3.5.0
> Reporter: Bruce Robbins
> Assignee: Bruce Robbins
> Priority: Minor
> Labels: pull-request-available
>
> In interpreted mode, ordering by a UDT will result in an exception. For example:
> {noformat}
> import org.apache.spark.ml.linalg.{DenseVector, Vector}
> val df = Seq.tabulate(30) { x =>
>   (x, x + 1, x + 2, new DenseVector(Array((x/100.0).toDouble, ((x + 1)/100.0).toDouble, ((x + 3)/100.0).toDouble)))
> }.toDF("id", "c1", "c2", "c3")
> df.createOrReplaceTempView("df")
>
> // this works
> sql("select * from df order by c3").collect
>
> sql("set spark.sql.codegen.wholeStage=false")
> sql("set spark.sql.codegen.factoryMode=NO_CODEGEN")
>
> // this gets an error
> sql("select * from df order by c3").collect
> {noformat}
> The second {{collect}} action results in the following exception:
> {noformat}
> org.apache.spark.SparkIllegalArgumentException: Type UninitializedPhysicalType does not support ordered operations.
>   at org.apache.spark.sql.errors.QueryExecutionErrors$.orderedOperationUnsupportedByDataTypeError(QueryExecutionErrors.scala:348)
>   at org.apache.spark.sql.catalyst.types.UninitializedPhysicalType$.ordering(PhysicalDataType.scala:332)
>   at org.apache.spark.sql.catalyst.types.UninitializedPhysicalType$.ordering(PhysicalDataType.scala:329)
>   at org.apache.spark.sql.catalyst.expressions.InterpretedOrdering.compare(ordering.scala:60)
>   at org.apache.spark.sql.catalyst.expressions.InterpretedOrdering.compare(ordering.scala:39)
>   at org.apache.spark.sql.execution.UnsafeExternalRowSorter$RowComparator.compare(UnsafeExternalRowSorter.java:254)
> {noformat}
> Note: You don't get an error if you use {{show}} rather than {{collect}}. This is because {{show}} will implicitly add a {{limit}}, in which case the ordering is performed by {{TakeOrderedAndProject}} rather than {{UnsafeExternalRowSorter}}.
[jira] [Resolved] (SPARK-46289) Exception when ordering by UDT in interpreted mode
[ https://issues.apache.org/jira/browse/SPARK-46289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46289. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44361 [https://github.com/apache/spark/pull/44361]

> Exception when ordering by UDT in interpreted mode
> Key: SPARK-46289
> URL: https://issues.apache.org/jira/browse/SPARK-46289
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.3.3, 3.4.2, 3.5.0
> Reporter: Bruce Robbins
> Assignee: Bruce Robbins
> Priority: Minor
> Labels: pull-request-available
> Fix For: 4.0.0
[jira] [Resolved] (SPARK-46438) Fix the scope of toggleFlamegraph and improve test
[ https://issues.apache.org/jira/browse/SPARK-46438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-46438. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 44394 [https://github.com/apache/spark/pull/44394] > Fix the scope of toggleFlamegraph and improve test > -- > > Key: SPARK-46438 > URL: https://issues.apache.org/jira/browse/SPARK-46438 > Project: Spark > Issue Type: Sub-task > Components: UI >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 >
[jira] [Created] (SPARK-46449) Add ability to create databases via Catalog API
Nicholas Chammas created SPARK-46449: Summary: Add ability to create databases via Catalog API Key: SPARK-46449 URL: https://issues.apache.org/jira/browse/SPARK-46449 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: Nicholas Chammas As of Spark 3.5, the only way to create a database is via SQL. The Catalog API should offer an equivalent. Perhaps something like: {code:python} spark.catalog.createDatabase( name: str, existsOk: bool = False, comment: str = None, location: str = None, properties: dict = None, ) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
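To make the proposal above concrete, here is a hedged, pure-Python sketch of the DDL such a Catalog method might issue under the hood. The function name `create_database_ddl` and its parameters simply mirror the signature proposed in the ticket; they are not a real PySpark API, and the generated statement assumes Spark SQL's CREATE DATABASE syntax (COMMENT, LOCATION, WITH DBPROPERTIES).

```python
# Hypothetical helper mirroring the proposed spark.catalog.createDatabase()
# signature; it only builds the equivalent Spark SQL DDL string.
def create_database_ddl(name, exists_ok=False, comment=None,
                        location=None, properties=None):
    parts = ["CREATE DATABASE"]
    if exists_ok:
        parts.append("IF NOT EXISTS")
    parts.append(name)
    if comment is not None:
        parts.append(f"COMMENT '{comment}'")
    if location is not None:
        parts.append(f"LOCATION '{location}'")
    if properties:
        props = ", ".join(f"'{k}'='{v}'" for k, v in properties.items())
        parts.append(f"WITH DBPROPERTIES ({props})")
    return " ".join(parts)

# The proposed API call would then reduce to something like
# spark.sql(create_database_ddl("sales", exists_ok=True, comment="sales data"))
print(create_database_ddl("sales", exists_ok=True, comment="sales data"))
```

This also illustrates why the API would be a thin convenience layer: every proposed parameter already has a direct SQL counterpart, so the main win is discoverability and type checking on the Python side.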
[jira] [Updated] (SPARK-46448) InlineTable column should be nullable when converted to Decimal might cause null
[ https://issues.apache.org/jira/browse/SPARK-46448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Wang updated SPARK-46448: - Summary: InlineTable column should be nullable when converted to Decimal might cause null (was: InlineTable column should be nullable when converted to Decimal and overflow) > InlineTable column should be nullable when converted to Decimal might cause > null > > > Key: SPARK-46448 > URL: https://issues.apache.org/jira/browse/SPARK-46448 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46448) InlineTable column should be nullable when converted to Decimal and overflow
Rui Wang created SPARK-46448: Summary: InlineTable column should be nullable when converted to Decimal and overflow Key: SPARK-46448 URL: https://issues.apache.org/jira/browse/SPARK-46448 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 4.0.0 Reporter: Rui Wang Assignee: Rui Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46446) Correctness bug in correlated subquery with OFFSET
[ https://issues.apache.org/jira/browse/SPARK-46446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46446: --- Labels: pull-request-available (was: ) > Correctness bug in correlated subquery with OFFSET > -- > > Key: SPARK-46446 > URL: https://issues.apache.org/jira/browse/SPARK-46446 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Jack Chen >Priority: Major > Labels: pull-request-available > > Subqueries with correlation under LIMIT with OFFSET have a correctness bug, > introduced recently when support for correlation under OFFSET was enabled but > were not handled correctly. (So we went from unsupported, query throws error > -> wrong results.) > It’s a bug in all types of correlated subqueries: scalar, lateral, IN, EXISTS > > It's easy to repro with a query like > {code:java} > create table x(x1 int, x2 int); > insert into x values (1, 1), (2, 2); > create table y(y1 int, y2 int); > insert into y values (1, 1), (1, 2), (2, 4); > select * from x where exists (select * from y where x1 = y1 limit 1 offset > 2){code} > Correct result: empty set, see postgres: > [https://www.db-fiddle.com/f/dtXNn7hwDnemiCTUhvwgYM/0] > Spark result: Array([2,2]) > > The > [PR|https://github.com/apache/spark/pull/43111/files/324a106611e6d62c31535cfc43863fdaa16e5dda#diff-583171e935b2dc349378063a5841c5b98b30a2d57ac3743a9eccfe7bffcb8f2aR1403] > where it was introduced added a test for it, but the golden file results for > the test actually were incorrect and we didn't notice. (The bug was initially > found by https://github.com/apache/spark/pull/44084) > I'll work on both: > * Adding support for offset in DecorrelateInnerQuery (the transformation is > into a filter on row_number window function, similar to limit). 
> * Adding a feature flag to enable/disable offset in subquery support -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
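The expected semantics in the repro above can be checked without a database: evaluating the EXISTS subquery per outer row, with OFFSET applied before LIMIT (as Postgres does), yields the empty set the ticket identifies as correct. The tables and query come from the repro; the per-row evaluation below is a plain-Python illustration, not Spark's implementation.

```python
# Tables from the repro in the ticket.
x = [(1, 1), (2, 2)]  # (x1, x2)
y = [(1, 1), (1, 2), (2, 4)]  # (y1, y2)

def exists_subquery(x1, limit=1, offset=2):
    # select * from y where y1 = x1 limit 1 offset 2,
    # evaluated per outer row: OFFSET skips rows before LIMIT takes them.
    matches = [row for row in y if row[0] == x1]
    return len(matches[offset:offset + limit]) > 0

# select * from x where exists (...)
result = [row for row in x if exists_subquery(row[0])]
print(result)  # []  -- no x row has a 3rd matching y row, so EXISTS is false
```

For x1=1 there are only two matching y rows and for x1=2 only one, so skipping two rows always leaves nothing; Spark's Array([2,2]) result indicates the OFFSET was not applied per outer row after decorrelation.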
[jira] [Updated] (SPARK-46446) Correctness bug in correlated subquery with OFFSET
[ https://issues.apache.org/jira/browse/SPARK-46446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jack Chen updated SPARK-46446: -- Description: Subqueries with correlation under LIMIT with OFFSET have a correctness bug, introduced recently when support for correlation under OFFSET was enabled but were not handled correctly. (So we went from unsupported, query throws error -> wrong results.) It’s a bug in all types of correlated subqueries: scalar, lateral, IN, EXISTS It's easy to repro with a query like {code:java} create table x(x1 int, x2 int); insert into x values (1, 1), (2, 2); create table y(y1 int, y2 int); insert into y values (1, 1), (1, 2), (2, 4); select * from x where exists (select * from y where x1 = y1 limit 1 offset 2){code} Correct result: empty set, see postgres: [https://www.db-fiddle.com/f/dtXNn7hwDnemiCTUhvwgYM/0] Spark result: Array([2,2]) The [PR|https://github.com/apache/spark/pull/43111/files/324a106611e6d62c31535cfc43863fdaa16e5dda#diff-583171e935b2dc349378063a5841c5b98b30a2d57ac3743a9eccfe7bffcb8f2aR1403] where it was introduced added a test for it, but the golden file results for the test actually were incorrect and we didn't notice. (The bug was initially found by https://github.com/apache/spark/pull/44084) I'll work on both: * Adding support for offset in DecorrelateInnerQuery (the transformation is into a filter on row_number window function, similar to limit). * Adding a feature flag to enable/disable offset in subquery support was: Subqueries with correlation under LIMIT with OFFSET have a correctness bug, introduced recently when support for correlation under OFFSET was enabled but were not handled correctly. (So we went from unsupported, query throws error -> wrong results.) 
It’s a bug in all types of correlated subqueries: scalar, lateral, IN, EXISTS It's easy to repro with a query like {code:java} create table x(x1 int, x2 int); insert into x values (1, 1), (2, 2); create table y(y1 int, y2 int); insert into y values (1, 1), (1, 2), (2, 4); select * from x where exists (select * from y where x1 = y1 limit 1 offset 2){code} Correct result: empty set, see postgres: [https://www.db-fiddle.com/f/dtXNn7hwDnemiCTUhvwgYM/0] Spark result: Array([2,2]) The [PR|https://github.com/apache/spark/pull/43111/files/324a106611e6d62c31535cfc43863fdaa16e5dda#diff-583171e935b2dc349378063a5841c5b98b30a2d57ac3743a9eccfe7bffcb8f2aR1403] where it was introduced added a test for it, but the golden file results for the test actually were incorrect and we didn't notice. I'll work on both: * Adding support for offset in DecorrelateInnerQuery (the transformation is into a filter on row_number window function, similar to limit). * Adding a feature flag to enable/disable offset in subquery support > Correctness bug in correlated subquery with OFFSET > -- > > Key: SPARK-46446 > URL: https://issues.apache.org/jira/browse/SPARK-46446 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Jack Chen >Priority: Major > > Subqueries with correlation under LIMIT with OFFSET have a correctness bug, > introduced recently when support for correlation under OFFSET was enabled but > were not handled correctly. (So we went from unsupported, query throws error > -> wrong results.) 
> It’s a bug in all types of correlated subqueries: scalar, lateral, IN, EXISTS > > It's easy to repro with a query like > {code:java} > create table x(x1 int, x2 int); > insert into x values (1, 1), (2, 2); > create table y(y1 int, y2 int); > insert into y values (1, 1), (1, 2), (2, 4); > select * from x where exists (select * from y where x1 = y1 limit 1 offset > 2){code} > Correct result: empty set, see postgres: > [https://www.db-fiddle.com/f/dtXNn7hwDnemiCTUhvwgYM/0] > Spark result: Array([2,2]) > > The > [PR|https://github.com/apache/spark/pull/43111/files/324a106611e6d62c31535cfc43863fdaa16e5dda#diff-583171e935b2dc349378063a5841c5b98b30a2d57ac3743a9eccfe7bffcb8f2aR1403] > where it was introduced added a test for it, but the golden file results for > the test actually were incorrect and we didn't notice. (The bug was initially > found by https://github.com/apache/spark/pull/44084) > I'll work on both: > * Adding support for offset in DecorrelateInnerQuery (the transformation is > into a filter on row_number window function, similar to limit). > * Adding a feature flag to enable/disable offset in subquery support -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46446) Correctness bug in correlated subquery with OFFSET
[ https://issues.apache.org/jira/browse/SPARK-46446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jack Chen updated SPARK-46446: -- Description: Subqueries with correlation under LIMIT with OFFSET have a correctness bug, introduced recently when support for correlation under OFFSET was enabled but were not handled correctly. (So we went from unsupported, query throws error -> wrong results.) It’s a bug in all types of correlated subqueries: scalar, lateral, IN, EXISTS It's easy to repro with a query like {code:java} create table x(x1 int, x2 int); insert into x values (1, 1), (2, 2); create table y(y1 int, y2 int); insert into y values (1, 1), (1, 2), (2, 4); select * from x where exists (select * from y where x1 = y1 limit 1 offset 2){code} Correct result: empty set, see postgres: [https://www.db-fiddle.com/f/dtXNn7hwDnemiCTUhvwgYM/0] Spark result: Array([2,2]) The [PR|https://github.com/apache/spark/pull/43111/files/324a106611e6d62c31535cfc43863fdaa16e5dda#diff-583171e935b2dc349378063a5841c5b98b30a2d57ac3743a9eccfe7bffcb8f2aR1403] where it was introduced added a test for it, but the golden file results for the test actually were incorrect and we didn't notice. I'll work on both: * Adding support for offset in DecorrelateInnerQuery (the transformation is into a filter on row_number window function, similar to limit). * Adding a feature flag to enable/disable offset in subquery support was: Subqueries with correlation under LIMIT with OFFSET have a correctness bug, introduced recently when support for correlation under OFFSET was enabled but were not handled correctly. (So we went from unsupported, query throws error -> wrong results.) 
It’s a bug in all types of correlated subqueries: scalar, lateral, IN, EXISTS It's easy to repro with a query like {code:java} SELECT * FROM emp join lateral (SELECT dept.dept_name FROM dept WHERE emp.dept_id = dept.dept_id LIMIT 5 OFFSET 3); {code} The [PR|https://github.com/apache/spark/pull/43111/files/324a106611e6d62c31535cfc43863fdaa16e5dda#diff-583171e935b2dc349378063a5841c5b98b30a2d57ac3743a9eccfe7bffcb8f2aR1403] where it was introduced added a test for it, but the golden file results for the test actually were incorrect and we didn't notice. I'll work on both: * Adding support for offset in DecorrelateInnerQuery (the transformation is into a filter on row_number window function, similar to limit). * Adding a feature flag to enable/disable offset in subquery support > Correctness bug in correlated subquery with OFFSET > -- > > Key: SPARK-46446 > URL: https://issues.apache.org/jira/browse/SPARK-46446 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Jack Chen >Priority: Major > > Subqueries with correlation under LIMIT with OFFSET have a correctness bug, > introduced recently when support for correlation under OFFSET was enabled but > were not handled correctly. (So we went from unsupported, query throws error > -> wrong results.) 
> It’s a bug in all types of correlated subqueries: scalar, lateral, IN, EXISTS > > It's easy to repro with a query like > {code:java} > create table x(x1 int, x2 int); > insert into x values (1, 1), (2, 2); > create table y(y1 int, y2 int); > insert into y values (1, 1), (1, 2), (2, 4); > select * from x where exists (select * from y where x1 = y1 limit 1 offset > 2){code} > Correct result: empty set, see postgres: > [https://www.db-fiddle.com/f/dtXNn7hwDnemiCTUhvwgYM/0] > Spark result: Array([2,2]) > > The > [PR|https://github.com/apache/spark/pull/43111/files/324a106611e6d62c31535cfc43863fdaa16e5dda#diff-583171e935b2dc349378063a5841c5b98b30a2d57ac3743a9eccfe7bffcb8f2aR1403] > where it was introduced added a test for it, but the golden file results for > the test actually were incorrect and we didn't notice. > I'll work on both: > * Adding support for offset in DecorrelateInnerQuery (the transformation is > into a filter on row_number window function, similar to limit). > * Adding a feature flag to enable/disable offset in subquery support -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33144) Cannot insert overwrite multiple partitions, get exception "get partition: Value for key name is null or empty"
[ https://issues.apache.org/jira/browse/SPARK-33144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17798232#comment-17798232 ] Wan Kun commented on SPARK-33144: - Speculative task also can cause this issue: Spark Driver Log 23/12/17 11:28:10 INFO TaskSetManager: Starting task 39.1 in stage 5.0 (TID 154) (node-013.com, executor 18, partition 39, PROCESS_LOCAL, 7317 bytes) taskResourceAssignments Map() 23/12/17 11:28:11 INFO TaskSetManager: Killing attempt 1 for task 39.1 in stage 5.0 (TID 154) on node-013.com as the attempt 0 succeeded on hdc42-mcc10-01-0210-6008-022.com 23/12/17 11:28:11 INFO DAGScheduler: ResultStage 5 (main at NativeMethodAccessorImpl.java:0) finished in 2.973 s 23/12/17 11:28:11 INFO DAGScheduler: Job 2 is finished. Cancelling potential speculative or zombie tasks for this job 23/12/17 11:28:11 INFO YarnClusterScheduler: Killing all running tasks in stage 5: Stage finished 23/12/17 11:28:11 INFO DAGScheduler: Job 2 finished: main at NativeMethodAccessorImpl.java:0, took 2.992123 s 23/12/17 11:28:11 INFO FileFormatWriter: Start to commit write Job 9797dcfc-fa17-47ce-907c-97932130f3d3. 23/12/17 11:28:11 INFO FileFormatWriter: Write Job 9797dcfc-fa17-47ce-907c-97932130f3d3 committed. Elapsed time: 56 ms. // // delete ["", "user", "hive", "warehouse",".hive-staging_hive_2023-12-17_11-27-52_763_730376237774235462-1","-ext-1","_temporary"] // and create ["", "user", "hive", "warehouse",".hive-staging_hive_2023-12-17_11-27-52_763_730376237774235462-1","-ext-1","_SUCCESS"] // 23/12/17 11:28:11 INFO FileFormatWriter: Finished processing stats for write job 9797dcfc-fa17-47ce-907c-97932130f3d3. 
23/12/17 11:28:12 WARN TaskSetManager: Lost task 39.1 in stage 5.0 (TID 154) (node-013.com executor 18): TaskKilled (Stage finished) 23/12/17 11:28:12 INFO TaskSetManager: task 39.1 in stage 5.0 (TID 154) failed, but the task will not be re-executed (either because the task failed with a shuffle data fetch failure, so the previous stage needs to be re-run, or because a different copy of the task has already succeeded). 23/12/17 11:28:12 INFO YarnClusterScheduler: Removed TaskSet 5.0, whose tasks have all completed, from pool ... 23/12/17 11:28:14 INFO Hive: New loading path = viewfs://nn/user/hive/warehouse/.hive-staging_hive_2023-12-17_11-27-52_763_730376237774235462-1/-ext-1/_temporary/0 with partSpec \{week_beg_dt=, ind=} ... 23/12/17 11:28:14 ERROR Hive: Exception when loading partition with parameters partPath=viewfs://nn/user/hive/warehouse/.hive-staging_hive_2023-12-17_11-27-52_763_730376237774235462-1/-ext-1/_temporary/0, table=gross_chrn_byr_srw, partSpec=\{week_beg_dt=, ind=}, replace=true, listBucketingEnabled=false, isAcid=false, hasFollowingStatsTask=false org.apache.hadoop.hive.ql.metadata.HiveException: get partition: Value for key week_beg_dt is null or empty at org.apache.hadoop.hive.ql.metadata.Hive.getPartition(Hive.java:2240) at org.apache.hadoop.hive.ql.metadata.Hive.getPartition(Hive.java:2188) at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1618) at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1929) at org.apache.hadoop.hive.ql.metadata.Hive$3.call(Hive.java:1920) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) task 39.1 in stage 5.0 (TID 154) log: Although the task is killed, the parquet file is still created. 
2023-12-17 11:28:13 /10.216.139.145 create ["", "user", "hive", "warehouse",".hive-staging_hive_2023-12-17_11-27-52_763_730376237774235462-1","-ext-1","_temporary","0","_temporary","attempt_202312171128088986155311074942191_0005_m_39_154","week_beg_dt=2023-09-24","ind=T4","part-00039-cd9174aa-8010-41fa-8fbe-46d01d2b1efb.c000"] Delete subdirectory $OUTPUT_DIR/_temporary/0/xxx and caused directory $OUTPUT_DIR/_temporary/0 leak: 2023-12-17 11:28:13 /10.216.139.145 delete ["", "user", "hive", "warehouse",".hive-staging_hive_2023-12-17_11-27-52_763_730376237774235462-1","-ext-1","_temporary","0","_temporary","attempt_202312171128088986155311074942191_0005_m_39_154"] 23/12/17 11:28:12 INFO ParquetRecordWriterWrapper: creating real writer to write at viewfs://nn/user/hive/warehouse/.hive-staging_hive_2023-12-17_11-27-52_763_730376237774235462-1/-ext-1/_temporary/0/_temporary/attempt_202312171128088986155311074942191_0005_m_39_154/week_beg_dt=2023-09-24/ind=T4/part-00039-cd9174aa-8010-41fa-8fbe-46d01d2b1efb.c000 23/12/17 11:28:12 ERROR Utils: Aborting task org.apache.hadoop.hive.ql.metadata.HiveException: java.io.InterruptedIOException: Interrupted waiting to send RPC request to server at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:274) at
[jira] [Created] (SPARK-46446) Correctness bug in correlated subquery with OFFSET
Jack Chen created SPARK-46446: - Summary: Correctness bug in correlated subquery with OFFSET Key: SPARK-46446 URL: https://issues.apache.org/jira/browse/SPARK-46446 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 4.0.0 Reporter: Jack Chen Subqueries with correlation under LIMIT with OFFSET have a correctness bug, introduced recently when support for correlation under OFFSET was enabled but were not handled correctly. (So we went from unsupported, query throws error -> wrong results.) It’s a bug in all types of correlated subqueries: scalar, lateral, IN, EXISTS It's easy to repro with a query like {code:java} SELECT * FROM emp join lateral (SELECT dept.dept_name FROM dept WHERE emp.dept_id = dept.dept_id LIMIT 5 OFFSET 3); {code} The [PR|https://github.com/apache/spark/pull/43111/files/324a106611e6d62c31535cfc43863fdaa16e5dda#diff-583171e935b2dc349378063a5841c5b98b30a2d57ac3743a9eccfe7bffcb8f2aR1403] where it was introduced added a test for it, but the golden file results for the test actually were incorrect and we didn't notice. I'll work on both: * Adding support for offset in DecorrelateInnerQuery (the transformation is into a filter on row_number window function, similar to limit). * Adding a feature flag to enable/disable offset in subquery support -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43474) Add support to create DataFrame Reference in Spark connect
[ https://issues.apache.org/jira/browse/SPARK-43474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-43474: --- Labels: pull-request-available (was: ) > Add support to create DataFrame Reference in Spark connect > -- > > Key: SPARK-43474 > URL: https://issues.apache.org/jira/browse/SPARK-43474 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.5.0 >Reporter: Peng Zhong >Assignee: Raghu Angadi >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > Add support in Spark Connect to cache a DataFrame on server side. From client > side, it can create a reference to that DataFrame given the cache key. > > This function will be used in streaming foreachBatch(). Server needs to call > user function for every batch which takes a DataFrame as argument. With the > new function, we can just cache the DataFrame on the server. Pass the id back > to client which can creates the DataFrame reference. The server will replace > the reference when transforming. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37377) SPJ: Initial implementation of Storage-Partitioned Join
[ https://issues.apache.org/jira/browse/SPARK-37377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-37377: --- Labels: pull-request-available (was: ) > SPJ: Initial implementation of Storage-Partitioned Join > --- > > Key: SPARK-37377 > URL: https://issues.apache.org/jira/browse/SPARK-37377 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Labels: pull-request-available > Fix For: 3.3.0 > > > This Jira tracks the initial implementation of storage-partitioned join. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40295) Allow v2 functions with literal args in write distribution and ordering
[ https://issues.apache.org/jira/browse/SPARK-40295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-40295: --- Labels: pull-request-available (was: ) > Allow v2 functions with literal args in write distribution and ordering > --- > > Key: SPARK-40295 > URL: https://issues.apache.org/jira/browse/SPARK-40295 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Anton Okolnychyi >Assignee: Anton Okolnychyi >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > Spark should allow v2 function with literal args in write distribution and > ordering. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46445) V2ExpressionUtils.toCatalystTransformOpt throws when resolving the bucket/sorted_bucket functions
Никита Соколов created SPARK-46445: -- Summary: V2ExpressionUtils.toCatalystTransformOpt throws when resolving the bucket/sorted_bucket functions Key: SPARK-46445 URL: https://issues.apache.org/jira/browse/SPARK-46445 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.5.0 Reporter: Никита Соколов I am trying to build a V2 data-source exposing its partitioning+bucketing properties so it would be possible to execute partition-wise joins. At the moment it is impossible to let the Spark engine transform a DataSourceV2Relation into a DataSourceV2ScanRelation because of this error: {code:java} org.apache.spark.sql.AnalysisException: [REQUIRES_SINGLE_PART_NAMESPACE] spark_catalog requires a single-part namespace, but got . at org.apache.spark.sql.errors.QueryCompilationErrors$.requiresSinglePartNamespaceError(QueryCompilationErrors.scala:1336) at org.apache.spark.sql.execution.datasources.v2.V2SessionCatalog$TableIdentifierHelper.asFunctionIdentifier(V2SessionCatalog.scala:254) at org.apache.spark.sql.execution.datasources.v2.V2SessionCatalog.loadFunction(V2SessionCatalog.scala:351) at org.apache.spark.sql.catalyst.expressions.V2ExpressionUtils$.loadV2FunctionOpt(V2ExpressionUtils.scala:128) at org.apache.spark.sql.catalyst.expressions.V2ExpressionUtils$.$anonfun$toCatalystTransformOpt$6(V2ExpressionUtils.scala:114) at scala.Option.flatMap(Option.scala:271) at org.apache.spark.sql.catalyst.expressions.V2ExpressionUtils$.toCatalystTransformOpt(V2ExpressionUtils.scala:113) at org.apache.spark.sql.catalyst.expressions.V2ExpressionUtils$.toCatalystOpt(V2ExpressionUtils.scala:82) at org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioningAndOrdering$$anonfun$partitioning$1.$anonfun$applyOrElse$1(V2ScanPartitioningAndOrdering.scala:47) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286) at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at 
scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at scala.collection.TraversableLike.map(TraversableLike.scala:286) at scala.collection.TraversableLike.map$(TraversableLike.scala:279) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198) at org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioningAndOrdering$$anonfun$partitioning$1.applyOrElse(V2ScanPartitioningAndOrdering.scala:47) at org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioningAndOrdering$$anonfun$partitioning$1.applyOrElse(V2ScanPartitioningAndOrdering.scala:42) {code} It looks like it is impossible for the library code to succeed: here is the place where it tries to load the bucket function –[https://github.com/apache/spark/blob/e359318c4493e16a7546d70c9340ffc5015aacff/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/V2ExpressionUtils.scala#L108] here is an Identifier with no namespace being constructed – [https://github.com/apache/spark/blob/bfafad4d47b4f60e93d17ccc3a8dcc8bae03cf9a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/V2ExpressionUtils.scala#L132] here is the code trying to transform it to a FunctionIdentifier – [https://github.com/apache/spark/blob/bfafad4d47b4f60e93d17ccc3a8dcc8bae03cf9a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalog.scala#L437] but this is impossible for an Identifier with no namespace – [https://github.com/apache/spark/blob/bfafad4d47b4f60e93d17ccc3a8dcc8bae03cf9a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2SessionCatalog.scala#L340] I have managed to somehow bypass this by constructing the DataSourceV2ScanRelation myself, but I guess this is not an intended use and this forces me to push the filter-predicates down myself. 
[jira] [Updated] (SPARK-46444) V2SessionCatalog#createTable should not load the table
[ https://issues.apache.org/jira/browse/SPARK-46444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46444: --- Labels: pull-request-available (was: ) > V2SessionCatalog#createTable should not load the table > -- > > Key: SPARK-46444 > URL: https://issues.apache.org/jira/browse/SPARK-46444 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Wenchen Fan >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46444) V2SessionCatalog#createTable should not load the table
Wenchen Fan created SPARK-46444: --- Summary: V2SessionCatalog#createTable should not load the table Key: SPARK-46444 URL: https://issues.apache.org/jira/browse/SPARK-46444 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Wenchen Fan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46443) Decimal precision and scale should be decided by JDBC dialect.
[ https://issues.apache.org/jira/browse/SPARK-46443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiaan Geng updated SPARK-46443: --- Summary: Decimal precision and scale should be decided by JDBC dialect. (was: Ensure Decimal precision and scale should be decided by JDBC dialect.) > Decimal precision and scale should be decided by JDBC dialect. > --- > > Key: SPARK-46443 > URL: https://issues.apache.org/jira/browse/SPARK-46443 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Jiaan Geng >Assignee: Jiaan Geng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45525) Initial support for Python data source write API
[ https://issues.apache.org/jira/browse/SPARK-45525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-45525: -- Assignee: (was: Apache Spark) > Initial support for Python data source write API > > > Key: SPARK-45525 > URL: https://issues.apache.org/jira/browse/SPARK-45525 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > Labels: pull-request-available > > Add a new command and logical rules (similar to V1Writes and V2Writes) to > support Python data source write. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45525) Initial support for Python data source write API
[ https://issues.apache.org/jira/browse/SPARK-45525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-45525: -- Assignee: Apache Spark > Initial support for Python data source write API > > > Key: SPARK-45525 > URL: https://issues.apache.org/jira/browse/SPARK-45525 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > > Add a new command and logical rules (similar to V1Writes and V2Writes) to > support Python data source write. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40876) Spark's Vectorized ParquetReader should support type promotions
[ https://issues.apache.org/jira/browse/SPARK-40876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot reassigned SPARK-40876:
--------------------------------------

    Assignee:     (was: Apache Spark)

> Spark's Vectorized ParquetReader should support type promotions
> ---------------------------------------------------------------
>
>                 Key: SPARK-40876
>                 URL: https://issues.apache.org/jira/browse/SPARK-40876
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>    Affects Versions: 3.3.0
>            Reporter: Alexey Kudinkin
>            Priority: Major
>              Labels: pull-request-available
>
> Currently, when reading a Parquet table using Spark's `VectorizedColumnReader`,
> we hit an issue when the requested (projection) schema widens one of the
> fields' types from int32 to long.
> The expectation is that, since this is a legitimate primitive type promotion,
> we should be able to read Ints into Longs without problems (Avro, for example,
> handles this fine).
> However, the `ParquetVectorUpdaterFactory.getUpdater` method fails with the
> exception listed below.
> Looking at the code, it actually seems to allow the opposite: it permits
> "down-sizing" Int32s persisted in Parquet to be read as Bytes or Shorts, for
> example. The rationale for this behavior is unclear, and it looks like a bug,
> since it can lead to data truncation:
> {code:java}
> case INT32:
>   if (sparkType == DataTypes.IntegerType || canReadAsIntDecimal(descriptor, sparkType)) {
>     return new IntegerUpdater();
>   } else if (sparkType == DataTypes.LongType && isUnsignedIntTypeMatched(32)) {
>     // In `ParquetToSparkSchemaConverter`, we map parquet UINT32 to our LongType.
>     // For unsigned int32, it stores as plain signed int32 in Parquet when dictionary
>     // fallbacks. We read them as long values.
>     return new UnsignedIntegerUpdater();
>   } else if (sparkType == DataTypes.ByteType) {
>     return new ByteUpdater();
>   } else if (sparkType == DataTypes.ShortType) {
>     return new ShortUpdater();
>   } else if (sparkType == DataTypes.DateType) {
>     if ("CORRECTED".equals(datetimeRebaseMode)) {
>       return new IntegerUpdater();
>     } else {
>       boolean failIfRebase = "EXCEPTION".equals(datetimeRebaseMode);
>       return new IntegerWithRebaseUpdater(failIfRebase);
>     }
>   }
>   break; {code}
> Exception:
> {code:java}
> at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2454)
> at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2403)
> at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2402)
> at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
> at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2402)
> at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1160)
> at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1160)
> at scala.Option.foreach(Option.scala:407)
> at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1160)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2642)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2584)
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2573)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2214)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2235)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2254)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2279)
> at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
> at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
> at org.apache.spark.RangePartitioner$.sketch(Partitioner.scala:304)
> at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:171)
> at
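The promotion the reporter asks for is lossless, and the missing branch would amount to an int32-to-long widening updater. Below is a minimal stand-alone sketch of that idea (plain Java, no Spark dependencies); `IntReader`, `readAsLongs`, and the imagined "IntegerToLongUpdater" are illustrative names, not Spark's actual API:

```java
// Hypothetical sketch of the int32 -> long widening path, modeled on the
// ParquetVectorUpdater pattern quoted in the issue. Names are illustrative.
import java.util.Arrays;

class WideningSketch {
  // Stand-in for a decoder positioned on a Parquet page of int32 values.
  interface IntReader {
    int readInteger();
  }

  // What a hypothetical "IntegerToLongUpdater" would do: decode each int32
  // and store it in a long slot. The implicit int -> long promotion is
  // lossless, unlike the byte/short down-sizing branches the issue questions.
  static long[] readAsLongs(int total, IntReader reader) {
    long[] out = new long[total];
    for (int i = 0; i < total; i++) {
      out[i] = reader.readInteger();
    }
    return out;
  }

  public static void main(String[] args) {
    int[] page = {1, -2, Integer.MAX_VALUE};
    int[] pos = {0};
    IntReader reader = () -> page[pos[0]++];
    // prints [1, -2, 2147483647]
    System.out.println(Arrays.toString(readAsLongs(page.length, reader)));
  }
}
```

Note that even `Integer.MAX_VALUE` survives the widening intact, which is why this direction is safe while reading int32 into Byte or Short is not.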
[jira] [Resolved] (SPARK-46440) Set the rebase configs to the CORRECTED mode by default
[ https://issues.apache.org/jira/browse/SPARK-46440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Max Gekk resolved SPARK-46440.
------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 44388
[https://github.com/apache/spark/pull/44388]

> Set the rebase configs to the CORRECTED mode by default
> -------------------------------------------------------
>
>                 Key: SPARK-46440
>                 URL: https://issues.apache.org/jira/browse/SPARK-46440
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.0.0
>            Reporter: Max Gekk
>            Assignee: Max Gekk
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
> Set all rebase-related SQL configs to the `CORRECTED` mode by default. The
> affected configs are:
> - spark.sql.parquet.int96RebaseModeInWrite
> - spark.sql.parquet.datetimeRebaseModeInWrite
> - spark.sql.parquet.int96RebaseModeInRead
> - spark.sql.parquet.datetimeRebaseModeInRead
> - spark.sql.avro.datetimeRebaseModeInWrite
> - spark.sql.avro.datetimeRebaseModeInRead
> The configs were set to the `EXCEPTION` mode to let users select a proper mode
> for compatibility with old Spark versions <= 2.4.5, which cannot detect the
> rebase mode from the metadata in Parquet and Avro files. Since those versions
> are no longer in broad use, Spark will, starting from version 4.0.0, write and
> read ancient datetimes without rebasing and without raising exceptions. This
> should be more convenient for users.
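The practical difference between the three mode values can be sketched as a small dispatch. This stand-alone snippet only models the meaning of each setting; the `describe` helper is hypothetical, and only the mode names (`EXCEPTION`, `CORRECTED`, `LEGACY`) come from the SQL configs listed above:

```java
// Hypothetical sketch of what the rebase-mode config values mean for ancient
// datetimes. describe() is illustrative, not Spark code.
class RebaseModeSketch {
  static String describe(String mode) {
    switch (mode) {
      case "CORRECTED":
        // New 4.0.0 default: values are read/written as-is, no calendar rebase.
        return "no rebase";
      case "LEGACY":
        // Rebase between the hybrid Julian and Proleptic Gregorian calendars.
        return "rebase";
      case "EXCEPTION":
        // Previous default: fail when an ancient datetime is encountered.
        return "fail on ancient datetimes";
      default:
        throw new IllegalArgumentException("unknown rebase mode: " + mode);
    }
  }

  public static void main(String[] args) {
    // prints: no rebase
    System.out.println(describe("CORRECTED"));
  }
}
```

In this framing, the change in SPARK-46440 simply moves the default from the failing branch to the as-is branch.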