[jira] [Resolved] (SPARK-47265) Use `createTable(..., schema: StructType, ...)` instead of `createTable(..., columns: Array[Column], ...)` in UT
[ https://issues.apache.org/jira/browse/SPARK-47265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-47265. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45368 [https://github.com/apache/spark/pull/45368] > Use `createTable(..., schema: StructType, ...)` instead of `createTable(..., > columns: Array[Column], ...)` in UT > > > Key: SPARK-47265 > URL: https://issues.apache.org/jira/browse/SPARK-47265 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
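For context, a minimal sketch of the call shape this ticket standardizes on for unit tests, assuming some TableCatalog instance named catalog (identifier and column names here are illustrative, not taken from the PR):
```
import java.util.Collections
import org.apache.spark.sql.connector.catalog.{Identifier, TableCatalog}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

// Assumed: `catalog` is a TableCatalog under test.
val catalog: TableCatalog = ???

val ident = Identifier.of(Array("ns"), "tbl")
val schema = new StructType().add("id", IntegerType).add("data", StringType)

// The schema-based overload preferred by this ticket for UTs, instead of the
// columns-based one taking Array[Column]:
catalog.createTable(ident, schema, Array.empty[Transform], Collections.emptyMap[String, String])
```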
[jira] [Assigned] (SPARK-47265) Use `createTable(..., schema: StructType, ...)` instead of `createTable(..., columns: Array[Column], ...)` in UT
[ https://issues.apache.org/jira/browse/SPARK-47265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-47265: --- Assignee: BingKun Pan > Use `createTable(..., schema: StructType, ...)` instead of `createTable(..., > columns: Array[Column], ...)` in UT > > > Key: SPARK-47265 > URL: https://issues.apache.org/jira/browse/SPARK-47265 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47319) Improve missingInput calculation
[ https://issues.apache.org/jira/browse/SPARK-47319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-47319: --- Summary: Improve missingInput calculation (was: Fix missingInput calculation) > Improve missingInput calculation > > > Key: SPARK-47319 > URL: https://issues.apache.org/jira/browse/SPARK-47319 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Peter Toth >Assignee: Peter Toth >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47319) Improve missingInput calculation
[ https://issues.apache.org/jira/browse/SPARK-47319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-47319: --- Description: {{QueryPlan.missingInput()}} calculation seems to be the root cause of {{DeduplicateRelations}} slowness. Let's try to improve it. > Improve missingInput calculation > > > Key: SPARK-47319 > URL: https://issues.apache.org/jira/browse/SPARK-47319 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Peter Toth >Assignee: Peter Toth >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47319) Improve missingInput calculation
[ https://issues.apache.org/jira/browse/SPARK-47319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-47319: --- Description: {{QueryPlan.missingInput()}} calculation seems to be the root cause of {{DeduplicateRelations}} slowness in some cases. Let's try to improve it. (was: {{QueryPlan.missingInput()}} calculation seems to be the root cause of {{DeduplicateRelations}} slowness. Let's try to improve it.) > Improve missingInput calculation > > > Key: SPARK-47319 > URL: https://issues.apache.org/jira/browse/SPARK-47319 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Peter Toth >Assignee: Peter Toth >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > {{QueryPlan.missingInput()}} calculation seems to be the root cause of > {{DeduplicateRelations}} slowness in some cases. Let's try to improve it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
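To make the slowness concrete: missingInput is essentially a set difference over attribute sets. A toy model of the computation (paraphrased, not Spark's actual code) shows why repeatedly recomputing it inside a rule like DeduplicateRelations can get expensive on wide plans:
```
// Toy model: "missing" attributes are those referenced by a node's
// expressions but supplied neither by its children nor by the node itself.
case class Attr(id: Long)

def missingInput(references: Set[Attr], inputSet: Set[Attr], produced: Set[Attr]): Set[Attr] =
  references -- inputSet -- produced

// Example: Attr(2) is referenced but never supplied, so it is "missing".
val missing = missingInput(Set(Attr(1), Attr(2)), Set(Attr(1)), Set.empty)
assert(missing == Set(Attr(2)))
```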
[jira] [Assigned] (SPARK-47322) Make `withColumnsRenamed` duplicated column name handling consistent with `withColumnRenamed`
[ https://issues.apache.org/jira/browse/SPARK-47322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-47322: -- Assignee: Apache Spark > Make `withColumnsRenamed` duplicated column name handling consistent with > `withColumnRenamed` > - > > Key: SPARK-47322 > URL: https://issues.apache.org/jira/browse/SPARK-47322 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47322) Make `withColumnsRenamed` duplicated column name handling consistent with `withColumnRenamed`
[ https://issues.apache.org/jira/browse/SPARK-47322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-47322: -- Assignee: (was: Apache Spark) > Make `withColumnsRenamed` duplicated column name handling consistent with > `withColumnRenamed` > - > > Key: SPARK-47322 > URL: https://issues.apache.org/jira/browse/SPARK-47322 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47254) Assign names to the error classes _LEGACY_ERROR_TEMP_325[1-9]
[ https://issues.apache.org/jira/browse/SPARK-47254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-47254: -- Assignee: (was: Apache Spark) > Assign names to the error classes _LEGACY_ERROR_TEMP_325[1-9] > - > > Key: SPARK-47254 > URL: https://issues.apache.org/jira/browse/SPARK-47254 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Priority: Minor > Labels: pull-request-available, starter > > Choose a proper name for the error class *_LEGACY_ERROR_TEMP_325[1-9]* defined in {*}core/src/main/resources/error/error-classes.json{*}. The name should be short but complete (look at the examples in error-classes.json). > Add a test which triggers the error from user code if such a test doesn't already exist. Check the exception fields by using {*}checkError(){*}. That function checks only the valuable error fields and avoids depending on the error text message; this way, tech editors can modify the error format in error-classes.json without worrying about Spark's internal tests. Migrate other tests that might trigger the error to checkError(). > If you cannot reproduce the error from user space (using a SQL query), replace the error with an internal error, see {*}SparkException.internalError(){*}. > Improve the error message format in error-classes.json if the current one is not clear, and propose a solution to users for how to avoid and fix such errors. > Please look at the PRs below as examples: > * [https://github.com/apache/spark/pull/38685] > * [https://github.com/apache/spark/pull/38656] > * [https://github.com/apache/spark/pull/38490] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47254) Assign names to the error classes _LEGACY_ERROR_TEMP_325[1-9]
[ https://issues.apache.org/jira/browse/SPARK-47254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-47254: -- Assignee: Apache Spark > Assign names to the error classes _LEGACY_ERROR_TEMP_325[1-9] > - > > Key: SPARK-47254 > URL: https://issues.apache.org/jira/browse/SPARK-47254 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Minor > Labels: pull-request-available, starter > > Choose a proper name for the error class *_LEGACY_ERROR_TEMP_325[1-9]* defined in {*}core/src/main/resources/error/error-classes.json{*}. The name should be short but complete (look at the examples in error-classes.json). > Add a test which triggers the error from user code if such a test doesn't already exist. Check the exception fields by using {*}checkError(){*}. That function checks only the valuable error fields and avoids depending on the error text message; this way, tech editors can modify the error format in error-classes.json without worrying about Spark's internal tests. Migrate other tests that might trigger the error to checkError(). > If you cannot reproduce the error from user space (using a SQL query), replace the error with an internal error, see {*}SparkException.internalError(){*}. > Improve the error message format in error-classes.json if the current one is not clear, and propose a solution to users for how to avoid and fix such errors. > Please look at the PRs below as examples: > * [https://github.com/apache/spark/pull/38685] > * [https://github.com/apache/spark/pull/38656] > * [https://github.com/apache/spark/pull/38490] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
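For readers picking up these starter tasks, the requested test shape looks roughly like this inside a Spark test suite; the error class name, query, and parameters below are made up for illustration and are not from any actual PR:
```
// Hedged sketch: trigger the error from user code, then assert on the
// structured fields via checkError() rather than on the message text.
val e = intercept[org.apache.spark.sql.AnalysisException] {
  sql("SELECT some_query_that_triggers_the_error")  // illustrative query
}
checkError(
  exception = e,
  errorClass = "SOME_PROPERLY_NAMED_ERROR",  // the new name replacing _LEGACY_ERROR_TEMP_32xx
  parameters = Map("objectName" -> "`t`"))   // illustrative parameters
```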
[jira] [Assigned] (SPARK-47316) Fix TimestampNTZ in Postgres Array
[ https://issues.apache.org/jira/browse/SPARK-47316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-47316: -- Assignee: (was: Apache Spark) > Fix TimestampNTZ in Postgres Array > --- > > Key: SPARK-47316 > URL: https://issues.apache.org/jira/browse/SPARK-47316 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-47255) Assign names to the error classes _LEGACY_ERROR_TEMP_324[7-9]
[ https://issues.apache.org/jira/browse/SPARK-47255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824692#comment-17824692 ] Milan Dankovic commented on SPARK-47255: I am working on it. > Assign names to the error classes _LEGACY_ERROR_TEMP_324[7-9] > - > > Key: SPARK-47255 > URL: https://issues.apache.org/jira/browse/SPARK-47255 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Priority: Minor > Labels: starter > > Choose a proper name for the error class *_LEGACY_ERROR_TEMP_324[7-9]* defined in {*}core/src/main/resources/error/error-classes.json{*}. The name should be short but complete (look at the examples in error-classes.json). > Add a test which triggers the error from user code if such a test doesn't already exist. Check the exception fields by using {*}checkError(){*}. That function checks only the valuable error fields and avoids depending on the error text message; this way, tech editors can modify the error format in error-classes.json without worrying about Spark's internal tests. Migrate other tests that might trigger the error to checkError(). > If you cannot reproduce the error from user space (using a SQL query), replace the error with an internal error, see {*}SparkException.internalError(){*}. > Improve the error message format in error-classes.json if the current one is not clear, and propose a solution to users for how to avoid and fix such errors. > Please look at the PRs below as examples: > * [https://github.com/apache/spark/pull/38685] > * [https://github.com/apache/spark/pull/38656] > * [https://github.com/apache/spark/pull/38490] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47324) Call convertJavaTimestampToTimeStamp in Array getter
Kent Yao created SPARK-47324: Summary: Call convertJavaTimestampToTimeStamp in Array getter Key: SPARK-47324 URL: https://issues.apache.org/jira/browse/SPARK-47324 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.1 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47255) Assign names to the error classes _LEGACY_ERROR_TEMP_324[7-9]
[ https://issues.apache.org/jira/browse/SPARK-47255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47255: --- Labels: pull-request-available starter (was: starter) > Assign names to the error classes _LEGACY_ERROR_TEMP_324[7-9] > - > > Key: SPARK-47255 > URL: https://issues.apache.org/jira/browse/SPARK-47255 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Priority: Minor > Labels: pull-request-available, starter > > Choose a proper name for the error class *_LEGACY_ERROR_TEMP_324[7-9]* defined in {*}core/src/main/resources/error/error-classes.json{*}. The name should be short but complete (look at the examples in error-classes.json). > Add a test which triggers the error from user code if such a test doesn't already exist. Check the exception fields by using {*}checkError(){*}. That function checks only the valuable error fields and avoids depending on the error text message; this way, tech editors can modify the error format in error-classes.json without worrying about Spark's internal tests. Migrate other tests that might trigger the error to checkError(). > If you cannot reproduce the error from user space (using a SQL query), replace the error with an internal error, see {*}SparkException.internalError(){*}. > Improve the error message format in error-classes.json if the current one is not clear, and propose a solution to users for how to avoid and fix such errors. > Please look at the PRs below as examples: > * [https://github.com/apache/spark/pull/38685] > * [https://github.com/apache/spark/pull/38656] > * [https://github.com/apache/spark/pull/38490] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47324) Call convertJavaTimestampToTimeStamp in Array getter
[ https://issues.apache.org/jira/browse/SPARK-47324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-47324: - Affects Version/s: 4.0.0 (was: 3.5.1) > Call convertJavaTimestampToTimeStamp in Array getter > > > Key: SPARK-47324 > URL: https://issues.apache.org/jira/browse/SPARK-47324 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
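As a rough sketch of the idea behind this ticket (the dialect hook is stubbed out below; the real one lives on Spark's JdbcDialect): when materializing a JDBC array of timestamps, each element should pass through the same conversion the scalar timestamp getter already applies. Table and query names are assumptions for illustration:
```
import java.sql.{Connection, Timestamp}

// Stub standing in for the JdbcDialect hook of (roughly) the same name.
def convertJavaTimestampToTimestamp(t: Timestamp): Timestamp = t

// Apply the converter per element when reading an array column, mirroring
// what the scalar Timestamp getter does.
def readTimestampArray(conn: Connection, query: String): Seq[Timestamp] = {
  val rs = conn.createStatement().executeQuery(query)
  rs.next()
  rs.getArray(1).getArray.asInstanceOf[Array[AnyRef]]
    .toSeq
    .map(v => convertJavaTimestampToTimestamp(v.asInstanceOf[Timestamp]))
}
```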
[jira] [Resolved] (SPARK-47316) Fix TimestampNTZ in Postgres Array
[ https://issues.apache.org/jira/browse/SPARK-47316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-47316. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45418 [https://github.com/apache/spark/pull/45418] > Fix TimestampNTZ in Postgres Array > --- > > Key: SPARK-47316 > URL: https://issues.apache.org/jira/browse/SPARK-47316 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47316) Fix TimestampNTZ in Postgres Array
[ https://issues.apache.org/jira/browse/SPARK-47316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-47316: Assignee: Kent Yao > Fix TimestampNTZ in Postgres Array > --- > > Key: SPARK-47316 > URL: https://issues.apache.org/jira/browse/SPARK-47316 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47302) Collation name should be identifier
[ https://issues.apache.org/jira/browse/SPARK-47302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-47302. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45405 [https://github.com/apache/spark/pull/45405] > Collation name should be identifier > --- > > Key: SPARK-47302 > URL: https://issues.apache.org/jira/browse/SPARK-47302 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Aleksandar Tomic >Assignee: Aleksandar Tomic >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Currently collation names are parsed as string literals. > Per the spec, they should be multi-part identifiers (see the spec linked from the root > collation Jira). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47302) Collation name should be identifier
[ https://issues.apache.org/jira/browse/SPARK-47302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-47302: Assignee: Aleksandar Tomic > Collation name should be identifier > --- > > Key: SPARK-47302 > URL: https://issues.apache.org/jira/browse/SPARK-47302 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Aleksandar Tomic >Assignee: Aleksandar Tomic >Priority: Major > Labels: pull-request-available > > Currently collation names are parsed as string literals. > Per the spec, they should be multi-part identifiers (see the spec linked from the root > collation Jira). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47325) Use the latest buf-setup-action in github workflow
BingKun Pan created SPARK-47325: --- Summary: Use the latest buf-setup-action in github workflow Key: SPARK-47325 URL: https://issues.apache.org/jira/browse/SPARK-47325 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 4.0.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47325) Use the latest buf-setup-action in github workflow
[ https://issues.apache.org/jira/browse/SPARK-47325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-47325: Component/s: Project Infra (was: Build) > Use the latest buf-setup-action in github workflow > -- > > Key: SPARK-47325 > URL: https://issues.apache.org/jira/browse/SPARK-47325 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47325) Use the latest buf-setup-action in github workflow
[ https://issues.apache.org/jira/browse/SPARK-47325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47325: --- Labels: pull-request-available (was: ) > Use the latest buf-setup-action in github workflow > -- > > Key: SPARK-47325 > URL: https://issues.apache.org/jira/browse/SPARK-47325 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47322) Make `withColumnsRenamed` duplicated column name handling consistent with `withColumnRenamed`
[ https://issues.apache.org/jira/browse/SPARK-47322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-47322: Assignee: Ruifeng Zheng > Make `withColumnsRenamed` duplicated column name handling consistent with > `withColumnRenamed` > - > > Key: SPARK-47322 > URL: https://issues.apache.org/jira/browse/SPARK-47322 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47322) Make `withColumnsRenamed` duplicated column name handling consistent with `withColumnRenamed`
[ https://issues.apache.org/jira/browse/SPARK-47322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-47322. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45431 [https://github.com/apache/spark/pull/45431] > Make `withColumnsRenamed` duplicated column name handling consistent with > `withColumnRenamed` > - > > Key: SPARK-47322 > URL: https://issues.apache.org/jira/browse/SPARK-47322 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
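One reading of what "consistent" means here, as an illustrative sketch rather than the PR's own test (assumes a SparkSession with implicits imported): when a DataFrame carries duplicated column names, the single-column rename API touches every matching occurrence, and the bulk API is expected to match that behavior.
```
// Illustrative only: the ticket's point is that the bulk API should resolve
// duplicated names the same way the single-column API does.
val df = Seq((1, 2)).toDF("a", "b").select($"a", $"a") // duplicate name "a"

df.withColumnRenamed("a", "x").columns.toSeq           // Seq("x", "x")
df.withColumnsRenamed(Map("a" -> "x")).columns.toSeq   // expected to match: Seq("x", "x")
```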
[jira] [Created] (SPARK-47326) Moving tests to related Suites
Mihailo Milosevic created SPARK-47326: - Summary: Moving tests to related Suites Key: SPARK-47326 URL: https://issues.apache.org/jira/browse/SPARK-47326 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Mihailo Milosevic -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47324) Call convertJavaTimestampToTimeStamp in Array getter
[ https://issues.apache.org/jira/browse/SPARK-47324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47324: --- Labels: pull-request-available (was: ) > Call convertJavaTimestampToTimeStamp in Array getter > > > Key: SPARK-47324 > URL: https://issues.apache.org/jira/browse/SPARK-47324 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-47286) IN operator support
[ https://issues.apache.org/jira/browse/SPARK-47286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824786#comment-17824786 ] Gideon P commented on SPARK-47286: -- [~dbatomic] can I raise a PR to implement this one? > IN operator support > --- > > Key: SPARK-47286 > URL: https://issues.apache.org/jira/browse/SPARK-47286 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Aleksandar Tomic >Priority: Major > > At this point the following query works fine: > ``` > sql("select * from t1 where ucs_basic_lcase in ('aaa' collate > 'ucs_basic_lcase', 'bbb' collate 'ucs_basic_lcase')").show() > ``` > But if we were to omit the explicit collate or even mix collations: > ``` > sql("select * from t1 where ucs_basic_lcase in ('aaa' collate > 'ucs_basic_lcase', 'bbb')").show() > ``` > The query would still run and return invalid results. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47327) Fix thread safety issue in ICU Collator
Stefan Kandic created SPARK-47327: - Summary: Fix thread safety issue in ICU Collator Key: SPARK-47327 URL: https://issues.apache.org/jira/browse/SPARK-47327 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Affects Versions: 4.0.0 Reporter: Stefan Kandic ICU Collator is not thread-safe by default, so we have to freeze it in order to produce correct results in a multi-threaded environment -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47327) Fix thread safety issue in ICU Collator
[ https://issues.apache.org/jira/browse/SPARK-47327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47327: --- Labels: pull-request-available (was: ) > Fix thread safety issue in ICU Collator > --- > > Key: SPARK-47327 > URL: https://issues.apache.org/jira/browse/SPARK-47327 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 4.0.0 >Reporter: Stefan Kandic >Priority: Major > Labels: pull-request-available > > ICU Collator is not thread-safe by default, so we have to freeze it in order > to produce correct results in a multi-threaded environment -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
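A minimal sketch of the fix's idea, not the Spark patch itself: ICU4J documents that only frozen Collators are thread-safe, so the collator is frozen once and the frozen instance is shared across threads.
```
import com.ibm.icu.text.Collator
import com.ibm.icu.util.ULocale

// freeze() makes the instance immutable and therefore safe to share.
val collator: Collator = Collator.getInstance(ULocale.ROOT).freeze()

// Concurrent compare() calls on the frozen instance now yield stable results.
val threads = (1 to 4).map { i =>
  new Thread(() => println(s"thread $i: " + collator.compare("abc", "ABC")))
}
threads.foreach(_.start())
threads.foreach(_.join())
```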
[jira] [Updated] (SPARK-47313) scala.MatchError should be treated as internal error
[ https://issues.apache.org/jira/browse/SPARK-47313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47313: --- Labels: pull-request-available (was: ) > scala.MatchError should be treated as internal error > > > Key: SPARK-47313 > URL: https://issues.apache.org/jira/browse/SPARK-47313 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Wenchen Fan >Priority: Major > Labels: pull-request-available > > We should update `QueryExecution.toInternalError` to handle scala.MatchError -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
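A hedged sketch of what "treat scala.MatchError as internal error" could look like; the names are paraphrased from the ticket, not the actual Spark code, and the real change would live in QueryExecution.toInternalError and use SparkException.internalError:
```
// Paraphrased sketch: widen the set of throwables that get wrapped as
// internal errors to include scala.MatchError.
def toInternalError(msg: String, e: Throwable): Throwable = e match {
  case _: java.lang.NullPointerException | _: java.lang.AssertionError | _: scala.MatchError =>
    new RuntimeException(s"[INTERNAL_ERROR] $msg", e) // real code: SparkException.internalError
  case _ => e
}
```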
[jira] [Commented] (SPARK-47286) IN operator support
[ https://issues.apache.org/jira/browse/SPARK-47286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824795#comment-17824795 ] Aleksandar Tomic commented on SPARK-47286: -- [~gpgp] There is already a PR that should handle `IN` among other cases: https://github.com/apache/spark/pull/45383 But if you are interested in contributing to the collation track, I can propose two tracks: 1) Creation of benchmarking suites. The current implementation of collators is not very efficient performance-wise. Once we get to a stable implementation, performance optimization will be one track. 2) String expression support - you can talk to [~uros-db] about this. > IN operator support > --- > > Key: SPARK-47286 > URL: https://issues.apache.org/jira/browse/SPARK-47286 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Aleksandar Tomic >Priority: Major > > At this point the following query works fine: > ``` > sql("select * from t1 where ucs_basic_lcase in ('aaa' collate > 'ucs_basic_lcase', 'bbb' collate 'ucs_basic_lcase')").show() > ``` > But if we were to omit the explicit collate or even mix collations: > ``` > sql("select * from t1 where ucs_basic_lcase in ('aaa' collate > 'ucs_basic_lcase', 'bbb')").show() > ``` > The query would still run and return invalid results. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-47286) IN operator support
[ https://issues.apache.org/jira/browse/SPARK-47286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824811#comment-17824811 ] Gideon P commented on SPARK-47286: -- [~dbatomic] Awesome, I will create the benchmarking suite. I will let you know if I have questions and will let you know once there's something for y'all to review. Please assign https://issues.apache.org/jira/browse/SPARK-46840 to me. Thanks so much for taking me on! > IN operator support > --- > > Key: SPARK-47286 > URL: https://issues.apache.org/jira/browse/SPARK-47286 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Aleksandar Tomic >Priority: Major > > At this point the following query works fine: > ``` > sql("select * from t1 where ucs_basic_lcase in ('aaa' collate > 'ucs_basic_lcase', 'bbb' collate 'ucs_basic_lcase')").show() > ``` > But if we were to omit the explicit collate or even mix collations: > ``` > sql("select * from t1 where ucs_basic_lcase in ('aaa' collate > 'ucs_basic_lcase', 'bbb')").show() > ``` > The query would still run and return invalid results. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47328) Change utf8 collation names to UTF8_BINARY
Stefan Kandic created SPARK-47328: - Summary: Change utf8 collation names to UTF8_BINARY Key: SPARK-47328 URL: https://issues.apache.org/jira/browse/SPARK-47328 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Stefan Kandic -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47328) Change utf8 collation names to UTF8_BINARY
[ https://issues.apache.org/jira/browse/SPARK-47328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47328: --- Labels: pull-request-available (was: ) > Change utf8 collation names to UTF8_BINARY > -- > > Key: SPARK-47328 > URL: https://issues.apache.org/jira/browse/SPARK-47328 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Stefan Kandic >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44115) Upgrade Apache ORC to 2.0
[ https://issues.apache.org/jira/browse/SPARK-44115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-44115: --- Labels: pull-request-available (was: ) > Upgrade Apache ORC to 2.0 > - > > Key: SPARK-44115 > URL: https://issues.apache.org/jira/browse/SPARK-44115 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > > Apache ORC community has the following release cycles which are synchronized > with Apache Spark releases. > * ORC v2.0.0 (next year) for Apache Spark 4.0.x > * ORC v1.9.0 (this month) for Apache Spark 3.5.x > * ORC v1.8.x for Apache Spark 3.4.x > * ORC v1.7.x for Apache Spark 3.3.x > * ORC v1.6.x for Apache Spark 3.2.x -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44115) Upgrade Apache ORC to 2.0
[ https://issues.apache.org/jira/browse/SPARK-44115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-44115: - Assignee: Dongjoon Hyun > Upgrade Apache ORC to 2.0 > - > > Key: SPARK-44115 > URL: https://issues.apache.org/jira/browse/SPARK-44115 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > > Apache ORC community has the following release cycles which are synchronized > with Apache Spark releases. > * ORC v2.0.0 (next year) for Apache Spark 4.0.x > * ORC v1.9.0 (this month) for Apache Spark 3.5.x > * ORC v1.8.x for Apache Spark 3.4.x > * ORC v1.7.x for Apache Spark 3.3.x > * ORC v1.6.x for Apache Spark 3.2.x -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44115) Upgrade Apache ORC to 2.0
[ https://issues.apache.org/jira/browse/SPARK-44115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824844#comment-17824844 ] Dongjoon Hyun commented on SPARK-44115: --- Hi, [~engrravijain]. Thank you. However, this is a cross-project activity. As a member of both the Apache ORC PMC and the Apache Spark PMC, I've been working in the Apache ORC community as the release manager of Apache ORC 2.0.0; see the following link. Today, the vote passed and I finally released it. - [https://github.com/apache/orc/issues/1669] > Upgrade Apache ORC to 2.0 > - > > Key: SPARK-44115 > URL: https://issues.apache.org/jira/browse/SPARK-44115 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > > Apache ORC community has the following release cycles which are synchronized > with Apache Spark releases. > * ORC v2.0.0 (next year) for Apache Spark 4.0.x > * ORC v1.9.0 (this month) for Apache Spark 3.5.x > * ORC v1.8.x for Apache Spark 3.4.x > * ORC v1.7.x for Apache Spark 3.3.x > * ORC v1.6.x for Apache Spark 3.2.x -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47330) XML: Add XmlExpressionsSuite
Yousof Hosny created SPARK-47330: Summary: XML: Add XmlExpressionsSuite Key: SPARK-47330 URL: https://issues.apache.org/jira/browse/SPARK-47330 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Yousof Hosny Convert JsonExpressionsSuite.scala to an XML equivalent. Note that XML doesn’t implement all JSON functions, such as {{json_tuple}}, {{get_json_object}}, etc. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47330) XML: Add XmlExpressionsSuite
[ https://issues.apache.org/jira/browse/SPARK-47330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47330: --- Labels: pull-request-available (was: ) > XML: Add XmlExpressionsSuite > - > > Key: SPARK-47330 > URL: https://issues.apache.org/jira/browse/SPARK-47330 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yousof Hosny >Priority: Major > Labels: pull-request-available > > Convert JsonExpressionsSuite.scala to an XML equivalent. Note that XML doesn’t > implement all JSON functions, such as {{json_tuple}}, > {{get_json_object}}, etc. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
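For a sense of what gets ported, the public counterpart of the expressions under test can be exercised like this; a sketch assuming a SparkSession named spark and the from_xml function available in Spark 4.0, with values chosen purely for illustration:
```
import org.apache.spark.sql.functions.{col, from_xml}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Parse an XML string into a struct -- the kind of round trip an
// XmlExpressionsSuite would assert on.
val schema = StructType(Seq(StructField("a", IntegerType)))
val df = spark.sql("SELECT '<row><a>1</a></row>' AS xml")
df.select(from_xml(col("xml"), schema)).show()
```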
[jira] [Resolved] (SPARK-44115) Upgrade Apache ORC to 2.0
[ https://issues.apache.org/jira/browse/SPARK-44115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-44115. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45443 [https://github.com/apache/spark/pull/45443] > Upgrade Apache ORC to 2.0 > - > > Key: SPARK-44115 > URL: https://issues.apache.org/jira/browse/SPARK-44115 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Apache ORC community has the following release cycles which are synchronized > with Apache Spark releases. > * ORC v2.0.0 (next year) for Apache Spark 4.0.x > * ORC v1.9.0 (this month) for Apache Spark 3.5.x > * ORC v1.8.x for Apache Spark 3.4.x > * ORC v1.7.x for Apache Spark 3.3.x > * ORC v1.6.x for Apache Spark 3.2.x -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-47320) Datasets involving self joins behave in an inconsistent and unintuitive manner
[ https://issues.apache.org/jira/browse/SPARK-47320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824877#comment-17824877 ] Asif commented on SPARK-47320: -- Opened following PR [https://github.com/apache/spark/pull/45446|https://github.com/apache/spark/pull/45446] > Datasets involving self joins behave in an inconsistent and unintuitive > manner > > > Key: SPARK-47320 > URL: https://issues.apache.org/jira/browse/SPARK-47320 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.1 >Reporter: Asif >Priority: Major > > The behaviour of Datasets involving self joins is inconsistent and unintuitive in terms of when an AnalysisException is thrown due to ambiguity and when the query works. > Found situations where swapping the join order causes a query to throw ambiguity-related exceptions which it otherwise passes. Some Datasets which are unambiguous from the user's perspective will result in an AnalysisException being thrown. > After testing and fixing a bug, I think the issue lies in inconsistency in determining what constitutes ambiguous and what constitutes unambiguous. > There are two ways to look at resolution regarding ambiguity: > 1) ExprId of attributes: this is an unintuitive approach, as Spark users do not bother with ExprIds > 2) Column extraction from the Dataset using the df(col) API: this is the user-visible/understandable point of view, so determining ambiguity should be based on it. What is logically unambiguous from the user's perspective (assuming it is logically correct) should also be the basis on which Spark decides unambiguity. > For example: > {quote} > val df1 = Seq((1, 2)).toDF("a", "b") > val df2 = Seq((1, 2)).toDF("aa", "bb") > val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"), > df2("aa"), df1("b")) > val df3 = df1Joindf2.join(df1, df1Joindf2("aa") === > df1("a")).select(df1("a")) > {quote} > From perspective #1 the above code should throw an ambiguity exception, because > the join condition and projection of the df3 dataframe contain df1("a"), whose > ExprId matches both df1Joindf2 and df1. > But if we look at it from the perspective of the Dataset used to get the column, which > is the intent of the user, the expectation is that df1("a") should be resolved > to the Dataset df1 being joined, and not to > df1Joindf2. If the user intended "a" from df1Joindf2, they would have used > df1Joindf2("a"). > So in this case current Spark throws an exception, as it is using resolution > based on #1. > But by the above logic the below Dataframe should also throw an ambiguity > exception, yet it passes: > {quote} > val df1 = Seq((1, 2)).toDF("a", "b") > val df2 = Seq((1, 2)).toDF("aa", "bb") > val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"), > df2("aa"), df1("b")) > df1Joindf2.join(df1, df1Joindf2("a") === df1("a")) > {quote} > The difference between the two cases is that in the first case a select is present, > while in the second query it is not. > So this implies that in the first case the df1("a") in the projection is causing the > ambiguity issue, but the same reference in the second case, used just in the condition, is > considered unambiguous. > IMHO, the ambiguity identification criteria should be based totally on #2, > and consistently so. > In DataFrameJoinSuite and DataFrameSelfJoinSuite, if we go by #2, some of > the tests which are considered ambiguous (on the #1 criteria) become > unambiguous using the #2 criteria. > for eg: > {quote} > test("SPARK-28344: fail ambiguous self join - column ref in join condition") { > val df1 = spark.range(3) > val df2 = df1.filter($"id" > 0) > withSQLConf( > SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> "true", > SQLConf.CROSS_JOINS_ENABLED.key -> "true") { > assertAmbiguousSelfJoin(df1.join(df2, df1("id") > df2("id"))) > } > } > {quote} > The above test should not have an ambiguity exception thrown, as df1("id") and > df2("id") are unambiguous from the perspective of the Dataset -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47320) Datasets involving self joins behave in an inconsistent and unintuitive manner
[ https://issues.apache.org/jira/browse/SPARK-47320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-47320: - Description: The behaviour of Datasets involving self joins is inconsistent and unintuitive in terms of when an AnalysisException is thrown due to ambiguity and when the query works. Found situations where swapping the join order causes a query to throw ambiguity-related exceptions which it otherwise passes. Some Datasets which are unambiguous from the user's perspective will result in an AnalysisException being thrown. After testing and fixing a bug, I think the issue lies in inconsistency in determining what constitutes ambiguous and what constitutes unambiguous. There are two ways to look at resolution regarding ambiguity: 1) ExprId of attributes: this is an unintuitive approach, as Spark users do not bother with ExprIds. 2) Column extraction from the Dataset using the df(col) API: this is the user-visible/understandable point of view, so determining ambiguity should be based on it. What is logically unambiguous from the user's perspective (assuming it is logically correct) should also be the basis on which Spark decides unambiguity. For example: {quote} val df1 = Seq((1, 2)).toDF("a", "b") val df2 = Seq((1, 2)).toDF("aa", "bb") val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"), df2("aa"), df1("b")) val df3 = df1Joindf2.join(df1, df1Joindf2("aa") === df1("a")).select(df1("a")) {quote} From perspective #1 the above code should throw an ambiguity exception, because the join condition and projection of the df3 dataframe contain df1("a"), whose ExprId matches both df1Joindf2 and df1. But if we look at it from the perspective of the Dataset used to get the column, which is the intent of the user, the expectation is that df1("a") should be resolved to the Dataset df1 being joined, and not to df1Joindf2. If the user intended "a" from df1Joindf2, they would have used df1Joindf2("a"). So in this case current Spark throws an exception, as it is using resolution based on #1. But by the above logic the below Dataframe should also throw an ambiguity exception, yet it passes: {quote} val df1 = Seq((1, 2)).toDF("a", "b") val df2 = Seq((1, 2)).toDF("aa", "bb") val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"), df2("aa"), df1("b")) df1Joindf2.join(df1, df1Joindf2("a") === df1("a")) {quote} The difference between the two cases is that in the first case a select is present, while in the second query it is not. So this implies that in the first case the df1("a") in the projection is causing the ambiguity issue, but the same reference in the second case, used just in the condition, is considered unambiguous. IMHO, the ambiguity identification criteria should be based totally on #2, and consistently so. In DataFrameJoinSuite and DataFrameSelfJoinSuite, if we go by #2, some of the tests which are considered ambiguous (on the #1 criteria) become unambiguous using the #2 criteria. for eg: {quote} test("SPARK-28344: fail ambiguous self join - column ref in join condition") { val df1 = spark.range(3) val df2 = df1.filter($"id" > 0) withSQLConf( SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> "true", SQLConf.CROSS_JOINS_ENABLED.key -> "true") { assertAmbiguousSelfJoin(df1.join(df2, df1("id") > df2("id"))) } } {quote} The above test should not have an ambiguity exception thrown, as df1("id") and df2("id") are unambiguous from the perspective of the Dataset. There is an existing test in DataFrameSelfJoinSuite ` test("SPARK-28344: fail ambiguous self join - column ref in Project") { val df1 = spark.range(3) val df2 = df1.filter($"id" > 0) withSQLConf( SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> "false", SQLConf.CROSS_JOINS_ENABLED.key -> "true") { // `df2("id")` actually points to the column of `df1`. checkAnswer(df1.join(df2).select(df2("id")), Seq(0, 0, 1, 1, 2, 2).map(Row(_))) // Aliasing the dataframes and using qualified column names can fix the ambiguous self-join. val aliasedDf1 = df1.alias("left") val aliasedDf2 = df2.as("right") checkAnswer( aliasedDf1.join(aliasedDf2).select($"right.id"), Seq(1, 1, 1, 2, 2, 2).map(Row(_))) } withSQLConf( SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> "true", SQLConf.CROSS_JOINS_ENABLED.key -> "true") { // Assertion1 : existing assertAmbiguousSelfJoin(df1.join(df2).select(df2("id"))) // Assertion2 : added by me assertAmbiguousSelfJoin(df2.join(df1).select(df2("id"))) } } ` Here Assertion1 passes (that is, an ambiguity exception is thrown) but Assertion2 fails (that is, no ambiguity exception is thrown). The only change is the join order. Logically both assertions are invalid (in the sense that neither should be throwing an exception, as from the user's perspective there is no ambiguity). was: The behaviour of Datase
[jira] [Updated] (SPARK-47320) Datasets involving self joins behave in an inconsistent and unintuitive manner
[ https://issues.apache.org/jira/browse/SPARK-47320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-47320: - Description: The behaviour of Datasets involving self joins is inconsistent and unintuitive in terms of when an AnalysisException is thrown due to ambiguity and when the query works. Found situations where swapping the join order causes a query to throw ambiguity-related exceptions which it otherwise passes. Some Datasets which are unambiguous from the user's perspective will result in an AnalysisException being thrown. After testing and fixing a bug, I think the issue lies in inconsistency in determining what constitutes ambiguous and what constitutes unambiguous. There are two ways to look at resolution regarding ambiguity: 1) ExprId of attributes: this is an unintuitive approach, as Spark users do not bother with ExprIds. 2) Column extraction from the Dataset using the df(col) API: this is the user-visible/understandable point of view, so determining ambiguity should be based on it. What is logically unambiguous from the user's perspective (assuming it is logically correct) should also be the basis on which Spark decides unambiguity. For example: {quote} val df1 = Seq((1, 2)).toDF("a", "b") val df2 = Seq((1, 2)).toDF("aa", "bb") val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"), df2("aa"), df1("b")) val df3 = df1Joindf2.join(df1, df1Joindf2("aa") === df1("a")).select(df1("a")) {quote} From perspective #1 the above code should throw an ambiguity exception, because the join condition and projection of the df3 dataframe contain df1("a"), whose ExprId matches both df1Joindf2 and df1. But if we look at it from the perspective of the Dataset used to get the column, which is the intent of the user, the expectation is that df1("a") should be resolved to the Dataset df1 being joined, and not to df1Joindf2. If the user intended "a" from df1Joindf2, they would have used df1Joindf2("a"). So in this case current Spark throws an exception, as it is using resolution based on #1. But by the above logic the below Dataframe should also throw an ambiguity exception, yet it passes: {quote} val df1 = Seq((1, 2)).toDF("a", "b") val df2 = Seq((1, 2)).toDF("aa", "bb") val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"), df2("aa"), df1("b")) df1Joindf2.join(df1, df1Joindf2("a") === df1("a")) {quote} The difference between the two cases is that in the first case a select is present, while in the second query it is not. So this implies that in the first case the df1("a") in the projection is causing the ambiguity issue, but the same reference in the second case, used just in the condition, is considered unambiguous. IMHO, the ambiguity identification criteria should be based totally on #2, and consistently so. In DataFrameJoinSuite and DataFrameSelfJoinSuite, if we go by #2, some of the tests which are considered ambiguous (on the #1 criteria) become unambiguous using the #2 criteria. for eg: {quote} test("SPARK-28344: fail ambiguous self join - column ref in join condition") { val df1 = spark.range(3) val df2 = df1.filter($"id" > 0) withSQLConf( SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> "true", SQLConf.CROSS_JOINS_ENABLED.key -> "true") { assertAmbiguousSelfJoin(df1.join(df2, df1("id") > df2("id"))) } } {quote} The above test should not have an ambiguity exception thrown, as df1("id") and df2("id") are unambiguous from the perspective of the Dataset. There is an existing test in DataFrameSelfJoinSuite {quote} test("SPARK-28344: fail ambiguous self join - column ref in Project") { val df1 = spark.range(3) val df2 = df1.filter($"id" > 0) withSQLConf( SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> "false", SQLConf.CROSS_JOINS_ENABLED.key -> "true") { // `df2("id")` actually points to the column of `df1`. checkAnswer(df1.join(df2).select(df2("id")), Seq(0, 0, 1, 1, 2, 2).map(Row(_))) // Aliasing the dataframes and using qualified column names can fix the ambiguous self-join. val aliasedDf1 = df1.alias("left") val aliasedDf2 = df2.as("right") checkAnswer( aliasedDf1.join(aliasedDf2).select($"right.id"), Seq(1, 1, 1, 2, 2, 2).map(Row(_))) } withSQLConf( SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> "true", SQLConf.CROSS_JOINS_ENABLED.key -> "true") { // Assertion1 : existing assertAmbiguousSelfJoin(df1.join(df2).select(df2("id"))) // Assertion2 : added by me assertAmbiguousSelfJoin(df2.join(df1).select(df2("id"))) } } {quote} Here Assertion1 passes (that is, an ambiguity exception is thrown) but Assertion2 fails (that is, no ambiguity exception is thrown). The only change is the join order. Logically both assertions are invalid (in the sense that neither should be throwing an exception, as from the user's perspective there is no ambiguity). was: The behavi
[jira] [Created] (SPARK-47331) Serialization using case classes/primitives/POJO based on SQL encoder for Arbitrary State API v2.
Jing Zhan created SPARK-47331: - Summary: Serialization using case classes/primitives/POJO based on SQL encoder for Arbitrary State API v2. Key: SPARK-47331 URL: https://issues.apache.org/jira/browse/SPARK-47331 Project: Spark Issue Type: Task Components: Structured Streaming Affects Versions: 4.0.0 Reporter: Jing Zhan In the new operator for arbitrary state-v2, we cannot rely on the session/encoder being available, since the initialization for the various state instances happens on the executors. Also, we can only support a limited set of state types with the available encoders. Hence, for state serialization, we propose to serialize primitives/case classes/POJOs with the SQL encoder. Leveraging the SQL encoder can speed up serialization. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
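To make the proposal concrete, a hedged sketch of round-tripping a case class through a SQL (expression) encoder without a SparkSession; ExpressionEncoder is an internal Spark API and the state class name here is illustrative:
```
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

// Illustrative state type; not from the ticket.
case class CounterState(count: Long, lastKey: String)

// Serializer/deserializer pair built without a SparkSession -- the property
// the executors need when state instances are initialized remotely.
val enc = ExpressionEncoder[CounterState]()
val toRow = enc.createSerializer()
val fromRow = enc.resolveAndBind().createDeserializer()

val row = toRow(CounterState(42L, "k"))
assert(fromRow(row) == CounterState(42L, "k"))
```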
[jira] [Updated] (SPARK-47320) Datasets involving self joins behave in an inconsistent and unintuitive manner
[ https://issues.apache.org/jira/browse/SPARK-47320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-47320: - Description: The behaviour of Datasets involving self joins behave in an unintuitive manner in terms when AnalysisException is thrown due to ambiguity and when it works. Found situations where join order swapping causes query to throw Ambiguity related exceptions which otherwise passes. Some of the Datasets which from user perspective are un-ambiguous will result in Analysis Exception getting thrown. After testing and fixing a bug , I think the issue lies in inconsistency in determining what constitutes ambiguous and what is un-ambiguous. There are two ways to look at resolution regarding ambiguity 1) ExprId of attributes : This is unintuitive approach as spark users do not bother with the ExprIds 2) Column Extraction from the Dataset using df(col) api : Which is the user visible/understandable Point of View. So determining ambiguity should be based on this. What is Logically unambiguous from users perspective ( assuming its is logically correct) , should also be the basis of spark product, to decide on un-ambiguity. For Example: {quote} val df1 = Seq((1, 2)).toDF("a", "b") val df2 = Seq((1, 2)).toDF("aa", "bb") val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"), df2("aa"), df1("b")) val df3 = df1Joindf2.join(df1, df1Joindf2("aa") === df1("a")).select(df1("a")) {quote} The above code from perspective #1 should throw ambiguity exception, because the join condition and projection of df3 dataframe, has df1("a) which has exprId which matches both df1Joindf2 and df1. But if we look is from perspective of Dataset used to get column, which is the intent of the user, the expectation is that df1("a) should be resolved to Dataset df1 being joined, and not df1Joindf2. If user intended "a" from df1Joindf2, then would have used df1Joindf2("a") So In this case , current spark throws Exception as it is using resolution based on # 1 But the below Dataframe by the above logic, should also throw Ambiguity Exception but it passes {quote} val df1 = Seq((1, 2)).toDF("a", "b") val df2 = Seq((1, 2)).toDF("aa", "bb") val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"), df2("aa"), df1("b")) df1Joindf2.join(df1, df1Joindf2("a") === df1("a")) {quote} The difference in the 2 cases is that in the first case , select is present. But in the 2nd query, select is not there. So this implies that in 1st case the df1("a") in projection is causing ambiguity issue, but same reference in 2nd case, used just in condition, is considered un-ambiguous. IMHO , the ambiguity identification criteria should be based totally on #2 and consistently. In the DataFrameJoinTest and DataFrameSelfJoinTest, if we go by #2, some of the tests which are being considered ambiguous ( on # 1 criteria) become un-ambiguous using (#2) criteria. 
for eg: test("SPARK-28344: fail ambiguous self join - column ref in join condition") { val df1 = spark.range(3) val df2 = df1.filter($"id" > 0) @@ -118,29 +139,32 @@ class DataFrameSelfJoinSuite extends QueryTest with SharedSparkSession { withSQLConf( SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> "true", SQLConf.CROSS_JOINS_ENABLED.key -> "true") { assertAmbiguousSelfJoin(df1.join(df2, df1("id") > df2("id"))) } } {quote} The above test should not have ambiguity exception thrown as df1("id") and df2("id") are un-ambiguous from perspective of Dataset There is an existing test in DataFrameSelfJoinSuite ``` test("SPARK-28344: fail ambiguous self join - column ref in Project") { val df1 = spark.range(3) val df2 = df1.filter($"id" > 0) withSQLConf( SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> "false", SQLConf.CROSS_JOINS_ENABLED.key -> "true") { // `df2("id")` actually points to the column of `df1`. checkAnswer(df1.join(df2).select(df2("id")), Seq(0, 0, 1, 1, 2, 2).map(Row(_))) // Alias the dataframe and use qualified column names can fix ambiguous self-join. val aliasedDf1 = df1.alias("left") val aliasedDf2 = df2.as("right") checkAnswer( aliasedDf1.join(aliasedDf2).select($"right.id"), Seq(1, 1, 1, 2, 2, 2).map(Row(_))) } withSQLConf( SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> "true", SQLConf.CROSS_JOINS_ENABLED.key -> "true") { // Assertion1 : existing assertAmbiguousSelfJoin(df1.join(df2).select(df2("id"))) // Assertion2 : added by me assertAmbiguousSelfJoin(df2.join(df1).select(df2("id"))) } } ``` Here the Assertion1 passes ( that is ambiguous exception is thrown) But the Assertion2 fails ( that is no ambiguous exception is thrown) The only chnage is the join order Logically both the assertions are invalid ( In the sense both should NOT be throwing Exception as from the user's perspective there is no ambiguity. was: The behaviour of Datasets
[jira] [Updated] (SPARK-47331) Serialization using case classes/primitives/POJO based on SQL encoder for Arbitrary State API v2.
[ https://issues.apache.org/jira/browse/SPARK-47331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47331: --- Labels: pull-request-available (was: ) > Serialization using case classes/primitives/POJO based on SQL encoder for > Arbitrary State API v2. > -- > > Key: SPARK-47331 > URL: https://issues.apache.org/jira/browse/SPARK-47331 > Project: Spark > Issue Type: Task > Components: Structured Streaming > Affects Versions: 4.0.0 > Reporter: Jing Zhan > Priority: Major > Labels: pull-request-available > > In the new operator for arbitrary state v2, we cannot rely on the > session/encoder being available, since the initialization of the various > state instances happens on the executors. Also, we can only support a limited > set of state types with the available encoders. Hence, for state serialization, > we propose to serialize primitives/case classes/POJOs with the SQL encoder. > Leveraging the SQL encoder can speed up serialization. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47332) StreamingPythonRunner doesn't need redundant logic for starting Python process
Wei Liu created SPARK-47332: --- Summary: StreamingPythonRunner doesn't need redundant logic for starting Python process Key: SPARK-47332 URL: https://issues.apache.org/jira/browse/SPARK-47332 Project: Spark Issue Type: New Feature Components: Connect, SS, Structured Streaming Affects Versions: 4.0.0 Reporter: Wei Liu https://github.com/apache/spark/pull/45023#discussion_r1516609093 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47320) Datasets involving self joins behave in an inconsistent and unintuitive manner
[ https://issues.apache.org/jira/browse/SPARK-47320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-47320: - Labels: pull-request-available (was: )
> Datasets involving self joins behave in an inconsistent and unintuitive
> manner
>
> Key: SPARK-47320
> URL: https://issues.apache.org/jira/browse/SPARK-47320
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.5.1
> Reporter: Asif
> Priority: Major
> Labels: pull-request-available
>
> Datasets involving self joins behave in an inconsistent and unintuitive manner with respect to
> when an AnalysisException is thrown due to ambiguity and when the query works.
> There are situations where swapping the join order causes a query to throw an ambiguity-related
> exception that otherwise passes, and some Datasets that are unambiguous from the user's
> perspective still result in an AnalysisException.
> After testing and fixing a bug, I think the issue lies in inconsistency in determining what
> constitutes an ambiguous reference and what does not.
> There are two ways to look at resolution regarding ambiguity:
> 1) ExprId of attributes: this is an unintuitive approach, as Spark users do not deal with ExprIds.
> 2) Column extraction from the Dataset using the df(col) API: this is the user-visible/understandable
> point of view, so determining ambiguity should be based on it. What is logically unambiguous from
> the user's perspective (assuming it is logically correct) should also be the basis on which Spark
> decides ambiguity.
> For example:
> {quote}
> val df1 = Seq((1, 2)).toDF("a", "b")
> val df2 = Seq((1, 2)).toDF("aa", "bb")
> val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"), df2("aa"), df1("b"))
> val df3 = df1Joindf2.join(df1, df1Joindf2("aa") === df1("a")).select(df1("a"))
> {quote}
> From perspective #1 the above code should throw an ambiguity exception, because df1("a"), used in
> both the join condition and the projection of df3, has an exprId that matches both df1Joindf2 and
> df1. But if we look at it from the perspective of the Dataset used to get the column, which
> reflects the user's intent, the expectation is that df1("a") resolves to the Dataset df1 being
> joined, not to df1Joindf2. If the user had intended "a" from df1Joindf2, they would have used
> df1Joindf2("a"). In this case, current Spark throws an exception because it resolves based on #1.
> But by the same logic the DataFrame below should also throw an ambiguity exception, yet it passes:
> {quote}
> val df1 = Seq((1, 2)).toDF("a", "b")
> val df2 = Seq((1, 2)).toDF("aa", "bb")
> val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"), df2("aa"), df1("b"))
> df1Joindf2.join(df1, df1Joindf2("a") === df1("a"))
> {quote}
> The difference between the two cases is that in the first a select is present, while in the second
> it is not. This implies that in the first case the df1("a") in the projection causes the ambiguity
> issue, while the same reference in the second case, used only in the join condition, is considered
> unambiguous.
> IMHO, the ambiguity identification criteria should be based entirely and consistently on #2.
> In DataFrameJoinTest and DataFrameSelfJoinTest, if we go by #2, some of the tests that are
> considered ambiguous under criterion #1 become unambiguous under criterion #2.
> There is an existing test in DataFrameSelfJoinSuite:
> {quote}
> test("SPARK-28344: fail ambiguous self join - column ref in Project") {
>   val df1 = spark.range(3)
>   val df2 = df1.filter($"id" > 0)
>   // Assertion1: existing
>   assertAmbiguousSelfJoin(df1.join(df2).select(df2("id")))
>   // Assertion2: added by me
>   assertAmbiguousSelfJoin(df2.join(df1).select(df2("id")))
> }
> {quote}
> Here Assertion1 passes (that is, the ambiguity exception is thrown), but Assertion2 fails (that
> is, no ambiguity exception is thrown). The only change is the join order.
> Logically both assertions are invalid (in the sense that neither should throw an exception, as
> from the user's perspective there is no ambiguity).
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
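The order dependence described above can be reproduced outside the test suite with a short, self-contained sketch; this follows the ticket's claims, and the config keys below are the runtime names behind SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED and SQLConf.CROSS_JOINS_ENABLED:
{quote}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("self-join-order").getOrCreate()
import spark.implicits._

spark.conf.set("spark.sql.analyzer.failAmbiguousSelfJoin", "true")
spark.conf.set("spark.sql.crossJoin.enabled", "true")

val df1 = spark.range(3).toDF("id")
val df2 = df1.filter($"id" > 0)

// Per the report: this throws an ambiguous self-join AnalysisException
// (kept commented out so the sketch runs through)...
// df1.join(df2).select(df2("id")).collect()

// ...while the same column reference with the join order swapped does not:
df2.join(df1).select(df2("id")).collect()
{quote}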