[jira] [Resolved] (SPARK-47265) Use `createTable(..., schema: StructType, ...)` instead of `createTable(..., columns: Array[Column], ...)` in UT

2024-03-08 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-47265.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45368
[https://github.com/apache/spark/pull/45368]

> Use `createTable(..., schema: StructType, ...)` instead of `createTable(..., 
> columns: Array[Column], ...)` in UT
> 
>
> Key: SPARK-47265
> URL: https://issues.apache.org/jira/browse/SPARK-47265
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-47265) Use `createTable(..., schema: StructType, ...)` instead of `createTable(..., columns: Array[Column], ...)` in UT

2024-03-08 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-47265:
---

Assignee: BingKun Pan

> Use `createTable(..., schema: StructType, ...)` instead of `createTable(..., 
> columns: Array[Column], ...)` in UT
> 
>
> Key: SPARK-47265
> URL: https://issues.apache.org/jira/browse/SPARK-47265
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>







[jira] [Updated] (SPARK-47319) Improve missingInput calculation

2024-03-08 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-47319:
---
Summary: Improve missingInput calculation  (was: Fix missingInput 
calculation)

> Improve missingInput calculation
> 
>
> Key: SPARK-47319
> URL: https://issues.apache.org/jira/browse/SPARK-47319
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Updated] (SPARK-47319) Improve missingInput calculation

2024-03-08 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-47319:
---
Description: {{QueryPlan.missingInput()}} calculation seems to be the root 
cause of {{DeduplicateRelations}} slowness. Let's try to improve it.
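
For context, the computation in question has roughly the following shape 
(paraphrased from QueryPlan, a sketch rather than the exact source):

```scala
// Attributes an operator references that are produced neither by its
// children nor by the operator itself are "missing". Recomputing these
// attribute sets over and over on large plans is the suspected cost.
def missingInput: AttributeSet = references -- inputSet -- producedAttributes
```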

> Improve missingInput calculation
> 
>
> Key: SPARK-47319
> URL: https://issues.apache.org/jira/browse/SPARK-47319
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> {{QueryPlan.missingInput()}} calculation seems to be the root cause of 
> {{DeduplicateRelations}} slowness. Let's try to improve it.






[jira] [Updated] (SPARK-47319) Improve missingInput calculation

2024-03-08 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-47319:
---
Description: {{QueryPlan.missingInput()}} calculation seems to be the root 
cause of {{DeduplicateRelations}} slowness in some cases. Let's try to improve 
it.  (was: {{QueryPlan.missingInput()}} calculation seems to be the root cause 
of {{DeduplicateRelations}} slowness. Let's try to improve it.)

> Improve missingInput calculation
> 
>
> Key: SPARK-47319
> URL: https://issues.apache.org/jira/browse/SPARK-47319
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Peter Toth
>Assignee: Peter Toth
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> {{QueryPlan.missingInput()}} calculation seems to be the root cause of 
> {{DeduplicateRelations}} slowness in some cases. Let's try to improve it.






[jira] [Assigned] (SPARK-47322) Make `withColumnsRenamed` duplicated column name handling consistent with `withColumnRenamed`

2024-03-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47322:
--

Assignee: Apache Spark

> Make `withColumnsRenamed` duplicated column name handling consistent with 
> `withColumnRenamed` 
> -
>
> Key: SPARK-47322
> URL: https://issues.apache.org/jira/browse/SPARK-47322
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Assigned] (SPARK-47322) Make `withColumnsRenamed` duplicated column name handling consistent with `withColumnRenamed`

2024-03-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47322:
--

Assignee: (was: Apache Spark)

> Make `withColumnsRenamed` duplicated column name handling consistent with 
> `withColumnRenamed` 
> -
>
> Key: SPARK-47322
> URL: https://issues.apache.org/jira/browse/SPARK-47322
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Assigned] (SPARK-47254) Assign names to the error classes _LEGACY_ERROR_TEMP_325[1-9]

2024-03-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47254:
--

Assignee: (was: Apache Spark)

> Assign names to the error classes _LEGACY_ERROR_TEMP_325[1-9]
> -
>
> Key: SPARK-47254
> URL: https://issues.apache.org/jira/browse/SPARK-47254
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: pull-request-available, starter
>
> Choose a proper name for the error class *_LEGACY_ERROR_TEMP_325[1-9]* 
> defined in {*}core/src/main/resources/error/error-classes.json{*}. The name 
> should be short but complete (see the examples in error-classes.json).
> Add a test that triggers the error from user code if such a test doesn't 
> exist yet. Check the exception fields by using {*}checkError(){*}. That 
> function checks only the meaningful error fields and avoids depending on 
> the error text message, so tech editors can modify the error format in 
> error-classes.json without worrying about Spark's internal tests. Migrate 
> other tests that might trigger the error onto checkError().
> If you cannot reproduce the error from user space (using a SQL query), 
> replace it with an internal error; see {*}SparkException.internalError(){*}.
> Improve the error message format in error-classes.json if the current one 
> is not clear, and propose to users a way to avoid and fix such errors.
> Please look at the PRs below as examples:
>  * [https://github.com/apache/spark/pull/38685]
>  * [https://github.com/apache/spark/pull/38656]
>  * [https://github.com/apache/spark/pull/38490]
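
For illustration, a minimal sketch of the requested test shape; the error 
class name, query, and parameters below are hypothetical placeholders, not 
real entries from error-classes.json:

```scala
import org.apache.spark.sql.AnalysisException

// Hypothetical: "SOME_ERROR_CLASS" stands in for the name chosen to replace
// _LEGACY_ERROR_TEMP_325x; the query and parameters are placeholders too.
test("SOME_ERROR_CLASS: triggered from user code") {
  val e = intercept[AnalysisException] {
    sql("SELECT ...") // user-space query that hits the error
  }
  checkError(
    exception = e,
    errorClass = "SOME_ERROR_CLASS",
    parameters = Map("objectName" -> "`foo`"))
}
```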






[jira] [Assigned] (SPARK-47254) Assign names to the error classes _LEGACY_ERROR_TEMP_325[1-9]

2024-03-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47254:
--

Assignee: Apache Spark

> Assign names to the error classes _LEGACY_ERROR_TEMP_325[1-9]
> -
>
> Key: SPARK-47254
> URL: https://issues.apache.org/jira/browse/SPARK-47254
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Minor
>  Labels: pull-request-available, starter
>
> Choose a proper name for the error class *_LEGACY_ERROR_TEMP_325[1-9]* 
> defined in {*}core/src/main/resources/error/error-classes.json{*}. The name 
> should be short but complete (see the examples in error-classes.json).
> Add a test that triggers the error from user code if such a test doesn't 
> exist yet. Check the exception fields by using {*}checkError(){*}. That 
> function checks only the meaningful error fields and avoids depending on 
> the error text message, so tech editors can modify the error format in 
> error-classes.json without worrying about Spark's internal tests. Migrate 
> other tests that might trigger the error onto checkError().
> If you cannot reproduce the error from user space (using a SQL query), 
> replace it with an internal error; see {*}SparkException.internalError(){*}.
> Improve the error message format in error-classes.json if the current one 
> is not clear, and propose to users a way to avoid and fix such errors.
> Please look at the PRs below as examples:
>  * [https://github.com/apache/spark/pull/38685]
>  * [https://github.com/apache/spark/pull/38656]
>  * [https://github.com/apache/spark/pull/38490]






[jira] [Assigned] (SPARK-47316) Fix TimestampNTZ in Postgres Array

2024-03-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-47316:
--

Assignee: (was: Apache Spark)

> Fix TimestampNTZ in Postgres Array 
> ---
>
> Key: SPARK-47316
> URL: https://issues.apache.org/jira/browse/SPARK-47316
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Commented] (SPARK-47255) Assign names to the error classes _LEGACY_ERROR_TEMP_324[7-9]

2024-03-08 Thread Milan Dankovic (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824692#comment-17824692
 ] 

Milan Dankovic commented on SPARK-47255:


I am working on it.

> Assign names to the error classes _LEGACY_ERROR_TEMP_324[7-9]
> -
>
> Key: SPARK-47255
> URL: https://issues.apache.org/jira/browse/SPARK-47255
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: starter
>
> Choose a proper name for the error class *_LEGACY_ERROR_TEMP_324[7-9]* 
> defined in {*}core/src/main/resources/error/error-classes.json{*}. The name 
> should be short but complete (see the examples in error-classes.json).
> Add a test that triggers the error from user code if such a test doesn't 
> exist yet. Check the exception fields by using {*}checkError(){*}. That 
> function checks only the meaningful error fields and avoids depending on 
> the error text message, so tech editors can modify the error format in 
> error-classes.json without worrying about Spark's internal tests. Migrate 
> other tests that might trigger the error onto checkError().
> If you cannot reproduce the error from user space (using a SQL query), 
> replace it with an internal error; see {*}SparkException.internalError(){*}.
> Improve the error message format in error-classes.json if the current one 
> is not clear, and propose to users a way to avoid and fix such errors.
> Please look at the PRs below as examples:
>  * [https://github.com/apache/spark/pull/38685]
>  * [https://github.com/apache/spark/pull/38656]
>  * [https://github.com/apache/spark/pull/38490]






[jira] [Created] (SPARK-47324) Call convertJavaTimestampToTimeStamp in Array getter

2024-03-08 Thread Kent Yao (Jira)
Kent Yao created SPARK-47324:


 Summary: Call convertJavaTimestampToTimeStamp in Array getter
 Key: SPARK-47324
 URL: https://issues.apache.org/jira/browse/SPARK-47324
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.1
Reporter: Kent Yao









[jira] [Updated] (SPARK-47255) Assign names to the error classes _LEGACY_ERROR_TEMP_324[7-9]

2024-03-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47255:
---
Labels: pull-request-available starter  (was: starter)

> Assign names to the error classes _LEGACY_ERROR_TEMP_324[7-9]
> -
>
> Key: SPARK-47255
> URL: https://issues.apache.org/jira/browse/SPARK-47255
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Max Gekk
>Priority: Minor
>  Labels: pull-request-available, starter
>
> Choose a proper name for the error class *_LEGACY_ERROR_TEMP_324[7-9]* 
> defined in {*}core/src/main/resources/error/error-classes.json{*}. The name 
> should be short but complete (see the examples in error-classes.json).
> Add a test that triggers the error from user code if such a test doesn't 
> exist yet. Check the exception fields by using {*}checkError(){*}. That 
> function checks only the meaningful error fields and avoids depending on 
> the error text message, so tech editors can modify the error format in 
> error-classes.json without worrying about Spark's internal tests. Migrate 
> other tests that might trigger the error onto checkError().
> If you cannot reproduce the error from user space (using a SQL query), 
> replace it with an internal error; see {*}SparkException.internalError(){*}.
> Improve the error message format in error-classes.json if the current one 
> is not clear, and propose to users a way to avoid and fix such errors.
> Please look at the PRs below as examples:
>  * [https://github.com/apache/spark/pull/38685]
>  * [https://github.com/apache/spark/pull/38656]
>  * [https://github.com/apache/spark/pull/38490]






[jira] [Updated] (SPARK-47324) Call convertJavaTimestampToTimeStamp in Array getter

2024-03-08 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-47324:
-
Affects Version/s: 4.0.0
   (was: 3.5.1)

> Call convertJavaTimestampToTimeStamp in Array getter
> 
>
> Key: SPARK-47324
> URL: https://issues.apache.org/jira/browse/SPARK-47324
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
>







[jira] [Resolved] (SPARK-47316) Fix TimestampNTZ in Postgres Array

2024-03-08 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-47316.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45418
[https://github.com/apache/spark/pull/45418]

> Fix TimestampNTZ in Postgres Array 
> ---
>
> Key: SPARK-47316
> URL: https://issues.apache.org/jira/browse/SPARK-47316
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-47316) Fix TimestampNTZ in Postgres Array

2024-03-08 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-47316:


Assignee: Kent Yao

> Fix TimestampNTZ in Postgres Array 
> ---
>
> Key: SPARK-47316
> URL: https://issues.apache.org/jira/browse/SPARK-47316
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-47302) Collation name should be identifier

2024-03-08 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-47302.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45405
[https://github.com/apache/spark/pull/45405]

> Collation name should be identifier
> ---
>
> Key: SPARK-47302
> URL: https://issues.apache.org/jira/browse/SPARK-47302
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Aleksandar Tomic
>Assignee: Aleksandar Tomic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Currently collation names are parsed as string literals.
> Per the spec, they should be multi-part identifiers (see the spec linked 
> from the root collation Jira).
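
As an illustration of the intended change (assumed syntax; the exact grammar 
comes from the spec linked from the root collation Jira):

```scala
// Today the parser only accepts a string literal after COLLATE:
sql("SELECT 'abc' COLLATE 'ucs_basic_lcase'")
// Per the spec, the name should be a (possibly multi-part) identifier:
sql("SELECT 'abc' COLLATE ucs_basic_lcase")
```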






[jira] [Assigned] (SPARK-47302) Collation name should be identifier

2024-03-08 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-47302:


Assignee: Aleksandar Tomic

> Collation name should be identifier
> ---
>
> Key: SPARK-47302
> URL: https://issues.apache.org/jira/browse/SPARK-47302
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Aleksandar Tomic
>Assignee: Aleksandar Tomic
>Priority: Major
>  Labels: pull-request-available
>
> Currently collation names are parsed as string literals.
> Per the spec, they should be multi-part identifiers (see the spec linked 
> from the root collation Jira).






[jira] [Created] (SPARK-47325) Use the latest buf-setup-action in github workflow

2024-03-08 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-47325:
---

 Summary: Use the latest buf-setup-action in github workflow
 Key: SPARK-47325
 URL: https://issues.apache.org/jira/browse/SPARK-47325
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 4.0.0
Reporter: BingKun Pan









[jira] [Updated] (SPARK-47325) Use the latest buf-setup-action in github workflow

2024-03-08 Thread BingKun Pan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BingKun Pan updated SPARK-47325:

Component/s: Project Infra
 (was: Build)

> Use the latest buf-setup-action in github workflow
> --
>
> Key: SPARK-47325
> URL: https://issues.apache.org/jira/browse/SPARK-47325
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>







[jira] [Updated] (SPARK-47325) Use the latest buf-setup-action in github workflow

2024-03-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47325:
---
Labels: pull-request-available  (was: )

> Use the latest buf-setup-action in github workflow
> --
>
> Key: SPARK-47325
> URL: https://issues.apache.org/jira/browse/SPARK-47325
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>







[jira] [Assigned] (SPARK-47322) Make `withColumnsRenamed` duplicated column name handling consistent with `withColumnRenamed`

2024-03-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-47322:


Assignee: Ruifeng Zheng

> Make `withColumnsRenamed` duplicated column name handling consistent with 
> `withColumnRenamed` 
> -
>
> Key: SPARK-47322
> URL: https://issues.apache.org/jira/browse/SPARK-47322
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-47322) Make `withColumnsRenamed` duplicated column name handling consistent with `withColumnRenamed`

2024-03-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-47322.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45431
[https://github.com/apache/spark/pull/45431]

> Make `withColumnsRenamed` duplicated column name handling consistent with 
> `withColumnRenamed` 
> -
>
> Key: SPARK-47322
> URL: https://issues.apache.org/jira/browse/SPARK-47322
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Created] (SPARK-47326) Moving tests to related Suites

2024-03-08 Thread Mihailo Milosevic (Jira)
Mihailo Milosevic created SPARK-47326:
-

 Summary: Moving tests to related Suites
 Key: SPARK-47326
 URL: https://issues.apache.org/jira/browse/SPARK-47326
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Mihailo Milosevic









[jira] [Updated] (SPARK-47324) Call convertJavaTimestampToTimeStamp in Array getter

2024-03-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47324:
---
Labels: pull-request-available  (was: )

> Call convertJavaTimestampToTimeStamp in Array getter
> 
>
> Key: SPARK-47324
> URL: https://issues.apache.org/jira/browse/SPARK-47324
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Commented] (SPARK-47286) IN operator support

2024-03-08 Thread Gideon P (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824786#comment-17824786
 ] 

Gideon P commented on SPARK-47286:
--

[~dbatomic] can I raise a PR to implement this one?

> IN operator support
> ---
>
> Key: SPARK-47286
> URL: https://issues.apache.org/jira/browse/SPARK-47286
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Aleksandar Tomic
>Priority: Major
>
> At this point the following query works fine:
> ```
> sql("select * from t1 where ucs_basic_lcase in ('aaa' collate 
> 'ucs_basic_lcase', 'bbb' collate 'ucs_basic_lcase')").show()
> ```
> But if we omit the explicit collate, or even mix collations:
> ```
> sql("select * from t1 where ucs_basic_lcase in ('aaa' collate 
> 'ucs_basic_lcase', 'bbb')").show()
> ```
> the query would still run and return invalid results.






[jira] [Created] (SPARK-47327) Fix thread safety issue in ICU Collator

2024-03-08 Thread Stefan Kandic (Jira)
Stefan Kandic created SPARK-47327:
-

 Summary: Fix thread safety issue in ICU Collator
 Key: SPARK-47327
 URL: https://issues.apache.org/jira/browse/SPARK-47327
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 4.0.0
Reporter: Stefan Kandic


The ICU Collator is not thread-safe by default, so we have to freeze it in 
order to produce correct results in a multi-threaded environment.
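
For context, a minimal sketch of the ICU4J freezing idiom (illustrative, not 
the actual patch):

```scala
import com.ibm.icu.text.Collator
import com.ibm.icu.util.ULocale

// A mutable Collator must not be shared across threads. freeze() makes the
// instance immutable per ICU's Freezable contract, so it can be cached and
// used concurrently.
val collator = Collator.getInstance(ULocale.ROOT)
collator.setStrength(Collator.SECONDARY) // configure while still mutable
val frozen: Collator = collator.freeze() // now safe to share
```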






[jira] [Updated] (SPARK-47327) Fix thread safety issue in ICU Collator

2024-03-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47327:
---
Labels: pull-request-available  (was: )

> Fix thread safety issue in ICU Collator
> ---
>
> Key: SPARK-47327
> URL: https://issues.apache.org/jira/browse/SPARK-47327
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 4.0.0
>Reporter: Stefan Kandic
>Priority: Major
>  Labels: pull-request-available
>
> The ICU Collator is not thread-safe by default, so we have to freeze it in 
> order to produce correct results in a multi-threaded environment.






[jira] [Updated] (SPARK-47313) scala.MatchError should be treated as internal error

2024-03-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47313:
---
Labels: pull-request-available  (was: )

> scala.MatchError should be treated as internal error
> 
>
> Key: SPARK-47313
> URL: https://issues.apache.org/jira/browse/SPARK-47313
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
>
> We should update `QueryExecution.toInternalError` to handle scala.MatchError
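
A hedged sketch of the proposed shape (paraphrasing the existing pattern 
match in QueryExecution rather than quoting it):

```scala
import org.apache.spark.SparkException

// Sketch: treat an escaping scala.MatchError like the other "impossible"
// exceptions and convert it into an internal Spark error.
def toInternalError(msg: String, e: Throwable): Throwable = e match {
  case _: java.lang.NullPointerException | _: java.lang.AssertionError |
      _: scala.MatchError =>
    SparkException.internalError(s"$msg Please report this bug.", e)
  case other => other
}
```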






[jira] [Commented] (SPARK-47286) IN operator support

2024-03-08 Thread Aleksandar Tomic (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824795#comment-17824795
 ] 

Aleksandar Tomic commented on SPARK-47286:
--

[~gpgp] There is already a PR that should handle `IN` among other cases:
https://github.com/apache/spark/pull/45383

But if you are interested in contributing to the collation track, I can 
propose two tracks:
1) Creation of benchmarking suites. The current implementation of collators 
is not very performance-efficient, so once we reach a stable implementation, 
performance optimization will be one track.
2) String expression support - you can talk to [~uros-db] about this.

> IN operator support
> ---
>
> Key: SPARK-47286
> URL: https://issues.apache.org/jira/browse/SPARK-47286
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Aleksandar Tomic
>Priority: Major
>
> At this point the following query works fine:
> ```
> sql("select * from t1 where ucs_basic_lcase in ('aaa' collate 
> 'ucs_basic_lcase', 'bbb' collate 'ucs_basic_lcase')").show()
> ```
> But if we omit the explicit collate, or even mix collations:
> ```
> sql("select * from t1 where ucs_basic_lcase in ('aaa' collate 
> 'ucs_basic_lcase', 'bbb')").show()
> ```
> the query would still run and return invalid results.






[jira] [Commented] (SPARK-47286) IN operator support

2024-03-08 Thread Gideon P (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824811#comment-17824811
 ] 

Gideon P commented on SPARK-47286:
--

[~dbatomic] Awesome, I will create the benchmarking suite. I will let you know 
if I have questions, and again once there is something for y'all to review. 
Please assign https://issues.apache.org/jira/browse/SPARK-46840 to me. Thanks 
so much for taking me on!

> IN operator support
> ---
>
> Key: SPARK-47286
> URL: https://issues.apache.org/jira/browse/SPARK-47286
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Aleksandar Tomic
>Priority: Major
>
> At this point the following query works fine:
> ```
> sql("select * from t1 where ucs_basic_lcase in ('aaa' collate 
> 'ucs_basic_lcase', 'bbb' collate 'ucs_basic_lcase')").show()
> ```
> But if we omit the explicit collate, or even mix collations:
> ```
> sql("select * from t1 where ucs_basic_lcase in ('aaa' collate 
> 'ucs_basic_lcase', 'bbb')").show()
> ```
> the query would still run and return invalid results.






[jira] [Created] (SPARK-47328) Change utf8 collation names to UTF8_BINARY

2024-03-08 Thread Stefan Kandic (Jira)
Stefan Kandic created SPARK-47328:
-

 Summary: Change utf8 collation names to UTF8_BINARY
 Key: SPARK-47328
 URL: https://issues.apache.org/jira/browse/SPARK-47328
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Stefan Kandic









[jira] [Updated] (SPARK-47328) Change utf8 collation names to UTF8_BINARY

2024-03-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47328:
---
Labels: pull-request-available  (was: )

> Change utf8 collation names to UTF8_BINARY
> --
>
> Key: SPARK-47328
> URL: https://issues.apache.org/jira/browse/SPARK-47328
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Stefan Kandic
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Updated] (SPARK-44115) Upgrade Apache ORC to 2.0

2024-03-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-44115:
---
Labels: pull-request-available  (was: )

> Upgrade Apache ORC to 2.0
> -
>
> Key: SPARK-44115
> URL: https://issues.apache.org/jira/browse/SPARK-44115
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>
> The Apache ORC community has the following release cycles, which are 
> synchronized with Apache Spark releases.
>  * ORC v2.0.0 (next year) for Apache Spark 4.0.x
>  * ORC v1.9.0 (this month) for Apache Spark 3.5.x
>  * ORC v1.8.x for Apache Spark 3.4.x
>  * ORC v1.7.x for Apache Spark 3.3.x
>  * ORC v1.6.x for Apache Spark 3.2.x






[jira] [Assigned] (SPARK-44115) Upgrade Apache ORC to 2.0

2024-03-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-44115:
-

Assignee: Dongjoon Hyun

> Upgrade Apache ORC to 2.0
> -
>
> Key: SPARK-44115
> URL: https://issues.apache.org/jira/browse/SPARK-44115
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>
> The Apache ORC community has the following release cycles, which are 
> synchronized with Apache Spark releases.
>  * ORC v2.0.0 (next year) for Apache Spark 4.0.x
>  * ORC v1.9.0 (this month) for Apache Spark 3.5.x
>  * ORC v1.8.x for Apache Spark 3.4.x
>  * ORC v1.7.x for Apache Spark 3.3.x
>  * ORC v1.6.x for Apache Spark 3.2.x






[jira] [Commented] (SPARK-44115) Upgrade Apache ORC to 2.0

2024-03-08 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824844#comment-17824844
 ] 

Dongjoon Hyun commented on SPARK-44115:
---

Hi, [~engrravijain]. Thank you. However, this is a cross-project activity. As 
a member of both the Apache ORC PMC and the Apache Spark PMC, I've been 
working in the Apache ORC community as the release manager of Apache ORC 
2.0.0; see the link below. Today the vote passed and I finally released it.

- [https://github.com/apache/orc/issues/1669]

> Upgrade Apache ORC to 2.0
> -
>
> Key: SPARK-44115
> URL: https://issues.apache.org/jira/browse/SPARK-44115
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>
> The Apache ORC community has the following release cycles, which are 
> synchronized with Apache Spark releases.
>  * ORC v2.0.0 (next year) for Apache Spark 4.0.x
>  * ORC v1.9.0 (this month) for Apache Spark 3.5.x
>  * ORC v1.8.x for Apache Spark 3.4.x
>  * ORC v1.7.x for Apache Spark 3.3.x
>  * ORC v1.6.x for Apache Spark 3.2.x






[jira] [Created] (SPARK-47330) XML: Add XmlExpressionsSuite

2024-03-08 Thread Yousof Hosny (Jira)
Yousof Hosny created SPARK-47330:


 Summary: XML: Add XmlExpressionsSuite 
 Key: SPARK-47330
 URL: https://issues.apache.org/jira/browse/SPARK-47330
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Yousof Hosny


Convert JsonExpressionsSuite.scala to an XML equivalent. Note that XML doesn't 
implement all JSON functions, such as {{json_tuple}}, {{get_json_object}}, 
etc.
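
A rough sketch of the flavor of coverage being ported, shown via the public 
API and assuming a SparkSession named spark with its implicits in scope (the 
actual suite exercises the expressions directly):

```scala
import org.apache.spark.sql.functions.from_xml
import org.apache.spark.sql.types.{IntegerType, StructType}
import spark.implicits._

// Only functions with an XML counterpart get ported; json_tuple and
// get_json_object have no XML equivalent and are skipped.
val schema = new StructType().add("a", IntegerType)
val df = Seq("<row><a>1</a></row>").toDF("xml")
df.select(from_xml($"xml", schema)).show()
```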






[jira] [Updated] (SPARK-47330) XML: Add XmlExpressionsSuite

2024-03-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47330:
---
Labels: pull-request-available  (was: )

> XML: Add XmlExpressionsSuite 
> -
>
> Key: SPARK-47330
> URL: https://issues.apache.org/jira/browse/SPARK-47330
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Yousof Hosny
>Priority: Major
>  Labels: pull-request-available
>
> Convert JsonExpressionsSuite.scala to an XML equivalent. Note that XML 
> doesn't implement all JSON functions, such as {{json_tuple}}, 
> {{get_json_object}}, etc.






[jira] [Resolved] (SPARK-44115) Upgrade Apache ORC to 2.0

2024-03-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-44115.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45443
[https://github.com/apache/spark/pull/45443]

> Upgrade Apache ORC to 2.0
> -
>
> Key: SPARK-44115
> URL: https://issues.apache.org/jira/browse/SPARK-44115
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> The Apache ORC community has the following release cycles, which are 
> synchronized with Apache Spark releases.
>  * ORC v2.0.0 (next year) for Apache Spark 4.0.x
>  * ORC v1.9.0 (this month) for Apache Spark 3.5.x
>  * ORC v1.8.x for Apache Spark 3.4.x
>  * ORC v1.7.x for Apache Spark 3.3.x
>  * ORC v1.6.x for Apache Spark 3.2.x






[jira] [Commented] (SPARK-47320) Datasets involving self joins behave in an inconsistent and unintuitive manner

2024-03-08 Thread Asif (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824877#comment-17824877
 ] 

Asif commented on SPARK-47320:
--

Opened the following PR:
[https://github.com/apache/spark/pull/45446]

> Datasets involving self joins behave in an inconsistent and unintuitive  
> manner 
> 
>
> Key: SPARK-47320
> URL: https://issues.apache.org/jira/browse/SPARK-47320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.1
>Reporter: Asif
>Priority: Major
>
> The behaviour of Datasets involving self joins is unintuitive in terms of 
> when an AnalysisException is thrown due to ambiguity and when the query 
> works.
> I found situations where swapping the join order causes a query to throw 
> ambiguity-related exceptions that it otherwise passes. Some Datasets that 
> are unambiguous from the user's perspective result in an AnalysisException 
> being thrown.
> After testing and fixing a bug, I think the issue lies in inconsistency in 
> determining what constitutes ambiguous and what does not.
> There are two ways to look at ambiguity resolution:
> 1) The ExprId of attributes: this is an unintuitive approach, as Spark 
> users do not deal with ExprIds.
> 2) Column extraction from the Dataset using the df(col) API: this is the 
> user-visible, understandable point of view, so ambiguity should be 
> determined on this basis. Whatever is logically unambiguous from the 
> user's perspective (assuming it is logically correct) should also be what 
> Spark uses to decide unambiguity.
> For example:
> {quote}
> val df1 = Seq((1, 2)).toDF("a", "b")
> val df2 = Seq((1, 2)).toDF("aa", "bb")
> val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"),
>   df2("aa"), df1("b"))
> val df3 = df1Joindf2.join(df1, df1Joindf2("aa") === 
> df1("a")).select(df1("a"))
> {quote}
> From perspective #1, the above code should throw an ambiguity exception, 
> because the join condition and projection of the df3 dataframe contain 
> df1("a"), whose ExprId matches both df1Joindf2 and df1.
> But if we look at it from the perspective of the Dataset used to get the 
> column, which is the user's intent, the expectation is that df1("a") 
> should resolve to the Dataset df1 being joined, and not to df1Joindf2. If 
> the user had intended "a" from df1Joindf2, they would have used 
> df1Joindf2("a").
> So in this case current Spark throws an exception, as it resolves based on 
> #1.
> But by the above logic the DataFrame below should also throw an ambiguity 
> exception, yet it passes:
> {quote}
> val df1 = Seq((1, 2)).toDF("a", "b")
> val df2 = Seq((1, 2)).toDF("aa", "bb")
> val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"),
>   df2("aa"), df1("b"))
> df1Joindf2.join(df1, df1Joindf2("a") === df1("a"))
> {quote}
> The difference between the two cases is that the first has a select and 
> the second does not.
> This implies that in the first case the df1("a") in the projection causes 
> the ambiguity issue, while the same reference in the second case, used 
> only in the join condition, is considered unambiguous.
> IMHO, the ambiguity identification criteria should be based entirely, and 
> consistently, on #2.
> In DataFrameJoinTest and DataFrameSelfJoinTest, if we go by #2, some tests 
> currently considered ambiguous (by the #1 criteria) become unambiguous 
> under the #2 criteria.
> For example:
> {quote}
> test("SPARK-28344: fail ambiguous self join - column ref in join condition") {
>   val df1 = spark.range(3)
>   val df2 = df1.filter($"id" > 0)
>   withSQLConf(
>     SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> "true",
>     SQLConf.CROSS_JOINS_ENABLED.key -> "true") {
>     assertAmbiguousSelfJoin(df1.join(df2, df1("id") > df2("id")))
>   }
> }
> {quote}
> The above test should not throw an ambiguity exception, as df1("id") and 
> df2("id") are unambiguous from the Dataset perspective.






[jira] [Updated] (SPARK-47320) Datasets involving self joins behave in an inconsistent and unintuitive manner

2024-03-08 Thread Asif (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif updated SPARK-47320:
-
Description: 
The behaviour of Datasets involving self joins is unintuitive in terms of when 
an AnalysisException is thrown due to ambiguity and when the query works.

I found situations where swapping the join order causes a query to throw 
ambiguity-related exceptions that it otherwise passes. Some Datasets that are 
unambiguous from the user's perspective result in an AnalysisException being 
thrown.

After testing and fixing a bug, I think the issue lies in inconsistency in 
determining what constitutes ambiguous and what does not.

There are two ways to look at ambiguity resolution:

1) The ExprId of attributes: this is an unintuitive approach, as Spark users 
do not deal with ExprIds.

2) Column extraction from the Dataset using the df(col) API: this is the 
user-visible, understandable point of view, so ambiguity should be determined 
on this basis. Whatever is logically unambiguous from the user's perspective 
(assuming it is logically correct) should also be what Spark uses to decide 
unambiguity.

For example:
{quote}
val df1 = Seq((1, 2)).toDF("a", "b")
val df2 = Seq((1, 2)).toDF("aa", "bb")
val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"),
  df2("aa"), df1("b"))
val df3 = df1Joindf2.join(df1, df1Joindf2("aa") === df1("a")).select(df1("a"))
{quote}
From perspective #1, the above code should throw an ambiguity exception, 
because the join condition and projection of the df3 dataframe contain 
df1("a"), whose ExprId matches both df1Joindf2 and df1.

But if we look at it from the perspective of the Dataset used to get the 
column, which is the user's intent, the expectation is that df1("a") should 
resolve to the Dataset df1 being joined, and not to df1Joindf2. If the user 
had intended "a" from df1Joindf2, they would have used df1Joindf2("a").

So in this case current Spark throws an exception, as it resolves based on #1.

But by the above logic the DataFrame below should also throw an ambiguity 
exception, yet it passes:
{quote}
val df1 = Seq((1, 2)).toDF("a", "b")
val df2 = Seq((1, 2)).toDF("aa", "bb")
val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"),
  df2("aa"), df1("b"))

df1Joindf2.join(df1, df1Joindf2("a") === df1("a"))
{quote}
The difference between the two cases is that the first has a select and the 
second does not.

This implies that in the first case the df1("a") in the projection causes the 
ambiguity issue, while the same reference in the second case, used only in the 
join condition, is considered unambiguous.

IMHO, the ambiguity identification criteria should be based entirely, and 
consistently, on #2.

In DataFrameJoinTest and DataFrameSelfJoinTest, if we go by #2, some tests 
currently considered ambiguous (by the #1 criteria) become unambiguous under 
the #2 criteria.

For example:
{quote}
test("SPARK-28344: fail ambiguous self join - column ref in join condition") {
  val df1 = spark.range(3)
  val df2 = df1.filter($"id" > 0)

  withSQLConf(
    SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> "true",
    SQLConf.CROSS_JOINS_ENABLED.key -> "true") {
    assertAmbiguousSelfJoin(df1.join(df2, df1("id") > df2("id")))
  }
}
{quote}
The above test should not throw an ambiguity exception, as df1("id") and 
df2("id") are unambiguous from the Dataset perspective.

There is an existing test in DataFrameSelfJoinSuite:
{quote}
test("SPARK-28344: fail ambiguous self join - column ref in Project") {
  val df1 = spark.range(3)
  val df2 = df1.filter($"id" > 0)

  withSQLConf(
    SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> "false",
    SQLConf.CROSS_JOINS_ENABLED.key -> "true") {
    // `df2("id")` actually points to the column of `df1`.
    checkAnswer(df1.join(df2).select(df2("id")),
      Seq(0, 0, 1, 1, 2, 2).map(Row(_)))

    // Aliasing the dataframes and using qualified column names can fix an
    // ambiguous self-join.
    val aliasedDf1 = df1.alias("left")
    val aliasedDf2 = df2.as("right")
    checkAnswer(
      aliasedDf1.join(aliasedDf2).select($"right.id"),
      Seq(1, 1, 1, 2, 2, 2).map(Row(_)))
  }

  withSQLConf(
    SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> "true",
    SQLConf.CROSS_JOINS_ENABLED.key -> "true") {
    // Assertion1: existing
    assertAmbiguousSelfJoin(df1.join(df2).select(df2("id")))

    // Assertion2: added by me
    assertAmbiguousSelfJoin(df2.join(df1).select(df2("id")))
  }
}
{quote}
Here Assertion1 passes (an ambiguity exception is thrown), but Assertion2 
fails (no ambiguity exception is thrown). The only change is the join order.

Logically both assertions are invalid, in the sense that neither should throw 
an exception, since from the user's perspective there is no ambiguity.

  was:
The behaviour of Datase

[jira] [Updated] (SPARK-47320) Datasets involving self joins behave in an inconsistent and unintuitive manner

2024-03-08 Thread Asif (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif updated SPARK-47320:
-
Description: 
The behaviour of Datasets involving self joins is unintuitive in terms of when 
an AnalysisException is thrown due to ambiguity and when the query works.

I found situations where swapping the join order causes a query to throw 
ambiguity-related exceptions that it otherwise passes. Some Datasets that are 
unambiguous from the user's perspective result in an AnalysisException being 
thrown.

After testing and fixing a bug, I think the issue lies in inconsistency in 
determining what constitutes ambiguous and what does not.

There are two ways to look at ambiguity resolution:

1) The ExprId of attributes: this is an unintuitive approach, as Spark users 
do not deal with ExprIds.

2) Column extraction from the Dataset using the df(col) API: this is the 
user-visible, understandable point of view, so ambiguity should be determined 
on this basis. Whatever is logically unambiguous from the user's perspective 
(assuming it is logically correct) should also be what Spark uses to decide 
unambiguity.

For example:
{quote}
val df1 = Seq((1, 2)).toDF("a", "b")
val df2 = Seq((1, 2)).toDF("aa", "bb")
val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"),
  df2("aa"), df1("b"))
val df3 = df1Joindf2.join(df1, df1Joindf2("aa") === df1("a")).select(df1("a"))
{quote}
From perspective #1, the above code should throw an ambiguity exception, 
because the join condition and projection of the df3 dataframe contain 
df1("a"), whose ExprId matches both df1Joindf2 and df1.

But if we look at it from the perspective of the Dataset used to get the 
column, which is the user's intent, the expectation is that df1("a") should 
resolve to the Dataset df1 being joined, and not to df1Joindf2. If the user 
had intended "a" from df1Joindf2, they would have used df1Joindf2("a").

So in this case current Spark throws an exception, as it resolves based on #1.

But by the above logic the DataFrame below should also throw an ambiguity 
exception, yet it passes:
{quote}
val df1 = Seq((1, 2)).toDF("a", "b")
val df2 = Seq((1, 2)).toDF("aa", "bb")
val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"),
  df2("aa"), df1("b"))

df1Joindf2.join(df1, df1Joindf2("a") === df1("a"))
{quote}
The difference between the two cases is that the first has a select and the 
second does not.

This implies that in the first case the df1("a") in the projection causes the 
ambiguity issue, while the same reference in the second case, used only in the 
join condition, is considered unambiguous.

IMHO, the ambiguity identification criteria should be based entirely, and 
consistently, on #2.

In DataFrameJoinTest and DataFrameSelfJoinTest, if we go by #2, some tests 
currently considered ambiguous (by the #1 criteria) become unambiguous under 
the #2 criteria.

For example:
{quote}
test("SPARK-28344: fail ambiguous self join - column ref in join condition") {
  val df1 = spark.range(3)
  val df2 = df1.filter($"id" > 0)

  withSQLConf(
    SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> "true",
    SQLConf.CROSS_JOINS_ENABLED.key -> "true") {
    assertAmbiguousSelfJoin(df1.join(df2, df1("id") > df2("id")))
  }
}
{quote}
The above test should not throw an ambiguity exception, as df1("id") and 
df2("id") are unambiguous from the Dataset perspective.

There is an existing test in DataFrameSelfJoinSuite:
{quote}
test("SPARK-28344: fail ambiguous self join - column ref in Project") {
  val df1 = spark.range(3)
  val df2 = df1.filter($"id" > 0)

  withSQLConf(
    SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> "false",
    SQLConf.CROSS_JOINS_ENABLED.key -> "true") {
    // `df2("id")` actually points to the column of `df1`.
    checkAnswer(df1.join(df2).select(df2("id")),
      Seq(0, 0, 1, 1, 2, 2).map(Row(_)))

    // Aliasing the dataframes and using qualified column names can fix an
    // ambiguous self-join.
    val aliasedDf1 = df1.alias("left")
    val aliasedDf2 = df2.as("right")
    checkAnswer(
      aliasedDf1.join(aliasedDf2).select($"right.id"),
      Seq(1, 1, 1, 2, 2, 2).map(Row(_)))
  }

  withSQLConf(
    SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> "true",
    SQLConf.CROSS_JOINS_ENABLED.key -> "true") {
    // Assertion1: existing
    assertAmbiguousSelfJoin(df1.join(df2).select(df2("id")))

    // Assertion2: added by me
    assertAmbiguousSelfJoin(df2.join(df1).select(df2("id")))
  }
}
{quote}
Here Assertion1 passes (an ambiguity exception is thrown), but Assertion2 
fails (no ambiguity exception is thrown). The only change is the join order.

Logically both assertions are invalid, in the sense that neither should throw 
an exception, since from the user's perspective there is no ambiguity.

  was:
The behavi

[jira] [Created] (SPARK-47331) Serialization using case classes/primitives/POJO based on SQL encoder for Arbitrary State API v2.

2024-03-08 Thread Jing Zhan (Jira)
Jing Zhan created SPARK-47331:
-

 Summary: Serialization using case classes/primitives/POJO based on 
SQL encoder for Arbitrary State API v2. 
 Key: SPARK-47331
 URL: https://issues.apache.org/jira/browse/SPARK-47331
 Project: Spark
  Issue Type: Task
  Components: Structured Streaming
Affects Versions: 4.0.0
Reporter: Jing Zhan


In the new operator for arbitrary state-v2, we cannot rely on the 
session/encoder being available, since the initialization of the various state 
instances happens on the executors. Also, we can only support limited state 
types with the available encoders. Hence, for state serialization, we propose 
to serialize primitives/case classes/POJOs with the SQL encoder. Leveraging 
the SQL encoder can speed up serialization.
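
A rough sketch of the idea (UserState is a made-up case class; 
ExpressionEncoder is Spark's internal realization of the SQL encoder):

```scala
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

// Hypothetical state type: the SQL encoder is derived from the case class
// and round-trips values through Spark's InternalRow format on executors.
case class UserState(count: Long, lastSeen: String)

val enc = ExpressionEncoder[UserState]()
val toRow = enc.createSerializer()
val fromRow = enc.resolveAndBind().createDeserializer()

val row = toRow(UserState(1L, "2024-03-08")) // UserState -> InternalRow
val back = fromRow(row)                      // InternalRow -> UserState
```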






[jira] [Updated] (SPARK-47320) Datasets involving self joins behave in an inconsistent and unintuitive manner

2024-03-08 Thread Asif (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif updated SPARK-47320:
-
Description: 
The behaviour of Datasets involving self joins is unintuitive in terms of when an AnalysisException is thrown due to ambiguity and when the query works.

I found situations where swapping the join order causes a query to throw ambiguity-related exceptions when it otherwise passes. Some Datasets which are unambiguous from the user's perspective will result in an AnalysisException being thrown.

After testing and fixing a bug, I think the issue lies in an inconsistency in determining what constitutes ambiguous and what is unambiguous.

There are two ways to look at resolution regarding ambiguity:

1) ExprId of attributes: this is an unintuitive approach, as Spark users do not deal with ExprIds.

2) Column extraction from the Dataset using the df(col) API: this is the user-visible, understandable point of view, so determining ambiguity should be based on it. What is logically unambiguous from the user's perspective (assuming it is logically correct) should also be what Spark uses to decide unambiguity.

For example:
{quote}
val df1 = Seq((1, 2)).toDF("a", "b")
val df2 = Seq((1, 2)).toDF("aa", "bb")
val df1Joindf2 = df1.join(df2, df1("a") === df2("aa"))
  .select(df1("a"), df2("aa"), df1("b"))
val df3 = df1Joindf2.join(df1, df1Joindf2("aa") === df1("a")).select(df1("a"))
{quote}

From perspective #1, the above code should throw an ambiguity exception, because the join condition and the projection of the df3 dataframe contain df1("a"), whose exprId matches both df1Joindf2 and df1.

But if we look at it from the perspective of the Dataset used to get the column, which is the intent of the user, the expectation is that df1("a") should resolve to the Dataset df1 being joined, not to df1Joindf2. If the user had intended "a" from df1Joindf2, they would have used df1Joindf2("a").

So in this case current Spark throws an exception, as it resolves based on #1.

But by the above logic the Dataframe below should also throw an ambiguity exception, yet it passes:
{quote}
val df1 = Seq((1, 2)).toDF("a", "b")
val df2 = Seq((1, 2)).toDF("aa", "bb")
val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"),
  df2("aa"), df1("b"))

df1Joindf2.join(df1, df1Joindf2("a") === df1("a"))
{quote}

The difference between the two cases is that the first includes a select, while the second does not.

This implies that in the first case the df1("a") in the projection causes the ambiguity issue, while the same reference in the second case, used only in the join condition, is considered unambiguous.


IMHO, the ambiguity identification criteria should be based entirely on #2, and applied consistently.

In DataFrameJoinSuite and DataFrameSelfJoinSuite, if we go by #2, some of the tests currently considered ambiguous (under criterion #1) become unambiguous under criterion #2.

For example:
{quote}
test("SPARK-28344: fail ambiguous self join - column ref in join condition") {
  val df1 = spark.range(3)
  val df2 = df1.filter($"id" > 0)

  withSQLConf(
    SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> "true",
    SQLConf.CROSS_JOINS_ENABLED.key -> "true") {
    assertAmbiguousSelfJoin(df1.join(df2, df1("id") > df2("id")))
  }
}
{quote}
The above test should not throw an ambiguity exception, as df1("id") and df2("id") are unambiguous from the Dataset perspective.


There is an existing test in DataFrameSelfJoinSuite
```
test("SPARK-28344: fail ambiguous self join - column ref in Project") {
val df1 = spark.range(3)
val df2 = df1.filter($"id" > 0)

withSQLConf(
  SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> "false",
  SQLConf.CROSS_JOINS_ENABLED.key -> "true") {
  // `df2("id")` actually points to the column of `df1`.
  checkAnswer(df1.join(df2).select(df2("id")), Seq(0, 0, 1, 1, 2, 
2).map(Row(_)))

  // Alias the dataframe and use qualified column names can fix ambiguous 
self-join.
  val aliasedDf1 = df1.alias("left")
  val aliasedDf2 = df2.as("right")
  checkAnswer(
aliasedDf1.join(aliasedDf2).select($"right.id"),
Seq(1, 1, 1, 2, 2, 2).map(Row(_)))
}

withSQLConf(
  SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> "true",
  SQLConf.CROSS_JOINS_ENABLED.key -> "true") {
 
// Assertion1  : existing 
 assertAmbiguousSelfJoin(df1.join(df2).select(df2("id")))

  // Assertion2 :  added by me
  assertAmbiguousSelfJoin(df2.join(df1).select(df2("id")))
}
}
```

Here Assertion1 passes (an ambiguity exception is thrown), but Assertion2 fails (no ambiguity exception is thrown). The only change is the join order.

Logically both assertions are invalid, in the sense that neither should throw an exception, as from the user's perspective there is no ambiguity.
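As the existing test above shows, aliasing the Datasets and using qualified column names sidesteps the ambiguity under either criterion; a minimal standalone sketch of that workaround (alias names are mine, and cross-join support is assumed enabled, as in the test's withSQLConf):

{quote}
val df1 = spark.range(3)
val df2 = df1.filter($"id" > 0)

val left = df2.alias("l")
val right = df1.alias("r")

// Qualified names resolve by alias rather than ExprId, so the join order
// no longer matters.
left.join(right).select($"l.id").collect()  // ids 1 and 2, three times each
{quote}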

  was:
The behaviour of Datasets

[jira] [Updated] (SPARK-47320) Datasets involving self joins behave in an inconsistent and unintuitive manner

2024-03-08 Thread Asif (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif updated SPARK-47320:
-
Description: 
The behaviour of Datasets involving self joins is unintuitive in terms of when an AnalysisException is thrown due to ambiguity and when the query works.

I found situations where swapping the join order causes a query to throw ambiguity-related exceptions when it otherwise passes. Some Datasets which are unambiguous from the user's perspective will result in an AnalysisException being thrown.

After testing and fixing a bug, I think the issue lies in an inconsistency in determining what constitutes ambiguous and what is unambiguous.

There are two ways to look at resolution regarding ambiguity:

1) ExprId of attributes: this is an unintuitive approach, as Spark users do not deal with ExprIds.

2) Column extraction from the Dataset using the df(col) API: this is the user-visible, understandable point of view, so determining ambiguity should be based on it. What is logically unambiguous from the user's perspective (assuming it is logically correct) should also be what Spark uses to decide unambiguity.

For example:
{quote}
val df1 = Seq((1, 2)).toDF("a", "b")
val df2 = Seq((1, 2)).toDF("aa", "bb")
val df1Joindf2 = df1.join(df2, df1("a") === df2("aa"))
  .select(df1("a"), df2("aa"), df1("b"))
val df3 = df1Joindf2.join(df1, df1Joindf2("aa") === df1("a")).select(df1("a"))
{quote}

From perspective #1, the above code should throw an ambiguity exception, because the join condition and the projection of the df3 dataframe contain df1("a"), whose exprId matches both df1Joindf2 and df1.

But if we look at it from the perspective of the Dataset used to get the column, which is the intent of the user, the expectation is that df1("a") should resolve to the Dataset df1 being joined, not to df1Joindf2. If the user had intended "a" from df1Joindf2, they would have used df1Joindf2("a").

So in this case current Spark throws an exception, as it resolves based on #1.

But by the above logic the Dataframe below should also throw an ambiguity exception, yet it passes:
{quote}
val df1 = Seq((1, 2)).toDF("a", "b")
val df2 = Seq((1, 2)).toDF("aa", "bb")
val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"),
  df2("aa"), df1("b"))

df1Joindf2.join(df1, df1Joindf2("a") === df1("a"))
{quote}

The difference between the two cases is that the first includes a select, while the second does not.

This implies that in the first case the df1("a") in the projection causes the ambiguity issue, while the same reference in the second case, used only in the join condition, is considered unambiguous.


IMHO, the ambiguity identification criteria should be based entirely on #2, and applied consistently.

In DataFrameJoinSuite and DataFrameSelfJoinSuite, if we go by #2, some of the tests currently considered ambiguous (under criterion #1) become unambiguous under criterion #2.

For example:
{quote}
test("SPARK-28344: fail ambiguous self join - column ref in join condition") {
  val df1 = spark.range(3)
  val df2 = df1.filter($"id" > 0)

  withSQLConf(
    SQLConf.FAIL_AMBIGUOUS_SELF_JOIN_ENABLED.key -> "true",
    SQLConf.CROSS_JOINS_ENABLED.key -> "true") {
    assertAmbiguousSelfJoin(df1.join(df2, df1("id") > df2("id")))
  }
}
{quote}

The above test should not throw an ambiguity exception, as df1("id") and df2("id") are unambiguous from the Dataset perspective.


There is an existing test in DataFrameSelfJoinSuite
{quote}
test("SPARK-28344: fail ambiguous self join - column ref in Project") {
  val df1 = spark.range(3)
  val df2 = df1.filter($"id" > 0)

  // Assertion1: existing
  assertAmbiguousSelfJoin(df1.join(df2).select(df2("id")))

  // Assertion2: added by me
  assertAmbiguousSelfJoin(df2.join(df1).select(df2("id")))
}
{quote}

Here Assertion1 passes (an ambiguity exception is thrown), but Assertion2 fails (no ambiguity exception is thrown). The only change is the join order.

Logically both assertions are invalid, in the sense that neither should throw an exception, as from the user's perspective there is no ambiguity.

  was:
The behaviour of Datasets involving self joins behave in an unintuitive manner 
in terms when AnalysisException is thrown due to ambiguity and when it works.

Found situations where join order swapping causes query to throw Ambiguity 
related exceptions which otherwise passes.  Some of the Datasets which from 
user perspective are un-ambiguous will result in Analysis Exception getting 
thrown.

After testing and fixing a bug , I think the issue lies in inconsistency in 
determining what constitutes ambiguous and what is un-ambiguous.

There are two ways to look at resolution regarding ambiguity

1) ExprId of attributes : This is unintuitive approac

[jira] [Updated] (SPARK-47320) Datasets involving self joins behave in an inconsistent and unintuitive manner

2024-03-08 Thread Asif (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif updated SPARK-47320:
-
Description: 
The behaviour of Datasets involving self joins is unintuitive in terms of when an AnalysisException is thrown due to ambiguity and when the query works.

I found situations where swapping the join order causes a query to throw ambiguity-related exceptions when it otherwise passes. Some Datasets which are unambiguous from the user's perspective will result in an AnalysisException being thrown.

After testing and fixing a bug, I think the issue lies in an inconsistency in determining what constitutes ambiguous and what is unambiguous.

There are two ways to look at resolution regarding ambiguity:

1) ExprId of attributes: this is an unintuitive approach, as Spark users do not deal with ExprIds.

2) Column extraction from the Dataset using the df(col) API: this is the user-visible, understandable point of view, so determining ambiguity should be based on it. What is logically unambiguous from the user's perspective (assuming it is logically correct) should also be what Spark uses to decide unambiguity.

For example:
{quote}
val df1 = Seq((1, 2)).toDF("a", "b")
val df2 = Seq((1, 2)).toDF("aa", "bb")
val df1Joindf2 = df1.join(df2, df1("a") === df2("aa"))
  .select(df1("a"), df2("aa"), df1("b"))
val df3 = df1Joindf2.join(df1, df1Joindf2("aa") === df1("a")).select(df1("a"))
{quote}

From perspective #1, the above code should throw an ambiguity exception, because the join condition and the projection of the df3 dataframe contain df1("a"), whose exprId matches both df1Joindf2 and df1.

But if we look at it from the perspective of the Dataset used to get the column, which is the intent of the user, the expectation is that df1("a") should resolve to the Dataset df1 being joined, not to df1Joindf2. If the user had intended "a" from df1Joindf2, they would have used df1Joindf2("a").

So in this case current Spark throws an exception, as it resolves based on #1.

But by the above logic the Dataframe below should also throw an ambiguity exception, yet it passes:

{quote}
val df1 = Seq((1, 2)).toDF("a", "b")
val df2 = Seq((1, 2)).toDF("aa", "bb")
val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"),
  df2("aa"), df1("b"))

df1Joindf2.join(df1, df1Joindf2("a") === df1("a"))

{quote}

The difference between the two cases is that the first includes a select, while the second does not.

This implies that in the first case the df1("a") in the projection causes the ambiguity issue, while the same reference in the second case, used only in the join condition, is considered unambiguous.


IMHO, the ambiguity identification criteria should be based entirely on #2, and applied consistently.

In DataFrameJoinSuite and DataFrameSelfJoinSuite, if we go by #2, some of the tests currently considered ambiguous (under criterion #1) become unambiguous under criterion #2.


There is an existing test in DataFrameSelfJoinSuite
{quote}
test("SPARK-28344: fail ambiguous self join - column ref in Project") {
  val df1 = spark.range(3)
  val df2 = df1.filter($"id" > 0)

  // Assertion1: existing
  assertAmbiguousSelfJoin(df1.join(df2).select(df2("id")))

  // Assertion2: added by me
  assertAmbiguousSelfJoin(df2.join(df1).select(df2("id")))
}
{quote}

Here Assertion1 passes (an ambiguity exception is thrown), but Assertion2 fails (no ambiguity exception is thrown). The only change is the join order.

Logically both assertions are invalid, in the sense that neither should throw an exception, as from the user's perspective there is no ambiguity.

  was:
The behaviour of Datasets involving self joins behave in an unintuitive manner 
in terms when AnalysisException is thrown due to ambiguity and when it works.

Found situations where join order swapping causes query to throw Ambiguity 
related exceptions which otherwise passes.  Some of the Datasets which from 
user perspective are un-ambiguous will result in Analysis Exception getting 
thrown.

After testing and fixing a bug , I think the issue lies in inconsistency in 
determining what constitutes ambiguous and what is un-ambiguous.

There are two ways to look at resolution regarding ambiguity

1) ExprId of attributes : This is unintuitive approach as spark users do not 
bother with the ExprIds

2) Column Extraction from the Dataset using df(col) api : Which is the user 
visible/understandable Point of View.  So determining ambiguity should be based 
on this. What is Logically unambiguous from users perspective ( assuming its is 
logically correct) , should also be the basis of spark product, to decide on 
un-ambiguity.

For Example:
{quote} 
 val df1 = Seq((1, 2)).toDF("a", "b")
  val df2 = Seq((1, 2)).toDF("aa", "bb")
  val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"),
df2("aa"), df1("b"))
  val df3 = df1Joindf2.join(df1, df

[jira] [Updated] (SPARK-47331) Serialization using case classes/primitives/POJO based on SQL encoder for Arbitrary State API v2.

2024-03-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47331:
---
Labels: pull-request-available  (was: )

> Serialization using case classes/primitives/POJO based on SQL encoder for 
> Arbitrary State API v2. 
> --
>
> Key: SPARK-47331
> URL: https://issues.apache.org/jira/browse/SPARK-47331
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Jing Zhan
>Priority: Major
>  Labels: pull-request-available
>
> In the new operator for arbitrary state-v2, we cannot rely on the 
> session/encoder being available, since the initialization of the various 
> state instances happens on the executors. Also, we can only support limited 
> state types with the available encoders. Hence, for state serialization, we 
> propose to serialize primitives/case classes/POJOs with the SQL encoder. 
> Leveraging the SQL encoder can speed up serialization.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47332) StreamingPythonRunner doesn't need redundant logic for starting python process

2024-03-08 Thread Wei Liu (Jira)
Wei Liu created SPARK-47332:
---

 Summary: StreamingPythonRunner doesn't need redundant logic for 
starting python process
 Key: SPARK-47332
 URL: https://issues.apache.org/jira/browse/SPARK-47332
 Project: Spark
  Issue Type: New Feature
  Components: Connect, SS, Structured Streaming
Affects Versions: 4.0.0
Reporter: Wei Liu


https://github.com/apache/spark/pull/45023#discussion_r1516609093



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47320) Datasets involving self joins behave in an inconsistent and unintuitive manner

2024-03-08 Thread Asif (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif updated SPARK-47320:
-
Labels: pull-request-available  (was: )

> Datasets involving self joins behave in an inconsistent and unintuitive  
> manner 
> 
>
> Key: SPARK-47320
> URL: https://issues.apache.org/jira/browse/SPARK-47320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.1
>Reporter: Asif
>Priority: Major
>  Labels: pull-request-available
>
> The behaviour of Datasets involving self joins is unintuitive in terms of 
> when an AnalysisException is thrown due to ambiguity and when the query works.
> I found situations where swapping the join order causes a query to throw 
> ambiguity-related exceptions when it otherwise passes. Some Datasets which 
> are unambiguous from the user's perspective will result in an 
> AnalysisException being thrown.
> After testing and fixing a bug, I think the issue lies in an inconsistency 
> in determining what constitutes ambiguous and what is unambiguous.
> There are two ways to look at resolution regarding ambiguity:
> 1) ExprId of attributes: this is an unintuitive approach, as Spark users do 
> not deal with ExprIds.
> 2) Column extraction from the Dataset using the df(col) API: this is the 
> user-visible, understandable point of view, so determining ambiguity should 
> be based on it. What is logically unambiguous from the user's perspective 
> (assuming it is logically correct) should also be what Spark uses to decide 
> unambiguity.
> For Example:
> {quote} 
>  val df1 = Seq((1, 2)).toDF("a", "b")
>   val df2 = Seq((1, 2)).toDF("aa", "bb")
>   val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"),
> df2("aa"), df1("b"))
>   val df3 = df1Joindf2.join(df1, df1Joindf2("aa") === 
> df1("a")).select(df1("a"))
> {quote}
> From perspective #1, the above code should throw an ambiguity exception, 
> because the join condition and the projection of the df3 dataframe contain 
> df1("a"), whose exprId matches both df1Joindf2 and df1.
> But if we look at it from the perspective of the Dataset used to get the 
> column, which is the intent of the user, the expectation is that df1("a") 
> should resolve to the Dataset df1 being joined, not to df1Joindf2. If the 
> user had intended "a" from df1Joindf2, they would have used df1Joindf2("a").
> So in this case current Spark throws an exception, as it resolves based on #1.
> But by the above logic the Dataframe below should also throw an ambiguity 
> exception, yet it passes:
> {quote}
> val df1 = Seq((1, 2)).toDF("a", "b")
> val df2 = Seq((1, 2)).toDF("aa", "bb")
> val df1Joindf2 = df1.join(df2, df1("a") === df2("aa")).select(df1("a"),
>   df2("aa"), df1("b"))
> df1Joindf2.join(df1, df1Joindf2("a") === df1("a"))
> {quote}
> The difference between the two cases is that the first includes a select, 
> while the second does not.
> This implies that in the first case the df1("a") in the projection causes 
> the ambiguity issue, while the same reference in the second case, used only 
> in the join condition, is considered unambiguous.
> IMHO, the ambiguity identification criteria should be based entirely on #2, 
> and applied consistently.
> In DataFrameJoinSuite and DataFrameSelfJoinSuite, if we go by #2, some of 
> the tests currently considered ambiguous (under criterion #1) become 
> unambiguous under criterion #2.
> There is an existing test in DataFrameSelfJoinSuite
> {quote}
> test("SPARK-28344: fail ambiguous self join - column ref in Project") {
>   val df1 = spark.range(3)
>   val df2 = df1.filter($"id" > 0)
>
>   // Assertion1: existing
>   assertAmbiguousSelfJoin(df1.join(df2).select(df2("id")))
>
>   // Assertion2: added by me
>   assertAmbiguousSelfJoin(df2.join(df1).select(df2("id")))
> }
> {quote}
> Here Assertion1 passes (an ambiguity exception is thrown), but Assertion2 
> fails (no ambiguity exception is thrown). The only change is the join order.
> Logically both assertions are invalid, in the sense that neither should 
> throw an exception, as from the user's perspective there is no ambiguity.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org