[jira] [Updated] (SPARK-49771) Improve Pandas Scalar Iter UDF error when output rows exceed input rows

2024-09-24 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-49771:
-
Summary: Improve Pandas Scalar Iter UDF error when output rows exceed input 
rows  (was: Improve Pandas Iter UDF error when output rows exceed input rows)

> Improve Pandas Scalar Iter UDF error when output rows exceed input rows
> ---
>
> Key: SPARK-49771
> URL: https://issues.apache.org/jira/browse/SPARK-49771
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
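The contract behind this ticket can be sketched in plain Python: a Scalar Iterator UDF may yield at most as many rows as it receives, and Spark fails the query when it over-produces. This is an illustrative stand-in (the names `enforce_output_rows` and `bad_udf` are hypothetical, and plain lists stand in for pandas Series), not Spark's actual check:

```python
# Illustrative sketch (not Spark's implementation): wrap a scalar-iter UDF's
# output and fail fast with a descriptive error once the cumulative output
# row count exceeds the input row count.

def enforce_output_rows(udf_iter, input_batch_sizes):
    """Yield UDF output batches, raising if they exceed the input row count."""
    total_in = sum(input_batch_sizes)
    total_out = 0
    for batch in udf_iter:
        total_out += len(batch)
        if total_out > total_in:
            raise RuntimeError(
                f"Pandas Scalar Iter UDF produced {total_out} rows, "
                f"but only {total_in} input rows were provided"
            )
        yield batch

# A misbehaving UDF that yields one extra batch beyond its input:
def bad_udf(batches):
    for b in batches:
        yield b
    yield [0]  # extra rows beyond the input

batches = [[1, 2], [3, 4]]
out = enforce_output_rows(bad_udf(iter(batches)), [len(b) for b in batches])
```

Consuming `out` raises once the extra batch arrives; the improvement tracked here is making that error message clear about the row counts involved.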




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-49771) Improve Pandas Iter UDF error when output rows exceed input rows

2024-09-24 Thread Allison Wang (Jira)
Allison Wang created SPARK-49771:


 Summary: Improve Pandas Iter UDF error when output rows exceed 
input rows
 Key: SPARK-49771
 URL: https://issues.apache.org/jira/browse/SPARK-49771
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang









[jira] [Resolved] (SPARK-48999) [SS] Divide PythonStreamingDataSourceSimpleSuite

2024-07-26 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang resolved SPARK-48999.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47479
[https://github.com/apache/spark/pull/47479]

> [SS] Divide PythonStreamingDataSourceSimpleSuite
> 
>
> Key: SPARK-48999
> URL: https://issues.apache.org/jira/browse/SPARK-48999
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 3.4.3
>Reporter: Siying Dong
>Assignee: Siying Dong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> PythonStreamingDataSourceSimpleSuite runs too long. Divide it into several 
> suites.






[jira] [Assigned] (SPARK-48999) [SS] Divide PythonStreamingDataSourceSimpleSuite

2024-07-26 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang reassigned SPARK-48999:


Assignee: Siying Dong

> [SS] Divide PythonStreamingDataSourceSimpleSuite
> 
>
> Key: SPARK-48999
> URL: https://issues.apache.org/jira/browse/SPARK-48999
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 3.4.3
>Reporter: Siying Dong
>Assignee: Siying Dong
>Priority: Major
>  Labels: pull-request-available
>
> PythonStreamingDataSourceSimpleSuite runs too long. Divide it into several 
> suites.






[jira] [Created] (SPARK-48938) Improve error message when registering UDTFs

2024-07-18 Thread Allison Wang (Jira)
Allison Wang created SPARK-48938:


 Summary: Improve error message when registering UDTFs
 Key: SPARK-48938
 URL: https://issues.apache.org/jira/browse/SPARK-48938
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Improve the error message when registering Python UDTFs.






[jira] [Created] (SPARK-48825) Unify the 'See Also' section formatting across PySpark docstrings

2024-07-05 Thread Allison Wang (Jira)
Allison Wang created SPARK-48825:


 Summary: Unify the 'See Also' section formatting across PySpark 
docstrings
 Key: SPARK-48825
 URL: https://issues.apache.org/jira/browse/SPARK-48825
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Unify the 'See Also' section formatting across PySpark docstrings to make them 
consistent.






[jira] [Created] (SPARK-48785) Add a simple data source example in the user guide

2024-07-02 Thread Allison Wang (Jira)
Allison Wang created SPARK-48785:


 Summary: Add a simple data source example in the user guide
 Key: SPARK-48785
 URL: https://issues.apache.org/jira/browse/SPARK-48785
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang









[jira] [Created] (SPARK-48783) Update the table-valued function documentation

2024-07-02 Thread Allison Wang (Jira)
Allison Wang created SPARK-48783:


 Summary: Update the table-valued function documentation
 Key: SPARK-48783
 URL: https://issues.apache.org/jira/browse/SPARK-48783
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation
Affects Versions: 4.0.0
Reporter: Allison Wang









[jira] [Updated] (SPARK-48479) Support creating temp SQL functions in parser

2024-06-26 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-48479:
-
Summary: Support creating temp SQL functions in parser  (was: Support 
creating SQL functions in parser)

> Support creating temp SQL functions in parser
> -
>
> Key: SPARK-48479
> URL: https://issues.apache.org/jira/browse/SPARK-48479
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Add Spark SQL parser for creating SQL functions.






[jira] [Created] (SPARK-48730) Support creating persistent SQL UDFs in parser

2024-06-26 Thread Allison Wang (Jira)
Allison Wang created SPARK-48730:


 Summary: Support creating persistent SQL UDFs in parser
 Key: SPARK-48730
 URL: https://issues.apache.org/jira/browse/SPARK-48730
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Allison Wang









[jira] [Created] (SPARK-48729) Add a UserDefinedFunction interface to represent a SQL function

2024-06-26 Thread Allison Wang (Jira)
Allison Wang created SPARK-48729:


 Summary: Add a UserDefinedFunction interface to represent a SQL 
function
 Key: SPARK-48729
 URL: https://issues.apache.org/jira/browse/SPARK-48729
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Allison Wang









[jira] [Created] (SPARK-48653) Fix Python data source error class references

2024-06-18 Thread Allison Wang (Jira)
Allison Wang created SPARK-48653:


 Summary: Fix Python data source error class references
 Key: SPARK-48653
 URL: https://issues.apache.org/jira/browse/SPARK-48653
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Fix invalid error class references.






[jira] [Created] (SPARK-48497) Add user guide for batch data source write API

2024-05-31 Thread Allison Wang (Jira)
Allison Wang created SPARK-48497:


 Summary: Add user guide for batch data source write API
 Key: SPARK-48497
 URL: https://issues.apache.org/jira/browse/SPARK-48497
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Add examples for batch data source write.






[jira] [Updated] (SPARK-48479) Support creating SQL functions in parser

2024-05-30 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-48479:
-
Summary: Support creating SQL functions in parser  (was: Support ccreating 
SQL functions in parser)

> Support creating SQL functions in parser
> 
>
> Key: SPARK-48479
> URL: https://issues.apache.org/jira/browse/SPARK-48479
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> Add Spark SQL parser for creating SQL functions.






[jira] [Updated] (SPARK-48479) Support ccreating SQL functions in parser

2024-05-30 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-48479:
-
Summary: Support ccreating SQL functions in parser  (was: Add support for 
creating SQL functions in parser)

> Support ccreating SQL functions in parser
> -
>
> Key: SPARK-48479
> URL: https://issues.apache.org/jira/browse/SPARK-48479
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> Add Spark SQL parser for creating SQL functions.






[jira] [Created] (SPARK-48479) Add support for creating SQL functions in parser

2024-05-30 Thread Allison Wang (Jira)
Allison Wang created SPARK-48479:


 Summary: Add support for creating SQL functions in parser
 Key: SPARK-48479
 URL: https://issues.apache.org/jira/browse/SPARK-48479
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Allison Wang


Add Spark SQL parser for creating SQL functions.






[jira] [Created] (SPARK-48205) Remove the private[sql] modifier for Python data sources

2024-05-08 Thread Allison Wang (Jira)
Allison Wang created SPARK-48205:


 Summary: Remove the private[sql] modifier for Python data sources
 Key: SPARK-48205
 URL: https://issues.apache.org/jira/browse/SPARK-48205
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Remove the modifier to make Python data sources consistent with UDFs and UDTFs.






[jira] [Created] (SPARK-48064) Improve error messages for routine related errors

2024-04-30 Thread Allison Wang (Jira)
Allison Wang created SPARK-48064:


 Summary: Improve error messages for routine related errors
 Key: SPARK-48064
 URL: https://issues.apache.org/jira/browse/SPARK-48064
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Allison Wang









[jira] [Created] (SPARK-48014) Change the makeFromJava error in EvaluatePython to a user-facing error

2024-04-26 Thread Allison Wang (Jira)
Allison Wang created SPARK-48014:


 Summary: Change the makeFromJava error in EvaluatePython to a 
user-facing error
 Key: SPARK-48014
 URL: https://issues.apache.org/jira/browse/SPARK-48014
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Allison Wang









[jira] [Created] (SPARK-47921) Fix ExecuteJobTag creation in ExecuteHolder

2024-04-19 Thread Allison Wang (Jira)
Allison Wang created SPARK-47921:


 Summary: Fix ExecuteJobTag creation in ExecuteHolder
 Key: SPARK-47921
 URL: https://issues.apache.org/jira/browse/SPARK-47921
 Project: Spark
  Issue Type: Bug
  Components: Connect
Affects Versions: 4.0.0
Reporter: Allison Wang









[jira] [Updated] (SPARK-47367) Support Python data source API with Spark Connect

2024-03-12 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-47367:
-
Summary: Support Python data source API with Spark Connect  (was: Support 
Python data source API in Spark Connect)

> Support Python data source API with Spark Connect
> -
>
> Key: SPARK-47367
> URL: https://issues.apache.org/jira/browse/SPARK-47367
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>







[jira] [Updated] (SPARK-47367) Support Python data source API in Spark Connect

2024-03-12 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-47367:
-
Summary: Support Python data source API in Spark Connect  (was: Support 
Python data source API with Spark Connect)

> Support Python data source API in Spark Connect
> ---
>
> Key: SPARK-47367
> URL: https://issues.apache.org/jira/browse/SPARK-47367
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>







[jira] [Created] (SPARK-47367) Support Python data source API with Spark Connect

2024-03-12 Thread Allison Wang (Jira)
Allison Wang created SPARK-47367:


 Summary: Support Python data source API with Spark Connect
 Key: SPARK-47367
 URL: https://issues.apache.org/jira/browse/SPARK-47367
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang









[jira] [Created] (SPARK-47346) Make daemon mode configurable when creating Python workers

2024-03-11 Thread Allison Wang (Jira)
Allison Wang created SPARK-47346:


 Summary: Make daemon mode configurable when creating Python workers
 Key: SPARK-47346
 URL: https://issues.apache.org/jira/browse/SPARK-47346
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang









[jira] [Updated] (SPARK-46973) Skip V2 table lookup when a table is in V1 table cache

2024-03-02 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-46973:
-
Description: Improve v2 table lookup performance when a table is already in 
the v1 table cache.

> Skip V2 table lookup when a table is in V1 table cache
> --
>
> Key: SPARK-46973
> URL: https://issues.apache.org/jira/browse/SPARK-46973
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>  Labels: pull-request-available
>
> Improve v2 table lookup performance when a table is already in the v1 table 
> cache.
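The optimization this ticket describes can be sketched as a simple short-circuit: on a V1 cache hit, return the cached entry directly and never probe the V2 path. The names below (`v1_cache`, `lookup_v2`, `resolve_table`) are illustrative, not Spark's internal API:

```python
# Hypothetical sketch of the lookup optimization: if a table is already in
# the V1 table cache, skip the (slower) V2 table lookup entirely.

v1_cache = {}

def lookup_v2(name):
    # In this sketch the V2 path must never fire for cached V1 tables.
    raise AssertionError("V2 lookup should be skipped for cached V1 tables")

def resolve_table(name):
    if name in v1_cache:          # fast path: V1 cache hit
        return v1_cache[name]
    return lookup_v2(name)        # only fall through on a cache miss

v1_cache["sales"] = {"provider": "parquet"}
table = resolve_table("sales")    # resolved without touching the V2 path
```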






[jira] [Updated] (SPARK-46973) Skip V2 table lookup when a table is in V1 table cache

2024-03-02 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-46973:
-
Summary: Skip V2 table lookup when a table is in V1 table cache  (was: Add 
table cache for V2 tables)

> Skip V2 table lookup when a table is in V1 table cache
> --
>
> Key: SPARK-46973
> URL: https://issues.apache.org/jira/browse/SPARK-46973
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-46973) Add table cache for V2 tables

2024-02-04 Thread Allison Wang (Jira)
Allison Wang created SPARK-46973:


 Summary: Add table cache for V2 tables
 Key: SPARK-46973
 URL: https://issues.apache.org/jira/browse/SPARK-46973
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Allison Wang









[jira] [Created] (SPARK-46818) Improve error messages for range with non-foldable input

2024-01-23 Thread Allison Wang (Jira)
Allison Wang created SPARK-46818:


 Summary: Improve error messages for range with non-foldable input
 Key: SPARK-46818
 URL: https://issues.apache.org/jira/browse/SPARK-46818
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Allison Wang









[jira] [Created] (SPARK-46618) Improve error messages for DATA_SOURCE_NOT_FOUND error

2024-01-08 Thread Allison Wang (Jira)
Allison Wang created SPARK-46618:


 Summary: Improve error messages for DATA_SOURCE_NOT_FOUND error
 Key: SPARK-46618
 URL: https://issues.apache.org/jira/browse/SPARK-46618
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Allison Wang


Improve the error messages for the DATA_SOURCE_NOT_FOUND error.






[jira] [Created] (SPARK-46616) Disallow re-registration of statically registered data sources

2024-01-07 Thread Allison Wang (Jira)
Allison Wang created SPARK-46616:


 Summary: Disallow re-registration of statically registered data 
sources
 Key: SPARK-46616
 URL: https://issues.apache.org/jira/browse/SPARK-46616
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


This is a follow-up to SPARK-46522. Currently, Spark allows the 
re-registration of both statically and dynamically registered Python data 
sources. However, for (statically) registered Java/Scala data sources, Spark 
currently throws an exception if users try to register a data source with the 
same name.

We should make this behavior consistent: either allow re-registration for all 
statically loaded data sources, or disallow it for all of them.






[jira] [Created] (SPARK-46568) Python data source options should be a case insensitive dictionary

2024-01-02 Thread Allison Wang (Jira)
Allison Wang created SPARK-46568:


 Summary: Python data source options should be a case insensitive 
dictionary
 Key: SPARK-46568
 URL: https://issues.apache.org/jira/browse/SPARK-46568
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Data source options are stored as a `CaseInsensitiveStringMap` in Scala; 
however, the behavior is inconsistent in Python:
{code:python}
class MyDataSource(DataSource):
    def __init__(self, options):
        self.api_key = options.get("API_KEY")  # <- This is None

spark.read.format(..).option("API_KEY", my_key).load(...){code}
Currently, `options` will not contain "API_KEY" because all keys are converted 
to lowercase on the Scala side. This can be confusing to users.
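The behavior this ticket asks for can be sketched as a dictionary that normalizes keys to lowercase on both write and read, so `options.get("API_KEY")` and `options.get("api_key")` return the same value. A minimal sketch (the class name is illustrative, not the eventual PySpark API):

```python
# Minimal case-insensitive dict sketch: all keys are lowercased on insert
# and on lookup, matching Scala's CaseInsensitiveStringMap semantics.

class CaseInsensitiveDict(dict):
    def __init__(self, data=None):
        super().__init__()
        for k, v in (data or {}).items():
            self[k] = v

    def __setitem__(self, key, value):
        super().__setitem__(key.lower(), value)

    def __getitem__(self, key):
        return super().__getitem__(key.lower())

    def get(self, key, default=None):
        return super().get(key.lower(), default)

    def __contains__(self, key):
        return super().__contains__(key.lower())

# Keys arrive lowercased from the Scala side; lookups in any case still work.
options = CaseInsensitiveDict({"api_key": "secret"})
```

With such a dict, the `options.get("API_KEY")` call in the snippet above would return the value instead of `None`.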






[jira] [Created] (SPARK-46565) Improve Python data source error classes and messages

2024-01-02 Thread Allison Wang (Jira)
Allison Wang created SPARK-46565:


 Summary: Improve Python data source error classes and messages
 Key: SPARK-46565
 URL: https://issues.apache.org/jira/browse/SPARK-46565
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang









[jira] [Updated] (SPARK-46540) Respect column names when Python data source read function outputs named Row objects

2023-12-28 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-46540:
-
Summary: Respect column names when Python data source read function outputs 
named Row objects  (was: Respect named arguments when Python data source read 
function outputs Row objects)

> Respect column names when Python data source read function outputs named Row 
> objects
> 
>
> Key: SPARK-46540
> URL: https://issues.apache.org/jira/browse/SPARK-46540
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>







[jira] [Created] (SPARK-46540) Respects named arguments when Python data source read function outputs Row objects

2023-12-28 Thread Allison Wang (Jira)
Allison Wang created SPARK-46540:


 Summary: Respects named arguments when Python data source read 
function outputs Row objects
 Key: SPARK-46540
 URL: https://issues.apache.org/jira/browse/SPARK-46540
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang









[jira] [Updated] (SPARK-46540) Respect named arguments when Python data source read function outputs Row objects

2023-12-28 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-46540:
-
Summary: Respect named arguments when Python data source read function 
outputs Row objects  (was: Respects named arguments when Python data source 
read function outputs Row objects)

> Respect named arguments when Python data source read function outputs Row 
> objects
> -
>
> Key: SPARK-46540
> URL: https://issues.apache.org/jira/browse/SPARK-46540
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>







[jira] [Created] (SPARK-46522) Block Python data source registration with name conflicts

2023-12-26 Thread Allison Wang (Jira)
Allison Wang created SPARK-46522:


 Summary: Block Python data source registration with name conflicts
 Key: SPARK-46522
 URL: https://issues.apache.org/jira/browse/SPARK-46522
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Users should not be allowed to register Python data sources with names that are 
the same as built-in or existing Scala/Java data sources.






[jira] [Created] (SPARK-46520) Support overwrite mode for Python data source write

2023-12-26 Thread Allison Wang (Jira)
Allison Wang created SPARK-46520:


 Summary: Support overwrite mode for Python data source write
 Key: SPARK-46520
 URL: https://issues.apache.org/jira/browse/SPARK-46520
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Support the `overwrite` save mode for Python data source writes.






[jira] [Created] (SPARK-46452) Add a new API in DSv2 DataWriter to write an iterator of records

2023-12-18 Thread Allison Wang (Jira)
Allison Wang created SPARK-46452:


 Summary: Add a new API in DSv2 DataWriter to write an iterator of 
records
 Key: SPARK-46452
 URL: https://issues.apache.org/jira/browse/SPARK-46452
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Allison Wang


Add a new API that takes an iterator of records.






[jira] [Created] (SPARK-46375) Add documentation for Python data source API

2023-12-11 Thread Allison Wang (Jira)
Allison Wang created SPARK-46375:


 Summary: Add documentation for Python data source API
 Key: SPARK-46375
 URL: https://issues.apache.org/jira/browse/SPARK-46375
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Add documentation (a user guide) for the Python data source API.

Note that the documentation should clarify the required dependency: pyarrow.






[jira] [Created] (SPARK-46290) Change saveMode to overwrite for DataSourceWriter constructor

2023-12-06 Thread Allison Wang (Jira)
Allison Wang created SPARK-46290:


 Summary: Change saveMode to overwrite for DataSourceWriter 
constructor
 Key: SPARK-46290
 URL: https://issues.apache.org/jira/browse/SPARK-46290
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang









[jira] [Created] (SPARK-46273) Support INSERT INTO/OVERWRITE using DSv2 sources

2023-12-05 Thread Allison Wang (Jira)
Allison Wang created SPARK-46273:


 Summary: Support INSERT INTO/OVERWRITE using DSv2 sources
 Key: SPARK-46273
 URL: https://issues.apache.org/jira/browse/SPARK-46273
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Allison Wang









[jira] [Created] (SPARK-46272) Support CTAS using DSv2 sources

2023-12-05 Thread Allison Wang (Jira)
Allison Wang created SPARK-46272:


 Summary: Support CTAS using DSv2 sources
 Key: SPARK-46272
 URL: https://issues.apache.org/jira/browse/SPARK-46272
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Allison Wang









[jira] [Created] (SPARK-46253) Plan Python data source read using mapInArrow

2023-12-04 Thread Allison Wang (Jira)
Allison Wang created SPARK-46253:


 Summary: Plan Python data source read using mapInArrow
 Key: SPARK-46253
 URL: https://issues.apache.org/jira/browse/SPARK-46253
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Instead of using a regular Python UDTF, we can use an Arrow UDF and plan the 
data source read using the mapInArrow operator.
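The shape of such a plan can be sketched without Spark: the data source's reader becomes a function from an iterator of Arrow record batches to an iterator of record batches, which is exactly the signature `DataFrame.mapInArrow` expects. Plain lists stand in for `pyarrow.RecordBatch` below to keep the sketch dependency-free; the helper names (`plan_read`, `my_reader`) are illustrative:

```python
# Sketch of the mapInArrow-style plan: apply the data source's reader to each
# input batch (one per partition) and stream the output batches back out.

def plan_read(partitions, read_func):
    """Yield output batches by applying the reader to each input batch."""
    def map_in_arrow(batch_iter):
        for batch in batch_iter:
            yield from read_func(batch)
    return map_in_arrow(iter(partitions))

def my_reader(partition):
    # A toy reader that emits one output batch per partition,
    # doubling each value.
    yield [v * 2 for v in partition]

result = list(plan_read([[1, 2], [3]], my_reader))
```

Because everything stays in batch form end to end, this avoids the per-row conversion overhead of planning the read through a regular Python UDTF.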






[jira] [Updated] (SPARK-46057) Support SQL user-defined functions

2023-11-22 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-46057:
-
Description: This is an umbrella ticket to support SQL user-defined 
functions.  (was: This is an umbrella ticket to support SQL user-defined 
functions in Spark.)

> Support SQL user-defined functions
> --
>
> Key: SPARK-46057
> URL: https://issues.apache.org/jira/browse/SPARK-46057
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> This is an umbrella ticket to support SQL user-defined functions.






[jira] [Created] (SPARK-46057) Support SQL user-defined functions

2023-11-22 Thread Allison Wang (Jira)
Allison Wang created SPARK-46057:


 Summary: Support SQL user-defined functions
 Key: SPARK-46057
 URL: https://issues.apache.org/jira/browse/SPARK-46057
 Project: Spark
  Issue Type: Umbrella
  Components: SQL
Affects Versions: 4.0.0
Reporter: Allison Wang


This is an umbrella ticket to support SQL user-defined functions in Spark.






[jira] [Created] (SPARK-46043) Support create table using DSv2 sources

2023-11-21 Thread Allison Wang (Jira)
Allison Wang created SPARK-46043:


 Summary: Support create table using DSv2 sources
 Key: SPARK-46043
 URL: https://issues.apache.org/jira/browse/SPARK-46043
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Allison Wang


Support CREATE TABLE ... USING DSv2 sources.

 






[jira] [Created] (SPARK-46013) Improve basic datasource examples

2023-11-20 Thread Allison Wang (Jira)
Allison Wang created SPARK-46013:


 Summary: Improve basic datasource examples
 Key: SPARK-46013
 URL: https://issues.apache.org/jira/browse/SPARK-46013
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


We should improve the Python examples on this page: 
[https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html]
 (basic_datasource_examples.py)

 






[jira] [Created] (SPARK-45940) Add InputPartition to DataSourceReader interface

2023-11-15 Thread Allison Wang (Jira)
Allison Wang created SPARK-45940:


 Summary: Add InputPartition to DataSourceReader interface
 Key: SPARK-45940
 URL: https://issues.apache.org/jira/browse/SPARK-45940
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Add an InputPartition class and make the `partitions` method return a list of 
input partitions.
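A minimal sketch of the proposed shape, using plain Python stand-ins (class and method names are assumptions for illustration, not the final PySpark API): `partitions()` returns a list of InputPartition instances, and `read()` consumes one of them.

```python
class InputPartition:
    # Value object describing one unit of parallel work.
    def __init__(self, value):
        self.value = value

class RangeReader:
    def __init__(self, num_partitions):
        self.num_partitions = num_partitions

    def partitions(self):
        # Return a list of input partitions, as the ticket proposes.
        return [InputPartition(i) for i in range(self.num_partitions)]

    def read(self, partition):
        # Yield rows for a single partition.
        yield (partition.value, partition.value * 10)

reader = RangeReader(3)
rows = [row for p in reader.partitions() for row in reader.read(p)]
```

Each InputPartition would be serialized to an executor, which then calls `read` on just that partition.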






[jira] [Commented] (SPARK-45861) Add user guide for dataframe creation

2023-11-15 Thread Allison Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786487#comment-17786487
 ] 

Allison Wang commented on SPARK-45861:
--

[~panbingkun] again, thanks for working on this. Let me give you more details.

When people search Google for something like "spark create dataframe", there 
are many results, one of them being the PySpark documentation for 
createDataFrame.

But there are many other ways to create a dataframe, for example from various 
data sources (CSV, JDBC, Parquet, etc.), from a pandas DataFrame, from 
`spark.sql`, etc.

We want to create a new documentation page under `{*}User Guides{*}` to explain 
all the ways people can create a Spark DataFrame. It differs from the 
quickstart in that the user guide will provide more comprehensive examples.

Feel free to take a look at the results when you search "spark create 
dataframe" or even "create dataframe" to get more inspiration.

cc [~afolting] [~smilegator]

> Add user guide for dataframe creation
> -
>
> Key: SPARK-45861
> URL: https://issues.apache.org/jira/browse/SPARK-45861
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> Add a simple user guide for data frame creation.
> This user guide should cover the following APIs:
>  # spark.createDataFrame
>  # spark.read.format(...) (can be CSV, JSON, Parquet, etc.)






[jira] [Created] (SPARK-45931) Refine docstring of `mapInPandas`

2023-11-14 Thread Allison Wang (Jira)
Allison Wang created SPARK-45931:


 Summary: Refine docstring of `mapInPandas`
 Key: SPARK-45931
 URL: https://issues.apache.org/jira/browse/SPARK-45931
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Refine the docstring of the mapInPandas function.






[jira] [Created] (SPARK-45930) Allow non-deterministic Python UDFs in MapInPandas/MapInArrow

2023-11-14 Thread Allison Wang (Jira)
Allison Wang created SPARK-45930:


 Summary: Allow non-deterministic Python UDFs in 
MapInPandas/MapInArrow
 Key: SPARK-45930
 URL: https://issues.apache.org/jira/browse/SPARK-45930
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Currently, if a Python UDF is non-deterministic, the analyzer will fail with 
this error: [INVALID_NON_DETERMINISTIC_EXPRESSIONS] The operator expects a 
deterministic expression, but the actual expression is "pyUDF()", "a". 
SQLSTATE: 42K0E;






[jira] [Updated] (SPARK-45927) Update `path` handling in Python data source

2023-11-14 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-45927:
-
Summary: Update `path` handling in Python data source  (was: Remove `path` 
from data source constructor)

> Update `path` handling in Python data source
> 
>
> Key: SPARK-45927
> URL: https://issues.apache.org/jira/browse/SPARK-45927
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> We should not make `path` an argument to the constructor of the API.






[jira] [Created] (SPARK-45927) Remove `path` from data source constructor

2023-11-14 Thread Allison Wang (Jira)
Allison Wang created SPARK-45927:


 Summary: Remove `path` from data source constructor
 Key: SPARK-45927
 URL: https://issues.apache.org/jira/browse/SPARK-45927
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


We should not make `path` an argument to the constructor of the API.






[jira] [Created] (SPARK-45914) Support `commit` and `abort` API for Python data source write

2023-11-13 Thread Allison Wang (Jira)
Allison Wang created SPARK-45914:


 Summary: Support `commit` and `abort` API for Python data source 
write
 Key: SPARK-45914
 URL: https://issues.apache.org/jira/browse/SPARK-45914
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Support `commit` and `abort` API for Python data source write.
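As a rough illustration of the commit/abort contract (class names, method signatures, and the commit-message format below are assumptions, not the final API): each task's write returns a commit message, and the driver either calls `commit()` with all messages once every task succeeds, or `abort()` if any task fails.

```python
class SimpleWriter:
    def __init__(self):
        self.committed = None
        self.aborted = False

    def write(self, partition_rows):
        # Write one partition's rows; return a "commit message" for the driver.
        return {"rows_written": len(partition_rows)}

    def commit(self, messages):
        # Finalize the write once every task has succeeded.
        self.committed = sum(m["rows_written"] for m in messages)

    def abort(self, messages):
        # Clean up partial output after a failed write.
        self.aborted = True

def run_write(writer, partitions):
    # Driver-side loop: collect messages, commit on success, abort on failure.
    messages = []
    try:
        for rows in partitions:
            messages.append(writer.write(rows))
    except Exception:
        writer.abort(messages)
        raise
    writer.commit(messages)

writer = SimpleWriter()
run_write(writer, [[1, 2], [3]])
```

The point of the two-phase shape is that no output becomes visible until `commit()` runs exactly once.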






[jira] [Updated] (SPARK-45525) Initial support for Python data source write API

2023-11-13 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-45525:
-
Description: Add a new command and logical rules (similar to V1Writes and 
V2Writes) to support Python data source write.  (was: Support for Python data 
source write API)

> Initial support for Python data source write API
> 
>
> Key: SPARK-45525
> URL: https://issues.apache.org/jira/browse/SPARK-45525
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> Add a new command and logical rules (similar to V1Writes and V2Writes) to 
> support Python data source write.






[jira] [Updated] (SPARK-45600) Make Python data source registration session level

2023-11-09 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-45600:
-
Description: Currently, registered data sources are stored in `sharedState` 
and can be accessed across multiple sessions. This, however, will not work with 
Spark Connect. We should make this registration session level, and support 
static registration (e.g. using pip install) in the future.  (was: Currently we 
have added a few instance variables to store information for Python data source 
reader. We should have a dedicated reader class for Python data source to make 
the current DataFrameReader clean.)

> Make Python data source registration session level
> --
>
> Key: SPARK-45600
> URL: https://issues.apache.org/jira/browse/SPARK-45600
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> Currently, registered data sources are stored in `sharedState` and can be 
> accessed across multiple sessions. This, however, will not work with Spark 
> Connect. We should make this registration session level, and support static 
> registration (e.g. using pip install) in the future.






[jira] [Updated] (SPARK-45600) Make data source registration session level

2023-11-09 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-45600:
-
Summary: Make data source registration session level  (was: Separate the 
Python data source logic from DataFrameReader)

> Make data source registration session level
> ---
>
> Key: SPARK-45600
> URL: https://issues.apache.org/jira/browse/SPARK-45600
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> Currently we have added a few instance variables to store information for 
> Python data source reader. We should have a dedicated reader class for Python 
> data source to make the current DataFrameReader clean.






[jira] [Updated] (SPARK-45600) Make Python data source registration session level

2023-11-09 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-45600:
-
Summary: Make Python data source registration session level  (was: Make 
data source registration session level)

> Make Python data source registration session level
> --
>
> Key: SPARK-45600
> URL: https://issues.apache.org/jira/browse/SPARK-45600
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> Currently we have added a few instance variables to store information for 
> Python data source reader. We should have a dedicated reader class for Python 
> data source to make the current DataFrameReader clean.






[jira] [Created] (SPARK-45865) Add user guide for window operations

2023-11-09 Thread Allison Wang (Jira)
Allison Wang created SPARK-45865:


 Summary: Add user guide for window operations
 Key: SPARK-45865
 URL: https://issues.apache.org/jira/browse/SPARK-45865
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Add a simple user guide for window operations.






[jira] [Created] (SPARK-45864) Add user guide for groupby and aggregate

2023-11-09 Thread Allison Wang (Jira)
Allison Wang created SPARK-45864:


 Summary: Add user guide for groupby and aggregate
 Key: SPARK-45864
 URL: https://issues.apache.org/jira/browse/SPARK-45864
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Add a simple user guide to showcase common DataFrame operations involving group 
by and aggregate functions (min, max, count, sum, etc.).






[jira] [Created] (SPARK-45863) Add user guide for column selections

2023-11-09 Thread Allison Wang (Jira)
Allison Wang created SPARK-45863:


 Summary: Add user guide for column selections
 Key: SPARK-45863
 URL: https://issues.apache.org/jira/browse/SPARK-45863
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Add a simple user guide for column selections in PySpark. This should cover the 
following APIs: lit, df.col, and common column operations such as removing a 
column from a DataFrame, adding new columns, dropping a duplicate column, etc.






[jira] [Created] (SPARK-45862) Add user guide for basic dataframe operations

2023-11-09 Thread Allison Wang (Jira)
Allison Wang created SPARK-45862:


 Summary: Add user guide for basic dataframe operations
 Key: SPARK-45862
 URL: https://issues.apache.org/jira/browse/SPARK-45862
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Add a simple user guide for basic DataFrame operations. This user guide should 
include the following APIs: select, filter, collect, show






[jira] [Created] (SPARK-45861) Add user guide for dataframe creation

2023-11-09 Thread Allison Wang (Jira)
Allison Wang created SPARK-45861:


 Summary: Add user guide for dataframe creation
 Key: SPARK-45861
 URL: https://issues.apache.org/jira/browse/SPARK-45861
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Add a simple user guide for data frame creation.

This user guide should cover the following APIs:
 # spark.createDataFrame
 # spark.read.format(...) (can be CSV, JSON, Parquet, etc.)






[jira] [Created] (SPARK-45783) Improve exception message when no remote url is set

2023-11-03 Thread Allison Wang (Jira)
Allison Wang created SPARK-45783:


 Summary: Improve exception message when no remote url is set 
 Key: SPARK-45783
 URL: https://issues.apache.org/jira/browse/SPARK-45783
 Project: Spark
  Issue Type: Improvement
  Components: Connect, PySpark
Affects Versions: 3.5.0, 4.0.0
Reporter: Allison Wang


When "SPARK_CONNECT_MODE_ENABLED" is set but no Spark remote URL is provided, 
PySpark currently throws this exception:

AttributeError: 'NoneType' object has no attribute 'startswith'

We should improve this.
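A minimal sketch of the kind of up-front validation that would replace the opaque AttributeError; the function name, error types, and message text here are illustrative only, not the actual PySpark fix.

```python
def check_remote_url(url):
    # Fail fast with a descriptive message instead of letting
    # url.startswith(...) raise AttributeError on None.
    if url is None:
        raise RuntimeError(
            "Spark Connect mode is enabled but no remote URL is set; "
            "set the SPARK_REMOTE environment variable or call "
            ".remote(...) on the session builder."
        )
    if not url.startswith("sc://"):
        raise ValueError(f"Invalid Spark Connect URL: {url}")
    return url
```

The check surfaces what is actually wrong (a missing configuration) rather than an incidental attribute access on None.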






[jira] [Created] (SPARK-45773) Refine docstring of `SparkSession.builder.config`

2023-11-02 Thread Allison Wang (Jira)
Allison Wang created SPARK-45773:


 Summary: Refine docstring of `SparkSession.builder.config`
 Key: SPARK-45773
 URL: https://issues.apache.org/jira/browse/SPARK-45773
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Refine the docstring of SparkSession.builder.config

 






[jira] [Updated] (SPARK-45765) Improve error messages when loading multiple paths in PySpark

2023-11-01 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-45765:
-
Description: 
Currently, the error message is super confusing when a user tries to load 
multiple paths incorrectly.

For example, `spark.read.format("json").load("p1", "p2")` will have this error:

An error occurred while calling o36.load.
: org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] Failed 
to find the data source: p2. Please find packages at 
`[https://spark.apache.org/third-party-projects.html]`. SQLSTATE: 42K02

This can be confusing, but it is a valid error message, as "p2" will be treated 
as the `format` argument of the load() method.

  was:
Currently, the error message is super confusing when a user tries to load 
multiple paths incorrectly.

For example, `spark.read.format("json").load("p1", "p2")` will have this error:

An error occurred while calling o36.load.
: org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] Failed 
to find the data source: p2. Please find packages at 
`https://spark.apache.org/third-party-projects.html`. SQLSTATE: 42K02

We should fix this.


> Improve error messages when loading multiple paths in PySpark
> -
>
> Key: SPARK-45765
> URL: https://issues.apache.org/jira/browse/SPARK-45765
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> Currently, the error message is super confusing when a user tries to load 
> multiple paths incorrectly.
> For example, `spark.read.format("json").load("p1", "p2")` will have this 
> error:
> An error occurred while calling o36.load.
> : org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] 
> Failed to find the data source: p2. Please find packages at 
> `[https://spark.apache.org/third-party-projects.html]`. SQLSTATE: 42K02
> This can be confusing, but it is a valid error message, as "p2" will be 
> treated as the `format` argument of the load() method.






[jira] [Resolved] (SPARK-45765) Improve error messages when loading multiple paths in PySpark

2023-11-01 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang resolved SPARK-45765.
--
Resolution: Invalid

> Improve error messages when loading multiple paths in PySpark
> -
>
> Key: SPARK-45765
> URL: https://issues.apache.org/jira/browse/SPARK-45765
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> Currently, the error message is super confusing when a user tries to load 
> multiple paths incorrectly.
> For example, `spark.read.format("json").load("p1", "p2")` will have this 
> error:
> An error occurred while calling o36.load.
> : org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] 
> Failed to find the data source: p2. Please find packages at 
> `https://spark.apache.org/third-party-projects.html`. SQLSTATE: 42K02
> We should fix this.






[jira] [Created] (SPARK-45765) Improve error messages when loading multiple paths in PySpark

2023-11-01 Thread Allison Wang (Jira)
Allison Wang created SPARK-45765:


 Summary: Improve error messages when loading multiple paths in 
PySpark
 Key: SPARK-45765
 URL: https://issues.apache.org/jira/browse/SPARK-45765
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Currently, the error message is super confusing when a user tries to load 
multiple paths incorrectly.

For example, `spark.read.format("json").load("p1", "p2")` will have this error:

An error occurred while calling o36.load.
: org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] Failed 
to find the data source: p2. Please find packages at 
`https://spark.apache.org/third-party-projects.html`. SQLSTATE: 42K02

We should fix this.
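The behavior follows from the signature of DataFrameReader.load, whose second positional parameter is the format. A pure-Python mock of that signature (a simplified stand-in, not the real method) shows how "p2" ends up bound to `format`, and how passing a list keeps both values as paths:

```python
def load(path=None, format=None, schema=None, **options):
    # Simplified stand-in for DataFrameReader.load: only records how the
    # positional arguments were bound (schema/options ignored here).
    return {"path": path, "format": format}

# "p2" silently lands in `format`, which later fails data-source lookup:
call = load("p1", "p2")

# Passing a list keeps both values as paths:
ok = load(["p1", "p2"], "json")
```

This is why the DATA_SOURCE_NOT_FOUND error names "p2": Spark really was asked to load format "p2".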






[jira] [Updated] (SPARK-45764) Make code block copyable

2023-11-01 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-45764:
-
Description: 
We should consider adding a copy button next to the pyspark code blocks.

For example this plugin: [https://sphinx-copybutton.readthedocs.io/en/latest/]

  was:
We should consider 

For example this plugin: [https://sphinx-copybutton.readthedocs.io/en/latest/]


> Make code block copyable
> 
>
> Key: SPARK-45764
> URL: https://issues.apache.org/jira/browse/SPARK-45764
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> We should consider adding a copy button next to the pyspark code blocks.
> For example this plugin: [https://sphinx-copybutton.readthedocs.io/en/latest/]






[jira] [Commented] (SPARK-45764) Make code block copyable

2023-11-01 Thread Allison Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781887#comment-17781887
 ] 

Allison Wang commented on SPARK-45764:
--

cc [~podongfeng] WDYT?

> Make code block copyable
> 
>
> Key: SPARK-45764
> URL: https://issues.apache.org/jira/browse/SPARK-45764
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> We should consider adding a copy button next to the pyspark code blocks.
> For example this plugin: [https://sphinx-copybutton.readthedocs.io/en/latest/]






[jira] [Created] (SPARK-45764) Make code block copyable

2023-11-01 Thread Allison Wang (Jira)
Allison Wang created SPARK-45764:


 Summary: Make code block copyable
 Key: SPARK-45764
 URL: https://issues.apache.org/jira/browse/SPARK-45764
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


We should consider 

For example this plugin: [https://sphinx-copybutton.readthedocs.io/en/latest/]






[jira] [Updated] (SPARK-45713) Support registering Python data sources

2023-10-27 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-45713:
-
Description: 
Support registering Python data sources.

Users can register a Python data source and later reference it by its name.
{code:java}
class MyDataSource(DataSource):
@classmethod
def name(cls):
return "my-data-source"

spark.dataSource.register(MyDataSource){code}
Users can then use the name of the data source as the format (will be supported 
in SPARK-45639)
{code:java}
spark.read.format("my-data-source").load(){code}

  was:
Support registering Python data sources.

Users can register a Python data source and later reference it by its name.
{code:java}
class MyDataSource(DataSource):
@classmethod
def name(cls):
return "my-data-source"

spark.dataSource.register(MyDataSource){code}
Users can then use the name of the data source as the format SPARK-45639
{code:java}
spark.read.format("my-data-source").load(){code}


> Support registering Python data sources
> ---
>
> Key: SPARK-45713
> URL: https://issues.apache.org/jira/browse/SPARK-45713
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> Support registering Python data sources.
> Users can register a Python data source and later reference it by its name.
> {code:java}
> class MyDataSource(DataSource):
> @classmethod
> def name(cls):
> return "my-data-source"
> spark.dataSource.register(MyDataSource){code}
> Users can then use the name of the data source as the format (will be 
> supported in SPARK-45639)
> {code:java}
> spark.read.format("my-data-source").load(){code}






[jira] [Updated] (SPARK-45713) Support registering Python data sources

2023-10-27 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-45713:
-
Description: 
Support registering Python data sources.

Users can register a Python data source and later reference it by its name.
{code:java}
class MyDataSource(DataSource):
@classmethod
def name(cls):
return "my-data-source"

spark.dataSource.register(MyDataSource){code}
Users can then use the name of the data source as the format SPARK-45639
{code:java}
spark.read.format("my-data-source").load(){code}

  was:
Support registering Python data sources.

Users can register a Python data source and later reference it by its name.
{code:java}
class MyDataSource(DataSource):
@classmethod
def name(cls):
return "my-data-source"

spark.dataSource.register(MyDataSource){code}
Users can then use the name of the data source as the format
{code:java}
spark.read.format("my-data-source").load(){code}


> Support registering Python data sources
> ---
>
> Key: SPARK-45713
> URL: https://issues.apache.org/jira/browse/SPARK-45713
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> Support registering Python data sources.
> Users can register a Python data source and later reference it by its name.
> {code:java}
> class MyDataSource(DataSource):
> @classmethod
> def name(cls):
> return "my-data-source"
> spark.dataSource.register(MyDataSource){code}
> Users can then use the name of the data source as the format SPARK-45639
> {code:java}
> spark.read.format("my-data-source").load(){code}






[jira] [Created] (SPARK-45713) Support registering Python data sources

2023-10-27 Thread Allison Wang (Jira)
Allison Wang created SPARK-45713:


 Summary: Support registering Python data sources
 Key: SPARK-45713
 URL: https://issues.apache.org/jira/browse/SPARK-45713
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Support registering Python data sources.

Users can register a Python data source and later reference it by its name.
{code:java}
class MyDataSource(DataSource):
@classmethod
def name(cls):
return "my-data-source"

spark.dataSource.register(MyDataSource){code}
Users can then use the name of the data source as the format
{code:java}
spark.read.format("my-data-source").load(){code}






[jira] [Updated] (SPARK-45639) Support loading Python data sources in DataFrameReader

2023-10-27 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-45639:
-
Description: 
Allow users to read from a Python data source using 
`spark.read.format(...).load()` in PySpark. For example

Users can extend the DataSource and the DataSourceReader classes to create 
their own Python data source reader and use them in PySpark:
{code:java}
class MyReader(DataSourceReader):
    def read(self, partition):
        yield (0, 1)

class MyDataSource(DataSource):
    def schema(self):
        return "id INT, value INT"

    def reader(self, schema):
        return MyReader()

df = spark.read.format("MyDataSource").load()
df.show()
+---+-+
| id|value|
+---+-+
|  0|    1|
+---+-+
{code}
 

  was:
Allow users to read from a Python data source using 
`spark.read.format(...).load()`

For example

Users can extend the DataSource and the DataSourceReader classes to create 
their own Python data source reader and use them in PySpark:
{code:java}
class MyReader(DataSourceReader):
    def read(self, partition):
        yield (0, 1)

class MyDataSource(DataSource):
    def schema(self):
        return "id INT, value INT"
    def reader(self, schema):
        return MyReader()

df = spark.read.format("MyDataSource").load()
df.show()
+---+-+
| id|value|
+---+-+
|  0|    1|
+---+-+
{code}
 


> Support loading Python data sources in DataFrameReader
> --
>
> Key: SPARK-45639
> URL: https://issues.apache.org/jira/browse/SPARK-45639
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> Allow users to read from a Python data source using 
> `spark.read.format(...).load()` in PySpark. For example
> Users can extend the DataSource and the DataSourceReader classes to create 
> their own Python data source reader and use them in PySpark:
> {code:java}
> class MyReader(DataSourceReader):
>     def read(self, partition):
>         yield (0, 1)
> class MyDataSource(DataSource):
>     def schema(self):
>         return "id INT, value INT"
>     
> def reader(self, schema):
>         return MyReader()
> df = spark.read.format("MyDataSource").load()
> df.show()
> +---+-+
> | id|value|
> +---+-+
> |  0|    1|
> +---+-+
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45639) Support loading Python data sources in DataFrameReader

2023-10-27 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-45639:
-
Summary: Support loading Python data sources in DataFrameReader  (was: 
Support Python data source in DataFrameReader)

> Support loading Python data sources in DataFrameReader
> --
>
> Key: SPARK-45639
> URL: https://issues.apache.org/jira/browse/SPARK-45639
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> Allow users to read from a Python data source using 
> `spark.read.format(...).load()`
> For example
> Users can extend the DataSource and the DataSourceReader classes to create 
> their own Python data source reader and use them in PySpark:
> {code:java}
> class MyReader(DataSourceReader):
>     def read(self, partition):
>         yield (0, 1)
> class MyDataSource(DataSource):
>     def schema(self):
>         return "id INT, value INT"
>     def reader(self, schema):
>         return MyReader()
> df = spark.read.format("MyDataSource").load()
> df.show()
> +---+-+
> | id|value|
> +---+-+
> |  0|    1|
> +---+-+
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45654) Add Python data source write API

2023-10-24 Thread Allison Wang (Jira)
Allison Wang created SPARK-45654:


 Summary: Add Python data source write API
 Key: SPARK-45654
 URL: https://issues.apache.org/jira/browse/SPARK-45654
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Add Python data source write API in datasource.py 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45639) Support Python data source in DataFrameReader

2023-10-23 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-45639:
-
Description: 
Allow users to read from a Python data source using 
`spark.read.format(...).load()`

For example

Users can extend the DataSource and the DataSourceReader classes to create 
their own Python data source reader and use them in PySpark:
{code:java}
class MyReader(DataSourceReader):
    def read(self, partition):
        yield (0, 1)

class MyDataSource(DataSource):
    def schema(self):
        return "id INT, value INT"
    def reader(self, schema):
        return MyReader()

df = spark.read.format("MyDataSource").load()
df.show()
+---+-+
| id|value|
+---+-+
|  0|    1|
+---+-+
{code}
 

  was:
Allow users to read from a Python data source using 
`spark.read.format(...).load()`

For example

Users can extend the DataSource and the DataSourceReader classes to create 
their own Python data source reader and use them in PySpark:
{code:java}
class MyReader(DataSourceReader):
    def read(self, partition):
        yield (0, 1)
class MyDataSource(DataSource):
    def schema(self):
        return "id INT, value INT"
    def reader(self, schema):
        return MyReader()
df = spark.read.format("MyDataSource").load()
df.show()
df.show()
+---+-+
| id|value|
+---+-+
|  0|    1|
+---+-+
{code}
 


> Support Python data source in DataFrameReader
> -
>
> Key: SPARK-45639
> URL: https://issues.apache.org/jira/browse/SPARK-45639
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> Allow users to read from a Python data source using 
> `spark.read.format(...).load()`
> For example
> Users can extend the DataSource and the DataSourceReader classes to create 
> their own Python data source reader and use them in PySpark:
> {code:java}
> class MyReader(DataSourceReader):
>     def read(self, partition):
>         yield (0, 1)
> class MyDataSource(DataSource):
>     def schema(self):
>         return "id INT, value INT"
>     def reader(self, schema):
>         return MyReader()
> df = spark.read.format("MyDataSource").load()
> df.show()
> +---+-+
> | id|value|
> +---+-+
> |  0|    1|
> +---+-+
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45639) Support Python data source in DataFrameReader

2023-10-23 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-45639:
-
Description: 
Allow users to read from a Python data source using 
`spark.read.format(...).load()`

For example

Users can extend the DataSource and the DataSourceReader classes to create 
their own Python data source reader and use them in PySpark:
{code:java}
class MyReader(DataSourceReader):
    def read(self, partition):
        yield (0, 1)
class MyDataSource(DataSource):
    def schema(self):
        return "id INT, value INT"
    def reader(self, schema):
        return MyReader()
df = spark.read.format("MyDataSource").load()
df.show()
df.show()
+---+-+
| id|value|
+---+-+
|  0|    1|
+---+-+
{code}
 

  was:
Allow users to read from a Python data source using 
`spark.read.format(...).load()`

For example

Users can extend the DataSource and the DataSourceReader classes to create 
their own Python data source reader and use them in PySpark:
```python
class MyReader(DataSourceReader):
    def read(self, partition):
        yield (0, 1)

class MyDataSource(DataSource):
    def schema(self):
        return "id INT, value INT"

    def reader(self, schema):
        return MyReader()

df = spark.read.format("MyDataSource").load()
df.show()
+---+-+
| id|value|
+---+-+
|  0|    1|
+---+-+
```


> Support Python data source in DataFrameReader
> -
>
> Key: SPARK-45639
> URL: https://issues.apache.org/jira/browse/SPARK-45639
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> Allow users to read from a Python data source using 
> `spark.read.format(...).load()`
> For example
> Users can extend the DataSource and the DataSourceReader classes to create 
> their own Python data source reader and use them in PySpark:
> {code:java}
> class MyReader(DataSourceReader):
>     def read(self, partition):
>         yield (0, 1)
> class MyDataSource(DataSource):
>     def schema(self):
>         return "id INT, value INT"
>     def reader(self, schema):
>         return MyReader()
> df = spark.read.format("MyDataSource").load()
> df.show()
> df.show()
> +---+-+
> | id|value|
> +---+-+
> |  0|    1|
> +---+-+
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45639) Support Python data source in DataFrameReader

2023-10-23 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-45639:
-
Description: 
Allow users to read from a Python data source using 
`spark.read.format(...).load()`

For example

Users can extend the DataSource and the DataSourceReader classes to create 
their own Python data source reader and use them in PySpark:
```python
class MyReader(DataSourceReader):
    def read(self, partition):
        yield (0, 1)

class MyDataSource(DataSource):
    def schema(self):
        return "id INT, value INT"

    def reader(self, schema):
        return MyReader()

df = spark.read.format("MyDataSource").load()
df.show()
+---+-+
| id|value|
+---+-+
|  0|    1|
+---+-+
```

  was:Allow users to read from a Python data source using 
`spark.read.format(...).load()`


> Support Python data source in DataFrameReader
> -
>
> Key: SPARK-45639
> URL: https://issues.apache.org/jira/browse/SPARK-45639
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> Allow users to read from a Python data source using 
> `spark.read.format(...).load()`
> For example
> Users can extend the DataSource and the DataSourceReader classes to create 
> their own Python data source reader and use them in PySpark:
> ```python
> class MyReader(DataSourceReader):
>     def read(self, partition):
>         yield (0, 1)
> class MyDataSource(DataSource):
>     def schema(self):
>         return "id INT, value INT"
>     def reader(self, schema):
>         return MyReader()
> df = spark.read.format("MyDataSource").load()
> df.show()
> +---+-+
> | id|value|
> +---+-+
> |  0|    1|
> +---+-+
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45639) Support Python data source in DataFrameReader

2023-10-23 Thread Allison Wang (Jira)
Allison Wang created SPARK-45639:


 Summary: Support Python data source in DataFrameReader
 Key: SPARK-45639
 URL: https://issues.apache.org/jira/browse/SPARK-45639
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Allow users to read from a Python data source using 
`spark.read.format(...).load()`



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45524) Initial support for Python data source read API

2023-10-23 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-45524:
-
Description: 
Add APIs for the data source and data source reader, and add Catalyst + 
execution support.

 

  was:Support Python data source API for reading data.


> Initial support for Python data source read API
> ---
>
> Key: SPARK-45524
> URL: https://issues.apache.org/jira/browse/SPARK-45524
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>  Labels: pull-request-available
>
> Add API for data source and data source reader and add Catalyst + execution 
> support.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45023) SPIP: Python Stored Procedures

2023-10-20 Thread Allison Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1954#comment-1954
 ] 

Allison Wang commented on SPARK-45023:
--

[~abhinavofficial] this proposal is on hold, given the feedback received from 
the SPIP.

> SPIP: Python Stored Procedures
> --
>
> Key: SPARK-45023
> URL: https://issues.apache.org/jira/browse/SPARK-45023
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> Stored procedures are an extension of the ANSI SQL standard. They play a 
> crucial role in improving the capabilities of SQL by encapsulating complex 
> logic into reusable routines. 
> This proposal aims to extend Spark SQL by introducing support for stored 
> procedures, starting with Python as the procedural language. This addition 
> will allow users to execute procedural programs, leveraging programming 
> constructs of Python to perform tasks with complex logic. Additionally, users 
> can persist these procedural routines in catalogs such as HMS for future 
> reuse. By providing this functionality, we intend to seamlessly empower Spark 
> users to integrate with Python routines within their SQL workflows.
> {*}SPIP{*}: 
> [https://docs.google.com/document/d/1ce2EZrf2BxHu7TjfGn4TgToK3TBYYzRkmsIVcfmkNzE/edit?usp=sharing]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45023) SPIP: Python Stored Procedures

2023-10-20 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang resolved SPARK-45023.
--
Resolution: Won't Do

> SPIP: Python Stored Procedures
> --
>
> Key: SPARK-45023
> URL: https://issues.apache.org/jira/browse/SPARK-45023
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> Stored procedures are an extension of the ANSI SQL standard. They play a 
> crucial role in improving the capabilities of SQL by encapsulating complex 
> logic into reusable routines. 
> This proposal aims to extend Spark SQL by introducing support for stored 
> procedures, starting with Python as the procedural language. This addition 
> will allow users to execute procedural programs, leveraging programming 
> constructs of Python to perform tasks with complex logic. Additionally, users 
> can persist these procedural routines in catalogs such as HMS for future 
> reuse. By providing this functionality, we intend to seamlessly empower Spark 
> users to integrate with Python routines within their SQL workflows.
> {*}SPIP{*}: 
> [https://docs.google.com/document/d/1ce2EZrf2BxHu7TjfGn4TgToK3TBYYzRkmsIVcfmkNzE/edit?usp=sharing]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45600) Separate the Python data source logic from DataFrameReader

2023-10-18 Thread Allison Wang (Jira)
Allison Wang created SPARK-45600:


 Summary: Separate the Python data source logic from DataFrameReader
 Key: SPARK-45600
 URL: https://issues.apache.org/jira/browse/SPARK-45600
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Currently we have added a few instance variables to store information for the 
Python data source reader. We should have a dedicated reader class for Python 
data sources to keep the current DataFrameReader clean.
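The refactor described above, moving per-source state out of the main reader and into a dedicated class, can be sketched in pure Python. The class and method names below are illustrative stand-ins, not Spark's actual internals.

```python
class PythonDataSourceRead:
    """Dedicated holder for Python data source read state, instead of
    scattering instance variables across DataFrameReader."""

    def __init__(self, fmt, schema=None, options=None):
        self.fmt = fmt
        self.schema = schema
        self.options = options or {}


class DataFrameReader:
    def __init__(self):
        # A single handle replaces several ad hoc instance variables.
        self._python_source = None

    def format(self, fmt):
        self._python_source = PythonDataSourceRead(fmt)
        return self

    def schema(self, schema):
        if self._python_source is not None:
            self._python_source.schema = schema
        return self

    def load(self):
        src = self._python_source
        return f"reading {src.fmt} with schema {src.schema}"


reader = DataFrameReader().format("my-data-source").schema("id INT")
print(reader.load())
```

The design benefit is that DataFrameReader only needs to know it holds one optional Python-source handle; everything specific to the Python path lives in the dedicated class.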



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45559) Support spark.read.schema(...) for Python data source API

2023-10-18 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-45559:
-
Description: 
Support `spark.read.schema(...)` for Python data source read.

Add test cases where the schema is passed as a string instead of a StructType, 
covering both a positive case and a negative case where the string fails to 
parse with fromDDL.

  was:Support `spark.read.schema(...)` for Python data source read


> Support spark.read.schema(...) for Python data source API
> -
>
> Key: SPARK-45559
> URL: https://issues.apache.org/jira/browse/SPARK-45559
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> Support `spark.read.schema(...)` for Python data source read.
> Add test cases where the schema is passed as a string instead of a 
> StructType, covering both a positive case and a negative case where the 
> string fails to parse with fromDDL.
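The positive/negative test cases described above can be sketched without Spark. The `from_ddl` helper below is a deliberately simplified, hypothetical stand-in for `StructType.fromDDL`, not PySpark's parser; it only illustrates the shape of the two cases.

```python
VALID_TYPES = {"INT", "LONG", "DOUBLE", "STRING"}  # illustrative subset


def from_ddl(schema: str):
    """Parse a 'name TYPE, name TYPE' string into (name, type) pairs."""
    fields = []
    for part in schema.split(","):
        tokens = part.split()
        if len(tokens) != 2 or tokens[1].upper() not in VALID_TYPES:
            raise ValueError(f"PARSE_SYNTAX_ERROR: cannot parse field {part!r}")
        fields.append((tokens[0], tokens[1].upper()))
    return fields


# Positive case: a string schema parses into the expected fields.
assert from_ddl("id INT, value INT") == [("id", "INT"), ("value", "INT")]

# Negative case: a malformed schema surfaces a parse error.
try:
    from_ddl("id INT value")
except ValueError as e:
    print("rejected:", e)
```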



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45597) Support creating table using a Python data source in SQL

2023-10-18 Thread Allison Wang (Jira)
Allison Wang created SPARK-45597:


 Summary: Support creating table using a Python data source in SQL
 Key: SPARK-45597
 URL: https://issues.apache.org/jira/browse/SPARK-45597
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Support creating a table using a Python data source in a SQL query:

For instance:

`CREATE TABLE tableName() USING  OPTIONS `



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45584) Execution fails when there are subqueries in TakeOrderedAndProjectExec

2023-10-17 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-45584:
-
Description: 
When there are subqueries in TakeOrderedAndProjectExec, the query can throw 
this exception:

 java.lang.IllegalArgumentException: requirement failed: Subquery subquery#242, 
[id=#109] has not finished 

This is because TakeOrderedAndProjectExec does not wait for subquery execution.

  was:
When there are subqueries in TakeOrderedAndProjectExec, the query can throw 
this exception:

 java.lang.IllegalArgumentException: requirement failed: Subquery subquery#242, 
[id=#109] has not finished 


> Execution fails when there are subqueries in TakeOrderedAndProjectExec
> --
>
> Key: SPARK-45584
> URL: https://issues.apache.org/jira/browse/SPARK-45584
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Allison Wang
>Priority: Major
>
> When there are subqueries in TakeOrderedAndProjectExec, the query can throw 
> this exception:
>  java.lang.IllegalArgumentException: requirement failed: Subquery 
> subquery#242, [id=#109] has not finished 
> This is because TakeOrderedAndProjectExec does not wait for subquery 
> execution.
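The stated cause, an operator consuming subquery results before the subqueries finish, can be sketched with a toy operator built on `concurrent.futures`. The class and method names are illustrative, not Spark's SparkPlan API.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class ToyOperator:
    """Toy physical operator that owns asynchronously executed subqueries."""

    def __init__(self, executor):
        self._subqueries = [executor.submit(self._run_subquery)]

    @staticmethod
    def _run_subquery():
        time.sleep(0.05)  # simulate subquery work
        return 42

    def execute_unsafely(self):
        # Buggy path: reads results without waiting, analogous to the
        # reported TakeOrderedAndProjectExec behavior. If a subquery is
        # still running this raises the "requirement failed" analog.
        for f in self._subqueries:
            if not f.done():
                raise RuntimeError(
                    "requirement failed: subquery has not finished")
        return [f.result() for f in self._subqueries]

    def execute(self):
        # Fixed path: block until every subquery future completes first.
        return [f.result() for f in self._subqueries]


with ThreadPoolExecutor() as pool:
    op = ToyOperator(pool)
    print(op.execute())  # waiting first always succeeds
```

Calling `execute_unsafely()` immediately after construction would typically hit the error, since the simulated subquery is still sleeping.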



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45584) Execution fails when there are subqueries in TakeOrderedAndProjectExec

2023-10-17 Thread Allison Wang (Jira)
Allison Wang created SPARK-45584:


 Summary: Execution fails when there are subqueries in 
TakeOrderedAndProjectExec
 Key: SPARK-45584
 URL: https://issues.apache.org/jira/browse/SPARK-45584
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0
Reporter: Allison Wang


When there are subqueries in TakeOrderedAndProjectExec, the query can throw 
this exception:

 java.lang.IllegalArgumentException: requirement failed: Subquery subquery#242, 
[id=#109] has not finished 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45560) Support spark.read.load() with non-empty path for Python data source API

2023-10-16 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-45560:
-
Summary: Support spark.read.load() with non-empty path for Python data 
source API  (was: Support spark.read.load() with paths for Python data source 
API)

> Support spark.read.load() with non-empty path for Python data source API
> 
>
> Key: SPARK-45560
> URL: https://issues.apache.org/jira/browse/SPARK-45560
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> Support non-empty path for Python data source read: 
> `spark.read.format(..).load(path)` 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45559) Support spark.read.schema(...) for Python data source API

2023-10-16 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-45559:
-
Summary: Support spark.read.schema(...) for Python data source API  (was: 
Support df.read.schema(...) for Python data source API)

> Support spark.read.schema(...) for Python data source API
> -
>
> Key: SPARK-45559
> URL: https://issues.apache.org/jira/browse/SPARK-45559
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> Support `spark.read.schema(...)` for Python data source read



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45560) Support spark.read.load() with paths for Python data source API

2023-10-16 Thread Allison Wang (Jira)
Allison Wang created SPARK-45560:


 Summary: Support spark.read.load() with paths for Python data 
source API
 Key: SPARK-45560
 URL: https://issues.apache.org/jira/browse/SPARK-45560
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Support non-empty path for Python data source read: 
`spark.read.format(..).load(path)` 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45559) Support df.read.schema(...) for Python data source API

2023-10-16 Thread Allison Wang (Jira)
Allison Wang created SPARK-45559:


 Summary: Support df.read.schema(...) for Python data source API
 Key: SPARK-45559
 URL: https://issues.apache.org/jira/browse/SPARK-45559
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Support `spark.read.schema(...)` for Python data source read



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45526) Refine docstring of `options` for dataframe reader and writer

2023-10-12 Thread Allison Wang (Jira)
Allison Wang created SPARK-45526:


 Summary: Refine docstring of `options` for dataframe reader and 
writer
 Key: SPARK-45526
 URL: https://issues.apache.org/jira/browse/SPARK-45526
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Refine the docstring of the `options` method of DataFrameReader and 
DataFrameWriter.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45525) Initial support for Python data source write API

2023-10-12 Thread Allison Wang (Jira)
Allison Wang created SPARK-45525:


 Summary: Initial support for Python data source write API
 Key: SPARK-45525
 URL: https://issues.apache.org/jira/browse/SPARK-45525
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Support for Python data source write API



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45524) Initial support for Python data source read API

2023-10-12 Thread Allison Wang (Jira)
Allison Wang created SPARK-45524:


 Summary: Initial support for Python data source read API
 Key: SPARK-45524
 URL: https://issues.apache.org/jira/browse/SPARK-45524
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Support Python data source API for reading data.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45509) Investigate the behavior difference in self-join

2023-10-11 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-45509:
-
Description: 
SPARK-45220 discovers a behavior difference for a self-join scenario between 
classic Spark and Spark Connect.

For instance, here is the query that works without Spark Connect: 
{code:java}
df = spark.createDataFrame([Row(name="Alice", age=2), Row(name="Bob", age=5)])
df2 = spark.createDataFrame([Row(name="Tom", height=80), Row(name="Bob", 
height=85)]){code}
{code:java}
joined = df.join(df2, df.name == df2.name, "outer").sort(sf.desc(df.name)) 
joined.show(){code}
But in Spark Connect, it throws this exception:
{code:java}
pyspark.errors.exceptions.connect.AnalysisException: 
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter 
with name `name` cannot be resolved. Did you mean one of the following? 
[`name`, `name`, `age`, `height`].;
'Sort ['name DESC NULLS LAST], true
+- Join FullOuter, (name#64 = name#78)
   :- LocalRelation [name#64, age#65L]
   +- LocalRelation [name#78, height#79L]
 {code}
 

On the other hand, this query failed in classic Spark:
{code:java}
df.join(df, df.name == df.name, "outer").select(df.name).show() {code}
{code:java}
pyspark.errors.exceptions.captured.AnalysisException: Column name#0 are 
ambiguous... {code}
 

but this query works with Spark Connect.

We need to investigate the behavior difference and fix it.

  was:
SPARK-45220 discovers a behavior difference for a self-join scenario between 
classic Spark and Spark Connect.

For instance, here is the query that works without Spark Connect: 
{code:java}
joined = df.join(df2, df.name == df2.name, "outer").sort(sf.desc(df.name)) 
joined.show(){code}
But in Spark Connect, it throws this exception:
{code:java}
pyspark.errors.exceptions.connect.AnalysisException: 
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter 
with name `name` cannot be resolved. Did you mean one of the following? 
[`name`, `name`, `age`, `height`].;
'Sort ['name DESC NULLS LAST], true
+- Join FullOuter, (name#64 = name#78)
   :- LocalRelation [name#64, age#65L]
   +- LocalRelation [name#78, height#79L]
 {code}
 

On the other hand, this query failed in classic Spark:
{code:java}
df.join(df, df.name == df.name, "outer").select(df.name).show() {code}
{code:java}
pyspark.errors.exceptions.captured.AnalysisException: Column name#0 are 
ambiguous... {code}
 

but this query works with Spark Connect.

We need to investigate the behavior difference and fix it.

 


> Investigate the behavior difference in self-join
> 
>
> Key: SPARK-45509
> URL: https://issues.apache.org/jira/browse/SPARK-45509
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> SPARK-45220 discovers a behavior difference for a self-join scenario between 
> classic Spark and Spark Connect.
> For instance, here is the query that works without Spark Connect: 
> {code:java}
> df = spark.createDataFrame([Row(name="Alice", age=2), Row(name="Bob", age=5)])
> df2 = spark.createDataFrame([Row(name="Tom", height=80), Row(name="Bob", 
> height=85)]){code}
> {code:java}
> joined = df.join(df2, df.name == df2.name, "outer").sort(sf.desc(df.name)) 
> joined.show(){code}
> But in Spark Connect, it throws this exception:
> {code:java}
> pyspark.errors.exceptions.connect.AnalysisException: 
> [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter 
> with name `name` cannot be resolved. Did you mean one of the following? 
> [`name`, `name`, `age`, `height`].;
> 'Sort ['name DESC NULLS LAST], true
> +- Join FullOuter, (name#64 = name#78)
>:- LocalRelation [name#64, age#65L]
>+- LocalRelation [name#78, height#79L]
>  {code}
>  
> On the other hand, this query failed in classic Spark:
> {code:java}
> df.join(df, df.name == df.name, "outer").select(df.name).show() {code}
> {code:java}
> pyspark.errors.exceptions.captured.AnalysisException: Column name#0 are 
> ambiguous... {code}
>  
> but this query works with Spark Connect.
> We need to investigate the behavior difference and fix it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45509) Investigate the behavior difference in self-join

2023-10-11 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-45509:
-
Description: 
SPARK-45220 discovers a behavior difference for a self-join scenario between 
classic Spark and Spark Connect.

For instance, here is the query that works without Spark Connect: 
{code:java}
joined = df.join(df2, df.name == df2.name, "outer").sort(sf.desc(df.name)) 
joined.show(){code}
But in Spark Connect, it throws this exception:
{code:java}
pyspark.errors.exceptions.connect.AnalysisException: 
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter 
with name `name` cannot be resolved. Did you mean one of the following? 
[`name`, `name`, `age`, `height`].;
'Sort ['name DESC NULLS LAST], true
+- Join FullOuter, (name#64 = name#78)
   :- LocalRelation [name#64, age#65L]
   +- LocalRelation [name#78, height#79L]
 {code}
 

On the other hand, this query fails in classic Spark (without Spark Connect):
{code:java}
df.join(df, df.name == df.name, "outer").select(df.name).show() {code}
{code:java}
pyspark.errors.exceptions.captured.AnalysisException: Column name#0 are 
ambiguous... {code}
 

but this query works with Spark Connect.

We need to investigate the behavior difference and fix it.

 

  was:
SPARK-45220 discovers a behavior difference for a self-join scenario between 
class Spark and Spark Connect.

For instance. here is the query that works without Spark Connect: 

 
{code:java}
joined = df.join(df2, df.name == df2.name, "outer").sort(sf.desc(df.name)) 
joined.show(){code}
 

But in Spark Connect, it throws this exception:

 
{code:java}
pyspark.errors.exceptions.connect.AnalysisException: 
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter 
with name `name` cannot be resolved. Did you mean one of the following? 
[`name`, `name`, `age`, `height`].;
'Sort ['name DESC NULLS LAST], true
+- Join FullOuter, (name#64 = name#78)
   :- LocalRelation [name#64, age#65L]
   +- LocalRelation [name#78, height#79L]
 {code}
 

On the other hand, this query failed in classic Spark Connect:

 
{code:java}
df.join(df, df.name == df.name, "outer").select(df.name).show() {code}
{code:java}
pyspark.errors.exceptions.captured.AnalysisException: Column name#0 are 
ambiguous... {code}
 

but this query works with Spark Connect.

We need to investigate the behavior difference and fix it.

 


> Investigate the behavior difference in self-join
> 
>
> Key: SPARK-45509
> URL: https://issues.apache.org/jira/browse/SPARK-45509
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> SPARK-45220 discovers a behavior difference for a self-join scenario between 
> classic Spark and Spark Connect.
> For instance, here is the query that works without Spark Connect: 
> {code:java}
> joined = df.join(df2, df.name == df2.name, "outer").sort(sf.desc(df.name)) 
> joined.show(){code}
> But in Spark Connect, it throws this exception:
> {code:java}
> pyspark.errors.exceptions.connect.AnalysisException: 
> [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter 
> with name `name` cannot be resolved. Did you mean one of the following? 
> [`name`, `name`, `age`, `height`].;
> 'Sort ['name DESC NULLS LAST], true
> +- Join FullOuter, (name#64 = name#78)
>:- LocalRelation [name#64, age#65L]
>+- LocalRelation [name#78, height#79L]
>  {code}
>  
> On the other hand, this query fails in classic Spark (without Spark Connect):
> {code:java}
> df.join(df, df.name == df.name, "outer").select(df.name).show() {code}
> {code:java}
> pyspark.errors.exceptions.captured.AnalysisException: Column name#0 are 
> ambiguous... {code}
>  
> but this query works with Spark Connect.
> We need to investigate the behavior difference and fix it.
>  





[jira] [Updated] (SPARK-45509) Investigate the behavior difference in self-join

2023-10-11 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-45509:
-
Description: 
SPARK-45220 discovers a behavior difference for a self-join scenario between 
class Spark and Spark Connect.

For instance. here is the query that works without Spark Connect: 

 
{code:java}
joined = df.join(df2, df.name == df2.name, "outer").sort(sf.desc(df.name)) 
joined.show(){code}
 

But in Spark Connect, it throws this exception:

 
{code:java}
pyspark.errors.exceptions.connect.AnalysisException: 
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter 
with name `name` cannot be resolved. Did you mean one of the following? 
[`name`, `name`, `age`, `height`].;
'Sort ['name DESC NULLS LAST], true
+- Join FullOuter, (name#64 = name#78)
   :- LocalRelation [name#64, age#65L]
   +- LocalRelation [name#78, height#79L]
 {code}
 

On the other hand, this query failed in classic Spark Connect:

 
{code:java}
df.join(df, df.name == df.name, "outer").select(df.name).show() {code}
{code:java}
pyspark.errors.exceptions.captured.AnalysisException: Column name#0 are 
ambiguous... {code}
 

but this query works with Spark Connect.

We need to investigate the behavior difference and fix it.

 

  was:
SAPRK-45220 discovers a behavior difference for a self-join scenario between 
class Spark and Spark Connect.

For instance. here is the query that works without Spark Connect: 

 
{code:java}
joined = df.join(df2, df.name == df2.name, "outer").sort(sf.desc(df.name)) 
joined.show(){code}
 

But in Spark Connect, it throws this exception:

 
{code:java}
pyspark.errors.exceptions.connect.AnalysisException: 
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter 
with name `name` cannot be resolved. Did you mean one of the following? 
[`name`, `name`, `age`, `height`].;
'Sort ['name DESC NULLS LAST], true
+- Join FullOuter, (name#64 = name#78)
   :- LocalRelation [name#64, age#65L]
   +- LocalRelation [name#78, height#79L]
 {code}
 

On the other hand, this query failed in classic Spark Connect:

 
{code:java}
df.join(df, df.name == df.name, "outer").select(df.name).show() {code}
{code:java}
pyspark.errors.exceptions.captured.AnalysisException: Column name#0 are 
ambiguous... {code}
 

but this query works with Spark Connect.

We need to investigate the behavior difference and fix it.

 


> Investigate the behavior difference in self-join
> 
>
> Key: SPARK-45509
> URL: https://issues.apache.org/jira/browse/SPARK-45509
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> SPARK-45220 discovers a behavior difference for a self-join scenario between 
> class Spark and Spark Connect.
> For instance. here is the query that works without Spark Connect: 
>  
> {code:java}
> joined = df.join(df2, df.name == df2.name, "outer").sort(sf.desc(df.name)) 
> joined.show(){code}
>  
> But in Spark Connect, it throws this exception:
>  
> {code:java}
> pyspark.errors.exceptions.connect.AnalysisException: 
> [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter 
> with name `name` cannot be resolved. Did you mean one of the following? 
> [`name`, `name`, `age`, `height`].;
> 'Sort ['name DESC NULLS LAST], true
> +- Join FullOuter, (name#64 = name#78)
>:- LocalRelation [name#64, age#65L]
>+- LocalRelation [name#78, height#79L]
>  {code}
>  
> On the other hand, this query failed in classic Spark Connect:
>  
> {code:java}
> df.join(df, df.name == df.name, "outer").select(df.name).show() {code}
> {code:java}
> pyspark.errors.exceptions.captured.AnalysisException: Column name#0 are 
> ambiguous... {code}
>  
> but this query works with Spark Connect.
> We need to investigate the behavior difference and fix it.
>  





[jira] [Created] (SPARK-45509) Investigate the behavior difference in self-join

2023-10-11 Thread Allison Wang (Jira)
Allison Wang created SPARK-45509:


 Summary: Investigate the behavior difference in self-join
 Key: SPARK-45509
 URL: https://issues.apache.org/jira/browse/SPARK-45509
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 3.5.0, 4.0.0
Reporter: Allison Wang


SAPRK-45220 discovers a behavior difference for a self-join scenario between 
class Spark and Spark Connect.

For instance. here is the query that works without Spark Connect: 

 
{code:java}
joined = df.join(df2, df.name == df2.name, "outer").sort(sf.desc(df.name)) 
joined.show(){code}
 

But in Spark Connect, it throws this exception:

 
{code:java}
pyspark.errors.exceptions.connect.AnalysisException: 
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter 
with name `name` cannot be resolved. Did you mean one of the following? 
[`name`, `name`, `age`, `height`].;
'Sort ['name DESC NULLS LAST], true
+- Join FullOuter, (name#64 = name#78)
   :- LocalRelation [name#64, age#65L]
   +- LocalRelation [name#78, height#79L]
 {code}
 

On the other hand, this query failed in classic Spark Connect:

 
{code:java}
df.join(df, df.name == df.name, "outer").select(df.name).show() {code}
{code:java}
pyspark.errors.exceptions.captured.AnalysisException: Column name#0 are 
ambiguous... {code}
 

but this query works with Spark Connect.

We need to investigate the behavior difference and fix it.

 




