[jira] [Updated] (SPARK-49771) Improve Pandas Scalar Iter UDF error when output rows exceed input rows
[ https://issues.apache.org/jira/browse/SPARK-49771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-49771: - Summary: Improve Pandas Scalar Iter UDF error when output rows exceed input rows (was: Improve Pandas Iter UDF error when output rows exceed input rows) > Improve Pandas Scalar Iter UDF error when output rows exceed input rows > --- > > Key: SPARK-49771 > URL: https://issues.apache.org/jira/browse/SPARK-49771 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
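The invariant behind this ticket — a Pandas SCALAR_ITER UDF must not produce more output rows than it received input rows — can be modeled in plain Python, with lists standing in for pandas batches. This is a hypothetical sketch of where such a check could live, not Spark's actual implementation:

```python
def evaluate_scalar_iter_udf(udf, batches):
    """Toy model of the scalar-iter contract: the UDF consumes an
    iterator of input batches and must yield no more output rows
    than it received input rows (names here are hypothetical)."""
    n_in = sum(len(b) for b in batches)
    out = [row for result in udf(iter(batches)) for row in result]
    if len(out) > n_in:
        # An improved error would report both counts explicitly.
        raise RuntimeError(
            f"Pandas SCALAR_ITER UDF produced {len(out)} output rows, "
            f"but only {n_in} input rows were given")
    return out
```

A well-behaved UDF yields one output row per input row; the check above surfaces the mismatch with both counts instead of an opaque internal error.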
[jira] [Created] (SPARK-49771) Improve Pandas Iter UDF error when output rows exceed input rows
Allison Wang created SPARK-49771: Summary: Improve Pandas Iter UDF error when output rows exceed input rows Key: SPARK-49771 URL: https://issues.apache.org/jira/browse/SPARK-49771 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 4.0.0 Reporter: Allison Wang
[jira] [Resolved] (SPARK-48999) [SS] Divide PythonStreamingDataSourceSimpleSuite
[ https://issues.apache.org/jira/browse/SPARK-48999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang resolved SPARK-48999. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47479 [https://github.com/apache/spark/pull/47479] > [SS] Divide PythonStreamingDataSourceSimpleSuite > > > Key: SPARK-48999 > URL: https://issues.apache.org/jira/browse/SPARK-48999 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.4.3 >Reporter: Siying Dong >Assignee: Siying Dong >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > PythonStreamingDataSourceSimpleSuite runs too long. Divide it into several > suites.
[jira] [Assigned] (SPARK-48999) [SS] Divide PythonStreamingDataSourceSimpleSuite
[ https://issues.apache.org/jira/browse/SPARK-48999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang reassigned SPARK-48999: Assignee: Siying Dong > [SS] Divide PythonStreamingDataSourceSimpleSuite > > > Key: SPARK-48999 > URL: https://issues.apache.org/jira/browse/SPARK-48999 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 3.4.3 >Reporter: Siying Dong >Assignee: Siying Dong >Priority: Major > Labels: pull-request-available > > PythonStreamingDataSourceSimpleSuite runs too long. Divide it into several > suites.
[jira] [Created] (SPARK-48938) Improve error message when registering UDTFs
Allison Wang created SPARK-48938: Summary: Improve error message when registering UDTFs Key: SPARK-48938 URL: https://issues.apache.org/jira/browse/SPARK-48938 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Allison Wang Improve the error message when registering Python UDTFs
[jira] [Created] (SPARK-48825) Unify the 'See Also' section formatting across PySpark docstrings
Allison Wang created SPARK-48825: Summary: Unify the 'See Also' section formatting across PySpark docstrings Key: SPARK-48825 URL: https://issues.apache.org/jira/browse/SPARK-48825 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 4.0.0 Reporter: Allison Wang Unify the 'See Also' section formatting across PySpark docstrings to make them consistent.
[jira] [Created] (SPARK-48785) Add a simple data source example in the user guide
Allison Wang created SPARK-48785: Summary: Add a simple data source example in the user guide Key: SPARK-48785 URL: https://issues.apache.org/jira/browse/SPARK-48785 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 4.0.0 Reporter: Allison Wang
[jira] [Created] (SPARK-48783) Update the table-valued function documentation
Allison Wang created SPARK-48783: Summary: Update the table-valued function documentation Key: SPARK-48783 URL: https://issues.apache.org/jira/browse/SPARK-48783 Project: Spark Issue Type: Sub-task Components: Documentation Affects Versions: 4.0.0 Reporter: Allison Wang
[jira] [Updated] (SPARK-48479) Support creating temp SQL functions in parser
[ https://issues.apache.org/jira/browse/SPARK-48479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-48479: - Summary: Support creating temp SQL functions in parser (was: Support creating SQL functions in parser) > Support creating temp SQL functions in parser > - > > Key: SPARK-48479 > URL: https://issues.apache.org/jira/browse/SPARK-48479 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Allison Wang >Assignee: Allison Wang >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Add Spark SQL parser for creating SQL functions.
[jira] [Created] (SPARK-48730) Support creating persistent SQL UDFs in parser
Allison Wang created SPARK-48730: Summary: Support creating persistent SQL UDFs in parser Key: SPARK-48730 URL: https://issues.apache.org/jira/browse/SPARK-48730 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Allison Wang
[jira] [Created] (SPARK-48729) Add a UserDefinedFunction interface to represent a SQL function
Allison Wang created SPARK-48729: Summary: Add a UserDefinedFunction interface to represent a SQL function Key: SPARK-48729 URL: https://issues.apache.org/jira/browse/SPARK-48729 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Allison Wang
[jira] [Created] (SPARK-48653) Fix Python data source error class references
Allison Wang created SPARK-48653: Summary: Fix Python data source error class references Key: SPARK-48653 URL: https://issues.apache.org/jira/browse/SPARK-48653 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Allison Wang Fix invalid error class references.
[jira] [Created] (SPARK-48497) Add user guide for batch data source write API
Allison Wang created SPARK-48497: Summary: Add user guide for batch data source write API Key: SPARK-48497 URL: https://issues.apache.org/jira/browse/SPARK-48497 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 4.0.0 Reporter: Allison Wang Add examples for batch data source write.
[jira] [Updated] (SPARK-48479) Support creating SQL functions in parser
[ https://issues.apache.org/jira/browse/SPARK-48479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-48479: - Summary: Support creating SQL functions in parser (was: Support ccreating SQL functions in parser) > Support creating SQL functions in parser > > > Key: SPARK-48479 > URL: https://issues.apache.org/jira/browse/SPARK-48479 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > > Add Spark SQL parser for creating SQL functions.
[jira] [Updated] (SPARK-48479) Support ccreating SQL functions in parser
[ https://issues.apache.org/jira/browse/SPARK-48479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-48479: - Summary: Support ccreating SQL functions in parser (was: Add support for creating SQL functions in parser) > Support ccreating SQL functions in parser > - > > Key: SPARK-48479 > URL: https://issues.apache.org/jira/browse/SPARK-48479 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > > Add Spark SQL parser for creating SQL functions.
[jira] [Created] (SPARK-48479) Add support for creating SQL functions in parser
Allison Wang created SPARK-48479: Summary: Add support for creating SQL functions in parser Key: SPARK-48479 URL: https://issues.apache.org/jira/browse/SPARK-48479 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Allison Wang Add Spark SQL parser for creating SQL functions.
[jira] [Created] (SPARK-48205) Remove the private[sql] modifier for Python data sources
Allison Wang created SPARK-48205: Summary: Remove the private[sql] modifier for Python data sources Key: SPARK-48205 URL: https://issues.apache.org/jira/browse/SPARK-48205 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Allison Wang To make it consistent with UDFs and UDTFs.
[jira] [Created] (SPARK-48064) Improve error messages for routine related errors
Allison Wang created SPARK-48064: Summary: Improve error messages for routine related errors Key: SPARK-48064 URL: https://issues.apache.org/jira/browse/SPARK-48064 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Allison Wang
[jira] [Created] (SPARK-48014) Change the makeFromJava error in EvaluatePython to a user-facing error
Allison Wang created SPARK-48014: Summary: Change the makeFromJava error in EvaluatePython to a user-facing error Key: SPARK-48014 URL: https://issues.apache.org/jira/browse/SPARK-48014 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Allison Wang
[jira] [Created] (SPARK-47921) Fix ExecuteJobTag creation in ExecuteHolder
Allison Wang created SPARK-47921: Summary: Fix ExecuteJobTag creation in ExecuteHolder Key: SPARK-47921 URL: https://issues.apache.org/jira/browse/SPARK-47921 Project: Spark Issue Type: Bug Components: Connect Affects Versions: 4.0.0 Reporter: Allison Wang
[jira] [Updated] (SPARK-47367) Support Python data source API with Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-47367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-47367: - Summary: Support Python data source API with Spark Connect (was: Support Python data source API in Spark Connect) > Support Python data source API with Spark Connect > - > > Key: SPARK-47367 > URL: https://issues.apache.org/jira/browse/SPARK-47367 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major >
[jira] [Updated] (SPARK-47367) Support Python data source API in Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-47367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-47367: - Summary: Support Python data source API in Spark Connect (was: Support Python data source API with Spark Connect) > Support Python data source API in Spark Connect > --- > > Key: SPARK-47367 > URL: https://issues.apache.org/jira/browse/SPARK-47367 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major >
[jira] [Created] (SPARK-47367) Support Python data source API with Spark Connect
Allison Wang created SPARK-47367: Summary: Support Python data source API with Spark Connect Key: SPARK-47367 URL: https://issues.apache.org/jira/browse/SPARK-47367 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Allison Wang
[jira] [Created] (SPARK-47346) Make daemon mode configurable when creating Python workers
Allison Wang created SPARK-47346: Summary: Make daemon mode configurable when creating Python workers Key: SPARK-47346 URL: https://issues.apache.org/jira/browse/SPARK-47346 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Allison Wang
[jira] [Updated] (SPARK-46973) Skip V2 table lookup when a table is in V1 table cache
[ https://issues.apache.org/jira/browse/SPARK-46973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-46973: - Description: Improve v2 table lookup performance when a table is already in the v1 table cache. > Skip V2 table lookup when a table is in V1 table cache > -- > > Key: SPARK-46973 > URL: https://issues.apache.org/jira/browse/SPARK-46973 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > Labels: pull-request-available > > Improve v2 table lookup performance when a table is already in the v1 table > cache.
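The lookup-order idea behind SPARK-46973 — consult the v1 table cache first and skip the v2 resolution path entirely on a hit — can be sketched in plain Python. All names here are hypothetical illustrations, not Spark internals:

```python
def lookup_table(name, v1_cache, load_v2_table):
    """Return the cached v1 table when present, skipping the
    (potentially expensive) v2 lookup; otherwise fall back to
    the v2 resolution path. `v1_cache` is a plain dict and
    `load_v2_table` a callable, both hypothetical stand-ins."""
    table = v1_cache.get(name)
    if table is not None:
        return table  # v1 cache hit: no v2 lookup is performed
    return load_v2_table(name)
```

The performance win comes from the early return: a cached table never triggers the fallback callable at all.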
[jira] [Updated] (SPARK-46973) Skip V2 table lookup when a table is in V1 table cache
[ https://issues.apache.org/jira/browse/SPARK-46973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-46973: - Summary: Skip V2 table lookup when a table is in V1 table cache (was: Add table cache for V2 tables) > Skip V2 table lookup when a table is in V1 table cache > -- > > Key: SPARK-46973 > URL: https://issues.apache.org/jira/browse/SPARK-46973 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > Labels: pull-request-available >
[jira] [Created] (SPARK-46973) Add table cache for V2 tables
Allison Wang created SPARK-46973: Summary: Add table cache for V2 tables Key: SPARK-46973 URL: https://issues.apache.org/jira/browse/SPARK-46973 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Allison Wang
[jira] [Created] (SPARK-46818) Improve error messages for range with non-foldable input
Allison Wang created SPARK-46818: Summary: Improve error messages for range with non-foldable input Key: SPARK-46818 URL: https://issues.apache.org/jira/browse/SPARK-46818 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Allison Wang
[jira] [Created] (SPARK-46618) Improve error messages for DATA_SOURCE_NOT_FOUND error
Allison Wang created SPARK-46618: Summary: Improve error messages for DATA_SOURCE_NOT_FOUND error Key: SPARK-46618 URL: https://issues.apache.org/jira/browse/SPARK-46618 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Allison Wang Improve the error messages for DATA_SOURCE_NOT_FOUND error.
[jira] [Created] (SPARK-46616) Disallow re-registration of statically registered data sources
Allison Wang created SPARK-46616: Summary: Disallow re-registration of statically registered data sources Key: SPARK-46616 URL: https://issues.apache.org/jira/browse/SPARK-46616 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Allison Wang This is a follow-up for SPARK-46522. Currently, Spark allows the re-registration of both statically and dynamically registered Python data sources. However, for (statically) registered Java/Scala data sources, Spark currently throws exceptions if users try to register a data source with the same name. We should make this behavior consistent: either allow re-registration of all statically loaded data sources, or disallow them.
[jira] [Created] (SPARK-46568) Python data source options should be a case insensitive dictionary
Allison Wang created SPARK-46568: Summary: Python data source options should be a case insensitive dictionary Key: SPARK-46568 URL: https://issues.apache.org/jira/browse/SPARK-46568 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Allison Wang Data source options are stored as a `CaseInsensitiveStringMap` in Scala; however, the behavior is inconsistent in Python: {code:java} class MyDataSource(DataSource): def __init__(self, options): self.api_key = options.get("API_KEY") # <- This is None spark.read.format(..).option("API_KEY", my_key).load(...){code} Currently, options will not have this "API_KEY" as everything is converted to lowercase on the Scala side. This can be confusing to users.
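The inconsistency above could be addressed with a case-insensitive mapping on the Python side. A minimal stand-alone sketch of the idea (not the actual PySpark class), lowercasing keys on both insert and lookup:

```python
class CaseInsensitiveDict(dict):
    """Dict that lowercases keys on insert and lookup, mirroring the
    CaseInsensitiveStringMap behavior on the Scala side (sketch only)."""

    def __init__(self, data=None):
        super().__init__()
        for k, v in (data or {}).items():
            self[k] = v  # routes through __setitem__, lowercasing the key

    def __setitem__(self, key, value):
        super().__setitem__(key.lower(), value)

    def __getitem__(self, key):
        return super().__getitem__(key.lower())

    def get(self, key, default=None):
        return super().get(key.lower(), default)

    def __contains__(self, key):
        return super().__contains__(key.lower())
```

With such a wrapper, `options.get("API_KEY")` and `options.get("api_key")` resolve to the same value, regardless of how the option was spelled in `.option(...)`.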
[jira] [Created] (SPARK-46565) Improve Python data source error classes and messages
Allison Wang created SPARK-46565: Summary: Improve Python data source error classes and messages Key: SPARK-46565 URL: https://issues.apache.org/jira/browse/SPARK-46565 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Allison Wang
[jira] [Updated] (SPARK-46540) Respect column names when Python data source read function outputs named Row objects
[ https://issues.apache.org/jira/browse/SPARK-46540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-46540: - Summary: Respect column names when Python data source read function outputs named Row objects (was: Respect named arguments when Python data source read function outputs Row objects) > Respect column names when Python data source read function outputs named Row > objects > > > Key: SPARK-46540 > URL: https://issues.apache.org/jira/browse/SPARK-46540 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major >
[jira] [Created] (SPARK-46540) Respects named arguments when Python data source read function outputs Row objects
Allison Wang created SPARK-46540: Summary: Respects named arguments when Python data source read function outputs Row objects Key: SPARK-46540 URL: https://issues.apache.org/jira/browse/SPARK-46540 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Allison Wang
[jira] [Updated] (SPARK-46540) Respect named arguments when Python data source read function outputs Row objects
[ https://issues.apache.org/jira/browse/SPARK-46540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-46540: - Summary: Respect named arguments when Python data source read function outputs Row objects (was: Respects named arguments when Python data source read function outputs Row objects) > Respect named arguments when Python data source read function outputs Row > objects > - > > Key: SPARK-46540 > URL: https://issues.apache.org/jira/browse/SPARK-46540 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major >
[jira] [Created] (SPARK-46522) Block Python data source registration with name conflicts
Allison Wang created SPARK-46522: Summary: Block Python data source registration with name conflicts Key: SPARK-46522 URL: https://issues.apache.org/jira/browse/SPARK-46522 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Allison Wang Users should not be allowed to register Python data sources with names that are the same as builtin or existing Scala/Java data sources.
[jira] [Created] (SPARK-46520) Support overwrite mode for Python data source write
Allison Wang created SPARK-46520: Summary: Support overwrite mode for Python data source write Key: SPARK-46520 URL: https://issues.apache.org/jira/browse/SPARK-46520 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Allison Wang Support the `overwrite` mode for Python data source
[jira] [Created] (SPARK-46452) Add a new API in DSv2 DataWriter to write an iterator of records
Allison Wang created SPARK-46452: Summary: Add a new API in DSv2 DataWriter to write an iterator of records Key: SPARK-46452 URL: https://issues.apache.org/jira/browse/SPARK-46452 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Allison Wang Add a new API that takes an iterator of records.
[jira] [Created] (SPARK-46375) Add documentation for Python data source API
Allison Wang created SPARK-46375: Summary: Add documentation for Python data source API Key: SPARK-46375 URL: https://issues.apache.org/jira/browse/SPARK-46375 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 4.0.0 Reporter: Allison Wang Add documentation (user guide) for Python data source API. Note the documentation should clarify the required dependency: pyarrow
[jira] [Created] (SPARK-46290) Change saveMode to overwrite for DataSourceWriter constructor
Allison Wang created SPARK-46290: Summary: Change saveMode to overwrite for DataSourceWriter constructor Key: SPARK-46290 URL: https://issues.apache.org/jira/browse/SPARK-46290 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Allison Wang
[jira] [Created] (SPARK-46273) Support INSERT INTO/OVERWRITE using DSv2 sources
Allison Wang created SPARK-46273: Summary: Support INSERT INTO/OVERWRITE using DSv2 sources Key: SPARK-46273 URL: https://issues.apache.org/jira/browse/SPARK-46273 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Allison Wang
[jira] [Created] (SPARK-46272) Support CTAS using DSv2 sources
Allison Wang created SPARK-46272: Summary: Support CTAS using DSv2 sources Key: SPARK-46272 URL: https://issues.apache.org/jira/browse/SPARK-46272 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Allison Wang
[jira] [Created] (SPARK-46253) Plan Python data source read using mapInArrow
Allison Wang created SPARK-46253: Summary: Plan Python data source read using mapInArrow Key: SPARK-46253 URL: https://issues.apache.org/jira/browse/SPARK-46253 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Allison Wang Instead of using a regular Python UDTF, we can actually use an Arrow UDF and plan the data source read using the mapInArrow operator.
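The mapInArrow operator has an iterator-in, iterator-out shape: a function receives an iterator of Arrow record batches and yields output batches, which suits streaming a data source read without materializing all rows. A toy model of that shape in plain Python, with lists of dicts standing in for Arrow batches (no pyspark/pyarrow required; all names are illustrative):

```python
def map_in_arrow_model(batch_fn, batches):
    """Toy model of mapInArrow semantics: feed the function an
    iterator of input batches and stream out whatever batches it
    yields, one at a time."""
    for out_batch in batch_fn(iter(batches)):
        yield out_batch

def datasource_read(_batches):
    # A data source read can ignore its input iterator and emit
    # generated rows batch by batch (hypothetical example).
    yield [{"id": 0}, {"id": 1}]
    yield [{"id": 2}]
```

Because both sides are generators, batches flow through lazily; that laziness is what makes the Arrow-based plan cheaper than collecting rows through a regular Python UDTF.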
[jira] [Updated] (SPARK-46057) Support SQL user-defined functions
[ https://issues.apache.org/jira/browse/SPARK-46057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-46057: - Description: This is an umbrella ticket to support SQL user-defined functions. (was: This is an umbrella ticket to support SQL user-defined functions in Spark.) > Support SQL user-defined functions > -- > > Key: SPARK-46057 > URL: https://issues.apache.org/jira/browse/SPARK-46057 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > > This is an umbrella ticket to support SQL user-defined functions.
[jira] [Created] (SPARK-46057) Support SQL user-defined functions
Allison Wang created SPARK-46057: Summary: Support SQL user-defined functions Key: SPARK-46057 URL: https://issues.apache.org/jira/browse/SPARK-46057 Project: Spark Issue Type: Umbrella Components: SQL Affects Versions: 4.0.0 Reporter: Allison Wang This is an umbrella ticket to support SQL user-defined functions in Spark.
[jira] [Created] (SPARK-46043) Support create table using DSv2 sources
Allison Wang created SPARK-46043: Summary: Support create table using DSv2 sources Key: SPARK-46043 URL: https://issues.apache.org/jira/browse/SPARK-46043 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Allison Wang Support CREATE TABLE ... USING DSv2 sources.
[jira] [Created] (SPARK-46013) Improve basic datasource examples
Allison Wang created SPARK-46013: Summary: Improve basic datasource examples Key: SPARK-46013 URL: https://issues.apache.org/jira/browse/SPARK-46013 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 4.0.0 Reporter: Allison Wang We should improve the Python examples on this page: [https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html] (basic_datasource_examples.py)
[jira] [Created] (SPARK-45940) Add InputPartition to DataSourceReader interface
Allison Wang created SPARK-45940: Summary: Add InputPartition to DataSourceReader interface Key: SPARK-45940 URL: https://issues.apache.org/jira/browse/SPARK-45940 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Allison Wang Add InputPartition class and make the partitions method return a list of input partitions.
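The shape of the interface described — a reader whose `partitions()` method returns a list of input-partition objects, and whose `read(partition)` produces the rows of one partition — can be sketched as a toy stand-in in plain Python (not the real `pyspark.sql.datasource` classes; class and method names here mirror the ticket's description but are illustrative):

```python
class InputPartition:
    """Toy descriptor of one split of the data; real partitions would
    carry whatever state the reader needs (offsets, file paths, ...)."""
    def __init__(self, start, end):
        self.start, self.end = start, end


class RangeReader:
    """Toy reader: partitions() describes the splits up front so they
    can be distributed, and read() produces the rows of one split."""
    def partitions(self):
        return [InputPartition(0, 2), InputPartition(2, 4)]

    def read(self, partition):
        for i in range(partition.start, partition.end):
            yield (i,)
```

Returning a list of partition objects (rather than a bare count) lets each executor receive exactly the state it needs to read its own split independently.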
[jira] [Commented] (SPARK-45861) Add user guide for dataframe creation
[ https://issues.apache.org/jira/browse/SPARK-45861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17786487#comment-17786487 ] Allison Wang commented on SPARK-45861: -- [~panbingkun] again, thanks for working on this. Let me give you more details. When people search on Google for, say, "spark create dataframe", there are many results, one of them being the PySpark documentation for createDataFrame. But there are many other ways to create a DataFrame: from various data sources (CSV, JDBC, Parquet, etc.), from a pandas DataFrame, from `spark.sql`, and so on. We want to create a new documentation page under `{*}User Guides{*}` that explains all the ways to create a Spark DataFrame. It differs from the quickstart in that the user guide will provide more comprehensive examples. Feel free to look at the results when you search "spark create dataframe" or even "create dataframe" for more inspiration. cc [~afolting] [~smilegator] > Add user guide for dataframe creation > - > > Key: SPARK-45861 > URL: https://issues.apache.org/jira/browse/SPARK-45861 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > Attachments: screenshot-1.png, screenshot-2.png > > > Add a simple user guide for data frame creation. > This user guide should cover the following APIs: > # df.createDataFrame > # spark.read.format(...) (can be csv, json, parquet
[jira] [Created] (SPARK-45931) Refine docstring of `mapInPandas`
Allison Wang created SPARK-45931: Summary: Refine docstring of `mapInPandas` Key: SPARK-45931 URL: https://issues.apache.org/jira/browse/SPARK-45931 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 4.0.0 Reporter: Allison Wang Refine the docstring of the mapInPandas function.
[jira] [Created] (SPARK-45930) Allow non-deterministic Python UDFs in MapInPandas/MapInArrow
Allison Wang created SPARK-45930: Summary: Allow non-deterministic Python UDFs in MapInPandas/MapInArrow Key: SPARK-45930 URL: https://issues.apache.org/jira/browse/SPARK-45930 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Allison Wang Currently, if a Python UDF is non-deterministic, the analyzer fails with this error: [INVALID_NON_DETERMINISTIC_EXPRESSIONS] The operator expects a deterministic expression, but the actual expression is "pyUDF()", "a". SQLSTATE: 42K0E;
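The analyzer rule described above can be modeled conceptually as follows. This is a toy sketch, not Spark's analyzer: `Expr`, `check_operator`, and the `allows_non_deterministic` flag are illustrative names for the idea that MapInPandas/MapInArrow would opt out of the determinism check.

```python
# Conceptual model of the determinism check the ticket wants to relax.
# All names here are hypothetical stand-ins for Spark analyzer internals.

class Expr:
    def __init__(self, name, deterministic=True):
        self.name = name
        self.deterministic = deterministic

def check_operator(expressions, allows_non_deterministic=False):
    # An operator like MapInPandas/MapInArrow would pass
    # allows_non_deterministic=True once the restriction is lifted.
    for e in expressions:
        if not e.deterministic and not allows_non_deterministic:
            raise ValueError(
                "[INVALID_NON_DETERMINISTIC_EXPRESSIONS] The operator expects "
                f'a deterministic expression, but the actual expression is "{e.name}".')

exprs = [Expr("pyUDF()", deterministic=False)]
```

With the flag set, the same expression list passes the check, which is the behavior change the ticket proposes for MapInPandas/MapInArrow.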
[jira] [Updated] (SPARK-45927) Update `path` handling in Python data source
[ https://issues.apache.org/jira/browse/SPARK-45927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-45927: - Summary: Update `path` handling in Python data source (was: Remove `path` from data source constructor) > Update `path` handling in Python data source > > > Key: SPARK-45927 > URL: https://issues.apache.org/jira/browse/SPARK-45927 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > > We should not make `path` an argument to the constructor of the API.
[jira] [Created] (SPARK-45927) Remove `path` from data source constructor
Allison Wang created SPARK-45927: Summary: Remove `path` from data source constructor Key: SPARK-45927 URL: https://issues.apache.org/jira/browse/SPARK-45927 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Allison Wang We should not make `path` an argument to the constructor of the API.
[jira] [Created] (SPARK-45914) Support `commit` and `abort` API for Python data source write
Allison Wang created SPARK-45914: Summary: Support `commit` and `abort` API for Python data source write Key: SPARK-45914 URL: https://issues.apache.org/jira/browse/SPARK-45914 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Allison Wang Support `commit` and `abort` API for Python data source write.
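The commit/abort lifecycle the ticket asks for can be sketched in plain Python. This is a hedged model of the usual two-phase write protocol, not the actual PySpark API: `DataSourceWriter`, `write`, `commit`, `abort`, and `run_write` are illustrative names chosen to mirror the ticket's wording.

```python
# Toy model of a two-phase data source write: per-task write() produces
# commit messages; the driver then calls commit() on success or abort()
# on failure. Names are assumptions based on the ticket text.

class DataSourceWriter:
    def write(self, iterator):
        """Runs once per task; returns a commit message on success."""
        return sum(1 for _ in iterator)  # e.g. number of rows written

    def commit(self, messages):
        """Called on the driver once every task has succeeded."""
        return {"status": "committed", "rows": sum(messages)}

    def abort(self, messages):
        """Called on the driver if any task fails, to clean up partial output."""
        return {"status": "aborted"}

def run_write(writer, partitions, fail=False):
    # Simulates the driver-side orchestration of the protocol.
    messages = []
    try:
        for part in partitions:
            messages.append(writer.write(iter(part)))
        if fail:  # simulate a task failure after some writes
            raise RuntimeError("task failed")
    except RuntimeError:
        return writer.abort(messages)
    return writer.commit(messages)
```

The design point is that `abort` receives the messages collected so far, so an implementation can locate and delete any partially written output.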
[jira] [Updated] (SPARK-45525) Initial support for Python data source write API
[ https://issues.apache.org/jira/browse/SPARK-45525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-45525: - Description: Add a new command and logical rules (similar to V1Writes and V2Writes) to support Python data source write. (was: Support for Python data source write API) > Initial support for Python data source write API > > > Key: SPARK-45525 > URL: https://issues.apache.org/jira/browse/SPARK-45525 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > > Add a new command and logical rules (similar to V1Writes and V2Writes) to > support Python data source write.
[jira] [Updated] (SPARK-45600) Make Python data source registration session level
[ https://issues.apache.org/jira/browse/SPARK-45600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-45600: - Description: Currently, registered data sources are stored in `sharedState` and can be accessed across multiple sessions. This, however, will not work with Spark Connect. We should make this registration session level, and support static registration (e.g. using pip install) in the future. (was: Currently we have added a few instance variables to store information for Python data source reader. We should have a dedicated reader class for Python data source to make the current DataFrameReader clean.) > Make Python data source registration session level > -- > > Key: SPARK-45600 > URL: https://issues.apache.org/jira/browse/SPARK-45600 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > > Currently, registered data sources are stored in `sharedState` and can be > accessed across multiple sessions. This, however, will not work with Spark > Connect. We should make this registration session level, and support static > registration (e.g. using pip install) in the future.
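The shared-state versus session-level distinction above can be made concrete with a toy model. `Session`, `shared_registry`, and `data_sources` below are illustrative stand-ins, not Spark classes; the point is only that a per-instance registry keeps one session's registrations invisible to another.

```python
# Toy model of the registration scoping change: a registry per session
# instead of one shared across sessions. Names are hypothetical.

class Session:
    shared_registry = {}            # old behavior: one dict for all sessions

    def __init__(self):
        self.data_sources = {}      # proposed: session-scoped registry

    def register(self, name, cls):
        self.data_sources[name] = cls

    def lookup(self, name):
        return self.data_sources.get(name)

s1, s2 = Session(), Session()
s1.register("my-source", object)
```

With session scoping, `s2` never sees `s1`'s registration, which is the isolation property Spark Connect needs.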
[jira] [Updated] (SPARK-45600) Make data source registration session level
[ https://issues.apache.org/jira/browse/SPARK-45600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-45600: - Summary: Make data source registration session level (was: Separate the Python data source logic from DataFrameReader) > Make data source registration session level > --- > > Key: SPARK-45600 > URL: https://issues.apache.org/jira/browse/SPARK-45600 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > > Currently we have added a few instance variables to store information for > Python data source reader. We should have a dedicated reader class for Python > data source to make the current DataFrameReader clean.
[jira] [Updated] (SPARK-45600) Make Python data source registration session level
[ https://issues.apache.org/jira/browse/SPARK-45600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-45600: - Summary: Make Python data source registration session level (was: Make data source registration session level) > Make Python data source registration session level > -- > > Key: SPARK-45600 > URL: https://issues.apache.org/jira/browse/SPARK-45600 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > > Currently we have added a few instance variables to store information for > Python data source reader. We should have a dedicated reader class for Python > data source to make the current DataFrameReader clean.
[jira] [Created] (SPARK-45865) Add user guide for window operations
Allison Wang created SPARK-45865: Summary: Add user guide for window operations Key: SPARK-45865 URL: https://issues.apache.org/jira/browse/SPARK-45865 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 4.0.0 Reporter: Allison Wang Add a simple user guide for window operations.
[jira] [Created] (SPARK-45864) Add user guide for groupby and aggregate
Allison Wang created SPARK-45864: Summary: Add user guide for groupby and aggregate Key: SPARK-45864 URL: https://issues.apache.org/jira/browse/SPARK-45864 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 4.0.0 Reporter: Allison Wang Add a simple user guide to showcase common DataFrame operations involving group by and aggregate functions (min, max, count, sum, etc.)
[jira] [Created] (SPARK-45863) Add user guide for column selections
Allison Wang created SPARK-45863: Summary: Add user guide for column selections Key: SPARK-45863 URL: https://issues.apache.org/jira/browse/SPARK-45863 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 4.0.0 Reporter: Allison Wang Add a simple user guide for column selections in PySpark. This should cover the following APIs: lit, df.col, and common column operations such as removing a column from a data frame, adding new columns, dropping a duplicate column, etc.
[jira] [Created] (SPARK-45862) Add user guide for basic dataframe operations
Allison Wang created SPARK-45862: Summary: Add user guide for basic dataframe operations Key: SPARK-45862 URL: https://issues.apache.org/jira/browse/SPARK-45862 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 4.0.0 Reporter: Allison Wang Add a simple user guide for basic DataFrame operations. This user guide should include the following APIs: select, filter, collect, show
[jira] [Created] (SPARK-45861) Add user guide for dataframe creation
Allison Wang created SPARK-45861: Summary: Add user guide for dataframe creation Key: SPARK-45861 URL: https://issues.apache.org/jira/browse/SPARK-45861 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 4.0.0 Reporter: Allison Wang Add a simple user guide for data frame creation. This user guide should cover the following APIs: # df.createDataFrame # spark.read.format(...) (can be csv, json, parquet
[jira] [Created] (SPARK-45783) Improve exception message when no remote url is set
Allison Wang created SPARK-45783: Summary: Improve exception message when no remote url is set Key: SPARK-45783 URL: https://issues.apache.org/jira/browse/SPARK-45783 Project: Spark Issue Type: Improvement Components: Connect, PySpark Affects Versions: 3.5.0, 4.0.0 Reporter: Allison Wang When "SPARK_CONNECT_MODE_ENABLED" is set but no Spark remote URL is provided, PySpark currently throws this exception: AttributeError: 'NoneType' object has no attribute 'startswith' We should improve this.
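The `AttributeError` above comes from calling `.startswith(...)` on a URL that is `None`. A sketch of a friendlier validation is below; `validate_remote_url` is a hypothetical helper (the `sc://` scheme is what Spark Connect URLs use, but the exact check and messages here are assumptions).

```python
# Hedged sketch of a fail-fast URL check. The function name and messages
# are illustrative; only the failure mode (None.startswith) is from the ticket.

def validate_remote_url(url):
    if url is None:
        # Raise an actionable error instead of letting None.startswith(...)
        # surface as a confusing AttributeError.
        raise RuntimeError(
            "Spark Connect mode is enabled but no remote URL is set; "
            "set the remote URL, e.g. SPARK_REMOTE=sc://localhost")
    if not url.startswith("sc://"):
        raise ValueError(f"Invalid Spark Connect URL: {url!r}")
    return url
```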
[jira] [Created] (SPARK-45773) Refine docstring of `SparkSession.builder.config`
Allison Wang created SPARK-45773: Summary: Refine docstring of `SparkSession.builder.config` Key: SPARK-45773 URL: https://issues.apache.org/jira/browse/SPARK-45773 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 4.0.0 Reporter: Allison Wang Refine the docstring of SparkSession.builder.config
[jira] [Updated] (SPARK-45765) Improve error messages when loading multiple paths in PySpark
[ https://issues.apache.org/jira/browse/SPARK-45765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-45765: - Description: Currently, the error message is super confusing when a user tries to load multiple paths incorrectly. For example, `spark.read.format("json").load("p1", "p2")` will have this error: An error occurred while calling o36.load. : org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] Failed to find the data source: p2. Please find packages at `[https://spark.apache.org/third-party-projects.html]`. SQLSTATE: 42K02 This can be confusing, but it is a valid error message, as "p2" is treated as the `format` argument of the load() method. was: Currently, the error message is super confusing when a user tries to load multiple paths incorrectly. For example, `spark.read.format("json").load("p1", "p2")` will have this error: An error occurred while calling o36.load. : org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] Failed to find the data source: p2. Please find packages at `https://spark.apache.org/third-party-projects.html`. SQLSTATE: 42K02 We should fix this. > Improve error messages when loading multiple paths in PySpark > - > > Key: SPARK-45765 > URL: https://issues.apache.org/jira/browse/SPARK-45765 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > > Currently, the error message is super confusing when a user tries to load > multiple paths incorrectly. > For example, `spark.read.format("json").load("p1", "p2")` will have this > error: > An error occurred while calling o36.load. > : org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] > Failed to find the data source: p2. Please find packages at > `[https://spark.apache.org/third-party-projects.html]`. SQLSTATE: 42K02 > This can be confusing, but it is a valid error message, as "p2" is treated > as the `format` argument of the load() method.
[jira] [Resolved] (SPARK-45765) Improve error messages when loading multiple paths in PySpark
[ https://issues.apache.org/jira/browse/SPARK-45765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang resolved SPARK-45765. -- Resolution: Invalid > Improve error messages when loading multiple paths in PySpark > - > > Key: SPARK-45765 > URL: https://issues.apache.org/jira/browse/SPARK-45765 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > > Currently, the error message is super confusing when a user tries to load > multiple paths incorrectly. > For example, `spark.read.format("json").load("p1", "p2")` will have this > error: > An error occurred while calling o36.load. > : org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] > Failed to find the data source: p2. Please find packages at > `https://spark.apache.org/third-party-projects.html`. SQLSTATE: 42K02 > We should fix this.
[jira] [Created] (SPARK-45765) Improve error messages when loading multiple paths in PySpark
Allison Wang created SPARK-45765: Summary: Improve error messages when loading multiple paths in PySpark Key: SPARK-45765 URL: https://issues.apache.org/jira/browse/SPARK-45765 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 4.0.0 Reporter: Allison Wang Currently, the error message is super confusing when a user tries to load multiple paths incorrectly. For example, `spark.read.format("json").load("p1", "p2")` will have this error: An error occurred while calling o36.load. : org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] Failed to find the data source: p2. Please find packages at `https://spark.apache.org/third-party-projects.html`. SQLSTATE: 42K02 We should fix this.
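The confusion comes from `load`'s positional signature: the second positional argument is the format, not another path. A minimal pure-Python model of that signature (a sketch, not the real `DataFrameReader`) shows why `"p2"` is swallowed, and that passing a list is the way to load multiple paths:

```python
# Toy model of DataFrameReader.load(path=None, format=None, **options),
# illustrating why load("p1", "p2") reads "p2" as the format.

def load(path=None, format=None, **options):
    paths = path if isinstance(path, list) else [path]
    return {"paths": paths, "format": format}

wrong = load("p1", "p2")    # "p2" silently becomes the format
right = load(["p1", "p2"])  # pass a list to load multiple paths
```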
[jira] [Updated] (SPARK-45764) Make code block copyable
[ https://issues.apache.org/jira/browse/SPARK-45764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-45764: - Description: We should consider adding a copy button next to the pyspark code blocks. For example this plugin: [https://sphinx-copybutton.readthedocs.io/en/latest/] was: We should consider For example this plugin: [https://sphinx-copybutton.readthedocs.io/en/latest/] > Make code block copyable > > > Key: SPARK-45764 > URL: https://issues.apache.org/jira/browse/SPARK-45764 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > > We should consider adding a copy button next to the pyspark code blocks. > For example this plugin: [https://sphinx-copybutton.readthedocs.io/en/latest/]
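Wiring up the linked plugin is a small Sphinx `conf.py` change. The sketch below follows sphinx-copybutton's documented options and assumes the package is installed (`pip install sphinx-copybutton`); the doc path comment is illustrative.

```python
# docs/source/conf.py (illustrative path) -- enable sphinx-copybutton

extensions = [
    # ... existing Sphinx extensions ...
    "sphinx_copybutton",
]

# Optionally strip prompts (">>> ", "$ ") from copied text:
copybutton_prompt_text = r">>> |\$ "
copybutton_prompt_is_regexp = True
```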
[jira] [Commented] (SPARK-45764) Make code block copyable
[ https://issues.apache.org/jira/browse/SPARK-45764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781887#comment-17781887 ] Allison Wang commented on SPARK-45764: -- cc [~podongfeng] WDYT? > Make code block copyable > > > Key: SPARK-45764 > URL: https://issues.apache.org/jira/browse/SPARK-45764 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > > We should consider adding a copy button next to the pyspark code blocks. > For example this plugin: [https://sphinx-copybutton.readthedocs.io/en/latest/]
[jira] [Created] (SPARK-45764) Make code block copyable
Allison Wang created SPARK-45764: Summary: Make code block copyable Key: SPARK-45764 URL: https://issues.apache.org/jira/browse/SPARK-45764 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 4.0.0 Reporter: Allison Wang We should consider, for example, this plugin: [https://sphinx-copybutton.readthedocs.io/en/latest/]
[jira] [Updated] (SPARK-45713) Support registering Python data sources
[ https://issues.apache.org/jira/browse/SPARK-45713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-45713: - Description: Support registering Python data sources. Users can register a Python data source and later reference it by its name.
{code:java}
class MyDataSource(DataSource):
    @classmethod
    def name(cls):
        return "my-data-source"

spark.dataSource.register(MyDataSource){code}
Users can then use the name of the data source as the format (will be supported in SPARK-45639)
{code:java}
spark.read.format("my-data-source").load(){code}
was: Support registering Python data sources. Users can register a Python data source and later reference it by its name.
{code:java}
class MyDataSource(DataSource):
    @classmethod
    def name(cls):
        return "my-data-source"

spark.dataSource.register(MyDataSource){code}
Users can then use the name of the data source as the format SPARK-45639
{code:java}
spark.read.format("my-data-source").load(){code}
> Support registering Python data sources > --- > > Key: SPARK-45713 > URL: https://issues.apache.org/jira/browse/SPARK-45713 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > > Support registering Python data sources. > Users can register a Python data source and later reference it by its name.
> {code:java}
> class MyDataSource(DataSource):
>     @classmethod
>     def name(cls):
>         return "my-data-source"
>
> spark.dataSource.register(MyDataSource){code}
> Users can then use the name of the data source as the format (will be > supported in SPARK-45639)
> {code:java}
> spark.read.format("my-data-source").load(){code}
[jira] [Updated] (SPARK-45713) Support registering Python data sources
[ https://issues.apache.org/jira/browse/SPARK-45713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-45713: - Description: Support registering Python data sources. Users can register a Python data source and later reference it by its name.
{code:java}
class MyDataSource(DataSource):
    @classmethod
    def name(cls):
        return "my-data-source"

spark.dataSource.register(MyDataSource){code}
Users can then use the name of the data source as the format SPARK-45639
{code:java}
spark.read.format("my-data-source").load(){code}
was: Support registering Python data sources. Users can register a Python data source and later reference it by its name.
{code:java}
class MyDataSource(DataSource):
    @classmethod
    def name(cls):
        return "my-data-source"

spark.dataSource.register(MyDataSource){code}
Users can then use the name of the data source as the format
{code:java}
spark.read.format("my-data-source").load(){code}
> Support registering Python data sources > --- > > Key: SPARK-45713 > URL: https://issues.apache.org/jira/browse/SPARK-45713 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > > Support registering Python data sources. > Users can register a Python data source and later reference it by its name.
> {code:java}
> class MyDataSource(DataSource):
>     @classmethod
>     def name(cls):
>         return "my-data-source"
>
> spark.dataSource.register(MyDataSource){code}
> Users can then use the name of the data source as the format SPARK-45639
> {code:java}
> spark.read.format("my-data-source").load(){code}
[jira] [Created] (SPARK-45713) Support registering Python data sources
Allison Wang created SPARK-45713: Summary: Support registering Python data sources Key: SPARK-45713 URL: https://issues.apache.org/jira/browse/SPARK-45713 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Allison Wang Support registering Python data sources. Users can register a Python data source and later reference it by its name.
{code:java}
class MyDataSource(DataSource):
    @classmethod
    def name(cls):
        return "my-data-source"

spark.dataSource.register(MyDataSource){code}
Users can then use the name of the data source as the format
{code:java}
spark.read.format("my-data-source").load(){code}
[jira] [Updated] (SPARK-45639) Support loading Python data sources in DataFrameReader
[ https://issues.apache.org/jira/browse/SPARK-45639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-45639: - Description: Allow users to read from a Python data source using `spark.read.format(...).load()` in PySpark. For example, users can extend the DataSource and the DataSourceReader classes to create their own Python data source reader and use them in PySpark:
{code:java}
class MyReader(DataSourceReader):
    def read(self, partition):
        yield (0, 1)

class MyDataSource(DataSource):
    def schema(self):
        return "id INT, value INT"

    def reader(self, schema):
        return MyReader()

df = spark.read.format("MyDataSource").load()
df.show()
+---+-----+
| id|value|
+---+-----+
|  0|    1|
+---+-----+
{code}
was: Allow users to read from a Python data source using `spark.read.format(...).load()` For example, users can extend the DataSource and the DataSourceReader classes to create their own Python data source reader and use them in PySpark:
{code:java}
class MyReader(DataSourceReader):
    def read(self, partition):
        yield (0, 1)

class MyDataSource(DataSource):
    def schema(self):
        return "id INT, value INT"

    def reader(self, schema):
        return MyReader()

df = spark.read.format("MyDataSource").load()
df.show()
+---+-----+
| id|value|
+---+-----+
|  0|    1|
+---+-----+
{code}
> Support loading Python data sources in DataFrameReader > -- > > Key: SPARK-45639 > URL: https://issues.apache.org/jira/browse/SPARK-45639 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > > Allow users to read from a Python data source using > `spark.read.format(...).load()` in PySpark. > For example, users can extend the DataSource and the DataSourceReader classes > to create their own Python data source reader and use them in PySpark:
> {code:java}
> class MyReader(DataSourceReader):
>     def read(self, partition):
>         yield (0, 1)
>
> class MyDataSource(DataSource):
>     def schema(self):
>         return "id INT, value INT"
>
>     def reader(self, schema):
>         return MyReader()
>
> df = spark.read.format("MyDataSource").load()
> df.show()
> +---+-----+
> | id|value|
> +---+-----+
> |  0|    1|
> +---+-----+
> {code}
[jira] [Updated] (SPARK-45639) Support loading Python data sources in DataFrameReader
[ https://issues.apache.org/jira/browse/SPARK-45639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-45639: - Summary: Support loading Python data sources in DataFrameReader (was: Support Python data source in DataFrameReader) > Support loading Python data sources in DataFrameReader > -- > > Key: SPARK-45639 > URL: https://issues.apache.org/jira/browse/SPARK-45639 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > > Allow users to read from a Python data source using > `spark.read.format(...).load()` > For example, users can extend the DataSource and the DataSourceReader classes > to create their own Python data source reader and use them in PySpark:
> {code:java}
> class MyReader(DataSourceReader):
>     def read(self, partition):
>         yield (0, 1)
>
> class MyDataSource(DataSource):
>     def schema(self):
>         return "id INT, value INT"
>
>     def reader(self, schema):
>         return MyReader()
>
> df = spark.read.format("MyDataSource").load()
> df.show()
> +---+-----+
> | id|value|
> +---+-----+
> |  0|    1|
> +---+-----+
> {code}
[jira] [Created] (SPARK-45654) Add Python data source write API
Allison Wang created SPARK-45654: Summary: Add Python data source write API Key: SPARK-45654 URL: https://issues.apache.org/jira/browse/SPARK-45654 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Allison Wang Add Python data source write API in datasource.py
[jira] [Updated] (SPARK-45639) Support Python data source in DataFrameReader
[ https://issues.apache.org/jira/browse/SPARK-45639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-45639: - Description: Allow users to read from a Python data source using `spark.read.format(...).load()` For example, users can extend the DataSource and the DataSourceReader classes to create their own Python data source reader and use them in PySpark:
{code:java}
class MyReader(DataSourceReader):
    def read(self, partition):
        yield (0, 1)

class MyDataSource(DataSource):
    def schema(self):
        return "id INT, value INT"

    def reader(self, schema):
        return MyReader()

df = spark.read.format("MyDataSource").load()
df.show()
+---+-----+
| id|value|
+---+-----+
|  0|    1|
+---+-----+
{code}
was: Allow users to read from a Python data source using `spark.read.format(...).load()` For example, users can extend the DataSource and the DataSourceReader classes to create their own Python data source reader and use them in PySpark:
{code:java}
class MyReader(DataSourceReader):
    def read(self, partition):
        yield (0, 1)

class MyDataSource(DataSource):
    def schema(self):
        return "id INT, value INT"

    def reader(self, schema):
        return MyReader()

df = spark.read.format("MyDataSource").load()
df.show()
df.show()
+---+-----+
| id|value|
+---+-----+
|  0|    1|
+---+-----+
{code}
> Support Python data source in DataFrameReader > - > > Key: SPARK-45639 > URL: https://issues.apache.org/jira/browse/SPARK-45639 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > > Allow users to read from a Python data source using > `spark.read.format(...).load()` > For example, users can extend the DataSource and the DataSourceReader classes > to create their own Python data source reader and use them in PySpark:
> {code:java}
> class MyReader(DataSourceReader):
>     def read(self, partition):
>         yield (0, 1)
>
> class MyDataSource(DataSource):
>     def schema(self):
>         return "id INT, value INT"
>
>     def reader(self, schema):
>         return MyReader()
>
> df = spark.read.format("MyDataSource").load()
> df.show()
> +---+-----+
> | id|value|
> +---+-----+
> |  0|    1|
> +---+-----+
> {code}
[jira] [Updated] (SPARK-45639) Support Python data source in DataFrameReader
[ https://issues.apache.org/jira/browse/SPARK-45639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-45639:
----------------------------------
Description:
Allow users to read from a Python data source using `spark.read.format(...).load()`.

For example, users can extend the DataSource and the DataSourceReader classes to create their own Python data source reader and use them in PySpark:

{code:java}
class MyReader(DataSourceReader):
    def read(self, partition):
        yield (0, 1)


class MyDataSource(DataSource):
    def schema(self):
        return "id INT, value INT"

    def reader(self, schema):
        return MyReader()


df = spark.read.format("MyDataSource").load()
df.show()
df.show()
+---+-----+
| id|value|
+---+-----+
|  0|    1|
+---+-----+
{code}

was:
Allow users to read from a Python data source using `spark.read.format(...).load()`.

For example, users can extend the DataSource and the DataSourceReader classes to create their own Python data source reader and use them in PySpark:

```python
class MyReader(DataSourceReader):
    def read(self, partition):
        yield (0, 1)


class MyDataSource(DataSource):
    def schema(self):
        return "id INT, value INT"

    def reader(self, schema):
        return MyReader()


df = spark.read.format("MyDataSource").load()
df.show()
+---+-----+
| id|value|
+---+-----+
|  0|    1|
+---+-----+
```

> Support Python data source in DataFrameReader
> ---------------------------------------------
>
>                 Key: SPARK-45639
>                 URL: https://issues.apache.org/jira/browse/SPARK-45639
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 4.0.0
>            Reporter: Allison Wang
>            Priority: Major
>
> Allow users to read from a Python data source using
> `spark.read.format(...).load()`
> For example, users can extend the DataSource and the DataSourceReader
> classes to create their own Python data source reader and use them in PySpark:
> {code:java}
> class MyReader(DataSourceReader):
>     def read(self, partition):
>         yield (0, 1)
>
> class MyDataSource(DataSource):
>     def schema(self):
>         return "id INT, value INT"
>
>     def reader(self, schema):
>         return MyReader()
>
> df = spark.read.format("MyDataSource").load()
> df.show()
> df.show()
> +---+-----+
> | id|value|
> +---+-----+
> |  0|    1|
> +---+-----+
> {code}
[jira] [Updated] (SPARK-45639) Support Python data source in DataFrameReader
[ https://issues.apache.org/jira/browse/SPARK-45639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-45639:
----------------------------------
Description:
Allow users to read from a Python data source using `spark.read.format(...).load()`.

For example, users can extend the DataSource and the DataSourceReader classes to create their own Python data source reader and use them in PySpark:

```python
class MyReader(DataSourceReader):
    def read(self, partition):
        yield (0, 1)


class MyDataSource(DataSource):
    def schema(self):
        return "id INT, value INT"

    def reader(self, schema):
        return MyReader()


df = spark.read.format("MyDataSource").load()
df.show()
+---+-----+
| id|value|
+---+-----+
|  0|    1|
+---+-----+
```

was: Allow users to read from a Python data source using `spark.read.format(...).load()`

> Support Python data source in DataFrameReader
> ---------------------------------------------
>
>                 Key: SPARK-45639
>                 URL: https://issues.apache.org/jira/browse/SPARK-45639
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 4.0.0
>            Reporter: Allison Wang
>            Priority: Major
>
> Allow users to read from a Python data source using
> `spark.read.format(...).load()`
> For example, users can extend the DataSource and the DataSourceReader
> classes to create their own Python data source reader and use them in PySpark:
> ```python
> class MyReader(DataSourceReader):
>     def read(self, partition):
>         yield (0, 1)
>
> class MyDataSource(DataSource):
>     def schema(self):
>         return "id INT, value INT"
>
>     def reader(self, schema):
>         return MyReader()
>
> df = spark.read.format("MyDataSource").load()
> df.show()
> +---+-----+
> | id|value|
> +---+-----+
> |  0|    1|
> +---+-----+
> ```
[jira] [Created] (SPARK-45639) Support Python data source in DataFrameReader
Allison Wang created SPARK-45639: Summary: Support Python data source in DataFrameReader Key: SPARK-45639 URL: https://issues.apache.org/jira/browse/SPARK-45639 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Allison Wang Allow users to read from a Python data source using `spark.read.format(...).load()` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45524) Initial support for Python data source read API
[ https://issues.apache.org/jira/browse/SPARK-45524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-45524: - Description: Add API for data source and data source reader and add Catalyst + execution support. was:Support Python data source API for reading data. > Initial support for Python data source read API > --- > > Key: SPARK-45524 > URL: https://issues.apache.org/jira/browse/SPARK-45524 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > Labels: pull-request-available > > Add API for data source and data source reader and add Catalyst + execution > support. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
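The two-level contract the ticket above describes can be sketched in plain Python: a DataSource declares a schema and hands out a reader, and the reader yields rows per partition. The class and method names mirror the example quoted under SPARK-45639; the `collect()` runner below is only a stand-in for the "Catalyst + execution support" side, not part of the real API.

```python
from abc import ABC, abstractmethod


class DataSourceReader(ABC):
    @abstractmethod
    def read(self, partition):
        """Yield tuples of column values for one input partition."""


class DataSource(ABC):
    @abstractmethod
    def schema(self):
        """Return the output schema as a DDL string."""

    @abstractmethod
    def reader(self, schema):
        """Return a DataSourceReader for the given schema."""


class MyReader(DataSourceReader):
    def read(self, partition):
        yield (0, 1)


class MyDataSource(DataSource):
    def schema(self):
        return "id INT, value INT"

    def reader(self, schema):
        return MyReader()


def collect(source, num_partitions=1):
    # Stand-in for planning + execution: build one reader, run it per partition.
    reader = source.reader(source.schema())
    return [row for p in range(num_partitions) for row in reader.read(p)]
```

With one partition, `collect(MyDataSource())` produces the single `(0, 1)` row from the SPARK-45639 example; with more partitions, one row per partition.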
[jira] [Commented] (SPARK-45023) SPIP: Python Stored Procedures
[ https://issues.apache.org/jira/browse/SPARK-45023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1954#comment-1954 ] Allison Wang commented on SPARK-45023: -- [~abhinavofficial] this proposal is on hold, given the feedback received from the SPIP. > SPIP: Python Stored Procedures > -- > > Key: SPARK-45023 > URL: https://issues.apache.org/jira/browse/SPARK-45023 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > > Stored procedures are an extension of the ANSI SQL standard. They play a > crucial role in improving the capabilities of SQL by encapsulating complex > logic into reusable routines. > This proposal aims to extend Spark SQL by introducing support for stored > procedures, starting with Python as the procedural language. This addition > will allow users to execute procedural programs, leveraging programming > constructs of Python to perform tasks with complex logic. Additionally, users > can persist these procedural routines in catalogs such as HMS for future > reuse. By providing this functionality, we intend to seamlessly empower Spark > users to integrate with Python routines within their SQL workflows. > {*}SPIP{*}: > [https://docs.google.com/document/d/1ce2EZrf2BxHu7TjfGn4TgToK3TBYYzRkmsIVcfmkNzE/edit?usp=sharing] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45023) SPIP: Python Stored Procedures
[ https://issues.apache.org/jira/browse/SPARK-45023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang resolved SPARK-45023. -- Resolution: Won't Do > SPIP: Python Stored Procedures > -- > > Key: SPARK-45023 > URL: https://issues.apache.org/jira/browse/SPARK-45023 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > > Stored procedures are an extension of the ANSI SQL standard. They play a > crucial role in improving the capabilities of SQL by encapsulating complex > logic into reusable routines. > This proposal aims to extend Spark SQL by introducing support for stored > procedures, starting with Python as the procedural language. This addition > will allow users to execute procedural programs, leveraging programming > constructs of Python to perform tasks with complex logic. Additionally, users > can persist these procedural routines in catalogs such as HMS for future > reuse. By providing this functionality, we intend to seamlessly empower Spark > users to integrate with Python routines within their SQL workflows. > {*}SPIP{*}: > [https://docs.google.com/document/d/1ce2EZrf2BxHu7TjfGn4TgToK3TBYYzRkmsIVcfmkNzE/edit?usp=sharing] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45600) Separate the Python data source logic from DataFrameReader
Allison Wang created SPARK-45600: Summary: Separate the Python data source logic from DataFrameReader Key: SPARK-45600 URL: https://issues.apache.org/jira/browse/SPARK-45600 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Allison Wang Currently we have added a few instance variables to store information for Python data source reader. We should have a dedicated reader class for Python data source to make the current DataFrameReader clean. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
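The refactor the ticket proposes can be sketched as follows: instead of adding more instance variables to DataFrameReader for Python data sources, all of that state lives in a dedicated reader class. Every name here (PythonDataSourceReader, `_options`, and so on) is a hypothetical illustration, not Spark's actual internals.

```python
from typing import Dict, Optional


class PythonDataSourceReader:
    """Holds all state specific to reading from a Python data source."""

    def __init__(self, format_name: str):
        self.format_name = format_name
        self._options: Dict[str, str] = {}
        self._schema: Optional[str] = None

    def option(self, key: str, value: str) -> "PythonDataSourceReader":
        self._options[key] = value
        return self

    def schema(self, ddl: str) -> "PythonDataSourceReader":
        self._schema = ddl
        return self


class DataFrameReader:
    """Stays clean: it only dispatches to the dedicated reader."""

    def format(self, name: str) -> "PythonDataSourceReader":
        return PythonDataSourceReader(name)
```

The fluent chain `DataFrameReader().format(...).option(...).schema(...)` keeps the public surface unchanged while the Python-source state is confined to the dedicated class.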
[jira] [Updated] (SPARK-45559) Support spark.read.schema(...) for Python data source API
[ https://issues.apache.org/jira/browse/SPARK-45559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-45559: - Description: Support `spark.read.schema(...)` for Python data source read. Add test cases where we send the schema as a string instead of StructType, and a positive case as well as a negative case where it doesn't parse successfully with fromDDL? was:Support `spark.read.schema(...)` for Python data source read > Support spark.read.schema(...) for Python data source API > - > > Key: SPARK-45559 > URL: https://issues.apache.org/jira/browse/SPARK-45559 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > > Support `spark.read.schema(...)` for Python data source read. > Add test cases where we send the schema as a string instead of StructType, > and a positive case as well as a negative case where it doesn't parse > successfully with fromDDL? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
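In real code the schema string would be parsed with `pyspark.sql.types.StructType.fromDDL`; the toy parser below only stands in for it so the shape of the positive and negative test cases the ticket asks for can be shown without a Spark session.

```python
# Hypothetical stand-in for StructType.fromDDL: parses "name TYPE, name TYPE".
VALID_TYPES = {"INT", "STRING", "DOUBLE"}


def parse_ddl(ddl):
    """Parse a 'name TYPE, name TYPE' schema string into (name, type) pairs."""
    fields = []
    for part in ddl.split(","):
        tokens = part.split()
        if len(tokens) != 2 or tokens[1].upper() not in VALID_TYPES:
            raise ValueError("cannot parse field: " + part.strip())
        fields.append((tokens[0], tokens[1].upper()))
    return fields


# Positive case: a schema passed as a string parses to the expected fields.
assert parse_ddl("id INT, value INT") == [("id", "INT"), ("value", "INT")]

# Negative case: a malformed schema string fails to parse.
try:
    parse_ddl("id NOT_A_TYPE")
except ValueError:
    pass
else:
    raise AssertionError("expected a parse error")
```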
[jira] [Created] (SPARK-45597) Support creating table using a Python data source in SQL
Allison Wang created SPARK-45597: Summary: Support creating table using a Python data source in SQL Key: SPARK-45597 URL: https://issues.apache.org/jira/browse/SPARK-45597 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Allison Wang Support creating table using a Python data source in SQL query: For instance: `CREATE TABLE tableName() USING OPTIONS ` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45584) Execution fails when there are subqueries in TakeOrderedAndProjectExec
[ https://issues.apache.org/jira/browse/SPARK-45584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-45584:
----------------------------------
Description:
When there are subqueries in TakeOrderedAndProjectExec, the query can throw this exception:

java.lang.IllegalArgumentException: requirement failed: Subquery subquery#242, [id=#109] has not finished

This is because TakeOrderedAndProjectExec does not wait for subquery execution.

was:
When there are subqueries in TakeOrderedAndProjectExec, the query can throw this exception:

java.lang.IllegalArgumentException: requirement failed: Subquery subquery#242, [id=#109] has not finished

> Execution fails when there are subqueries in TakeOrderedAndProjectExec
> ----------------------------------------------------------------------
>
>                 Key: SPARK-45584
>                 URL: https://issues.apache.org/jira/browse/SPARK-45584
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.5.0
>            Reporter: Allison Wang
>            Priority: Major
>
> When there are subqueries in TakeOrderedAndProjectExec, the query can throw
> this exception:
> java.lang.IllegalArgumentException: requirement failed: Subquery
> subquery#242, [id=#109] has not finished
> This is because TakeOrderedAndProjectExec does not wait for subquery
> execution.
[jira] [Created] (SPARK-45584) Execution fails when there are subqueries in TakeOrderedAndProjectExec
Allison Wang created SPARK-45584: Summary: Execution fails when there are subqueries in TakeOrderedAndProjectExec Key: SPARK-45584 URL: https://issues.apache.org/jira/browse/SPARK-45584 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0 Reporter: Allison Wang When there are subqueries in TakeOrderedAndProjectExec, the query can throw this exception: java.lang.IllegalArgumentException: requirement failed: Subquery subquery#242, [id=#109] has not finished -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45560) Support spark.read.load() with non-empty path for Python data source API
[ https://issues.apache.org/jira/browse/SPARK-45560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-45560: - Summary: Support spark.read.load() with non-empty path for Python data source API (was: Support spark.read.load() with paths for Python data source API) > Support spark.read.load() with non-empty path for Python data source API > > > Key: SPARK-45560 > URL: https://issues.apache.org/jira/browse/SPARK-45560 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > > Support non-empty path for Python data source read: > `spark.read.format(..).load(path)` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45559) Support spark.read.schema(...) for Python data source API
[ https://issues.apache.org/jira/browse/SPARK-45559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-45559: - Summary: Support spark.read.schema(...) for Python data source API (was: Support df.read.schema(...) for Python data source API) > Support spark.read.schema(...) for Python data source API > - > > Key: SPARK-45559 > URL: https://issues.apache.org/jira/browse/SPARK-45559 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > > Support `spark.read.schema(...)` for Python data source read -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45560) Support spark.read.load() with paths for Python data source API
Allison Wang created SPARK-45560: Summary: Support spark.read.load() with paths for Python data source API Key: SPARK-45560 URL: https://issues.apache.org/jira/browse/SPARK-45560 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Allison Wang Support non-empty path for Python data source read: `spark.read.format(..).load(path)` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45559) Support df.read.schema(...) for Python data source API
Allison Wang created SPARK-45559: Summary: Support df.read.schema(...) for Python data source API Key: SPARK-45559 URL: https://issues.apache.org/jira/browse/SPARK-45559 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Allison Wang Support `spark.read.schema(...)` for Python data source read -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45526) Refine docstring of `options` for dataframe reader and writer
Allison Wang created SPARK-45526: Summary: Refine docstring of `options` for dataframe reader and writer Key: SPARK-45526 URL: https://issues.apache.org/jira/browse/SPARK-45526 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 4.0.0 Reporter: Allison Wang Refine the docstring of the `options` method of DataFrameReader and DataFrameWriter. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45525) Initial support for Python data source write API
Allison Wang created SPARK-45525: Summary: Initial support for Python data source write API Key: SPARK-45525 URL: https://issues.apache.org/jira/browse/SPARK-45525 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Allison Wang Support for Python data source write API -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45524) Initial support for Python data source read API
Allison Wang created SPARK-45524: Summary: Initial support for Python data source read API Key: SPARK-45524 URL: https://issues.apache.org/jira/browse/SPARK-45524 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 4.0.0 Reporter: Allison Wang Support Python data source API for reading data. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45509) Investigate the behavior difference in self-join
[ https://issues.apache.org/jira/browse/SPARK-45509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-45509:
----------------------------------
Description:
SPARK-45220 discovers a behavior difference for a self-join scenario between classic Spark and Spark Connect.

For instance, here is the query that works without Spark Connect:

{code:java}
df = spark.createDataFrame([Row(name="Alice", age=2), Row(name="Bob", age=5)])
df2 = spark.createDataFrame([Row(name="Tom", height=80), Row(name="Bob", height=85)]){code}
{code:java}
joined = df.join(df2, df.name == df2.name, "outer").sort(sf.desc(df.name))
joined.show(){code}

But in Spark Connect, it throws this exception:

{code:java}
pyspark.errors.exceptions.connect.AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `name` cannot be resolved. Did you mean one of the following? [`name`, `name`, `age`, `height`].;
'Sort ['name DESC NULLS LAST], true
+- Join FullOuter, (name#64 = name#78)
   :- LocalRelation [name#64, age#65L]
   +- LocalRelation [name#78, height#79L]
{code}

On the other hand, this query fails in classic Spark:

{code:java}
df.join(df, df.name == df.name, "outer").select(df.name).show() {code}
{code:java}
pyspark.errors.exceptions.captured.AnalysisException: Column name#0 are ambiguous... {code}

but this query works with Spark Connect. We need to investigate the behavior difference and fix it.

was:
SPARK-45220 discovers a behavior difference for a self-join scenario between classic Spark and Spark Connect.

For instance, here is the query that works without Spark Connect:

{code:java}
joined = df.join(df2, df.name == df2.name, "outer").sort(sf.desc(df.name))
joined.show(){code}

But in Spark Connect, it throws this exception:

{code:java}
pyspark.errors.exceptions.connect.AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `name` cannot be resolved. Did you mean one of the following? [`name`, `name`, `age`, `height`].;
'Sort ['name DESC NULLS LAST], true
+- Join FullOuter, (name#64 = name#78)
   :- LocalRelation [name#64, age#65L]
   +- LocalRelation [name#78, height#79L]
{code}

On the other hand, this query failed in classic Spark Connect:

{code:java}
df.join(df, df.name == df.name, "outer").select(df.name).show() {code}
{code:java}
pyspark.errors.exceptions.captured.AnalysisException: Column name#0 are ambiguous... {code}

but this query works with Spark Connect. We need to investigate the behavior difference and fix it.

> Investigate the behavior difference in self-join
> ------------------------------------------------
>
>                 Key: SPARK-45509
>                 URL: https://issues.apache.org/jira/browse/SPARK-45509
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect, PySpark
>    Affects Versions: 3.5.0, 4.0.0
>            Reporter: Allison Wang
>            Priority: Major
>
> SPARK-45220 discovers a behavior difference for a self-join scenario between
> classic Spark and Spark Connect.
> For instance, here is the query that works without Spark Connect:
> {code:java}
> df = spark.createDataFrame([Row(name="Alice", age=2), Row(name="Bob", age=5)])
> df2 = spark.createDataFrame([Row(name="Tom", height=80), Row(name="Bob",
> height=85)]){code}
> {code:java}
> joined = df.join(df2, df.name == df2.name, "outer").sort(sf.desc(df.name))
> joined.show(){code}
> But in Spark Connect, it throws this exception:
> {code:java}
> pyspark.errors.exceptions.connect.AnalysisException:
> [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter
> with name `name` cannot be resolved. Did you mean one of the following?
> [`name`, `name`, `age`, `height`].;
> 'Sort ['name DESC NULLS LAST], true
> +- Join FullOuter, (name#64 = name#78)
>    :- LocalRelation [name#64, age#65L]
>    +- LocalRelation [name#78, height#79L]
> {code}
>
> On the other hand, this query fails in classic Spark:
> {code:java}
> df.join(df, df.name == df.name, "outer").select(df.name).show() {code}
> {code:java}
> pyspark.errors.exceptions.captured.AnalysisException: Column name#0 are
> ambiguous... {code}
>
> but this query works with Spark Connect.
> We need to investigate the behavior difference and fix it.
[jira] [Updated] (SPARK-45509) Investigate the behavior difference in self-join
[ https://issues.apache.org/jira/browse/SPARK-45509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-45509: - Description: SPARK-45220 discovers a behavior difference for a self-join scenario between classic Spark and Spark Connect. For instance, here is the query that works without Spark Connect: {code:java} joined = df.join(df2, df.name == df2.name, "outer").sort(sf.desc(df.name)) joined.show(){code} But in Spark Connect, it throws this exception: {code:java} pyspark.errors.exceptions.connect.AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `name` cannot be resolved. Did you mean one of the following? [`name`, `name`, `age`, `height`].; 'Sort ['name DESC NULLS LAST], true +- Join FullOuter, (name#64 = name#78) :- LocalRelation [name#64, age#65L] +- LocalRelation [name#78, height#79L] {code} On the other hand, this query failed in classic Spark Connect: {code:java} df.join(df, df.name == df.name, "outer").select(df.name).show() {code} {code:java} pyspark.errors.exceptions.captured.AnalysisException: Column name#0 are ambiguous... {code} but this query works with Spark Connect. We need to investigate the behavior difference and fix it. was: SPARK-45220 discovers a behavior difference for a self-join scenario between class Spark and Spark Connect. For instance. here is the query that works without Spark Connect: {code:java} joined = df.join(df2, df.name == df2.name, "outer").sort(sf.desc(df.name)) joined.show(){code} But in Spark Connect, it throws this exception: {code:java} pyspark.errors.exceptions.connect.AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `name` cannot be resolved. Did you mean one of the following? 
[`name`, `name`, `age`, `height`].; 'Sort ['name DESC NULLS LAST], true +- Join FullOuter, (name#64 = name#78) :- LocalRelation [name#64, age#65L] +- LocalRelation [name#78, height#79L] {code} On the other hand, this query failed in classic Spark Connect: {code:java} df.join(df, df.name == df.name, "outer").select(df.name).show() {code} {code:java} pyspark.errors.exceptions.captured.AnalysisException: Column name#0 are ambiguous... {code} but this query works with Spark Connect. We need to investigate the behavior difference and fix it. > Investigate the behavior difference in self-join > > > Key: SPARK-45509 > URL: https://issues.apache.org/jira/browse/SPARK-45509 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.5.0, 4.0.0 >Reporter: Allison Wang >Priority: Major > > SPARK-45220 discovers a behavior difference for a self-join scenario between > classic Spark and Spark Connect. > For instance, here is the query that works without Spark Connect: > {code:java} > joined = df.join(df2, df.name == df2.name, "outer").sort(sf.desc(df.name)) > joined.show(){code} > But in Spark Connect, it throws this exception: > {code:java} > pyspark.errors.exceptions.connect.AnalysisException: > [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter > with name `name` cannot be resolved. Did you mean one of the following? > [`name`, `name`, `age`, `height`].; > 'Sort ['name DESC NULLS LAST], true > +- Join FullOuter, (name#64 = name#78) >:- LocalRelation [name#64, age#65L] >+- LocalRelation [name#78, height#79L] > {code} > > On the other hand, this query failed in classic Spark Connect: > {code:java} > df.join(df, df.name == df.name, "outer").select(df.name).show() {code} > {code:java} > pyspark.errors.exceptions.captured.AnalysisException: Column name#0 are > ambiguous... {code} > > but this query works with Spark Connect. > We need to investigate the behavior difference and fix it. 
[jira] [Updated] (SPARK-45509) Investigate the behavior difference in self-join
[ https://issues.apache.org/jira/browse/SPARK-45509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-45509: - Description: SPARK-45220 discovers a behavior difference for a self-join scenario between class Spark and Spark Connect. For instance. here is the query that works without Spark Connect: {code:java} joined = df.join(df2, df.name == df2.name, "outer").sort(sf.desc(df.name)) joined.show(){code} But in Spark Connect, it throws this exception: {code:java} pyspark.errors.exceptions.connect.AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `name` cannot be resolved. Did you mean one of the following? [`name`, `name`, `age`, `height`].; 'Sort ['name DESC NULLS LAST], true +- Join FullOuter, (name#64 = name#78) :- LocalRelation [name#64, age#65L] +- LocalRelation [name#78, height#79L] {code} On the other hand, this query failed in classic Spark Connect: {code:java} df.join(df, df.name == df.name, "outer").select(df.name).show() {code} {code:java} pyspark.errors.exceptions.captured.AnalysisException: Column name#0 are ambiguous... {code} but this query works with Spark Connect. We need to investigate the behavior difference and fix it. was: SAPRK-45220 discovers a behavior difference for a self-join scenario between class Spark and Spark Connect. For instance. here is the query that works without Spark Connect: {code:java} joined = df.join(df2, df.name == df2.name, "outer").sort(sf.desc(df.name)) joined.show(){code} But in Spark Connect, it throws this exception: {code:java} pyspark.errors.exceptions.connect.AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `name` cannot be resolved. Did you mean one of the following? 
[`name`, `name`, `age`, `height`].; 'Sort ['name DESC NULLS LAST], true +- Join FullOuter, (name#64 = name#78) :- LocalRelation [name#64, age#65L] +- LocalRelation [name#78, height#79L] {code} On the other hand, this query failed in classic Spark Connect: {code:java} df.join(df, df.name == df.name, "outer").select(df.name).show() {code} {code:java} pyspark.errors.exceptions.captured.AnalysisException: Column name#0 are ambiguous... {code} but this query works with Spark Connect. We need to investigate the behavior difference and fix it. > Investigate the behavior difference in self-join > > > Key: SPARK-45509 > URL: https://issues.apache.org/jira/browse/SPARK-45509 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.5.0, 4.0.0 >Reporter: Allison Wang >Priority: Major > > SPARK-45220 discovers a behavior difference for a self-join scenario between > class Spark and Spark Connect. > For instance. here is the query that works without Spark Connect: > > {code:java} > joined = df.join(df2, df.name == df2.name, "outer").sort(sf.desc(df.name)) > joined.show(){code} > > But in Spark Connect, it throws this exception: > > {code:java} > pyspark.errors.exceptions.connect.AnalysisException: > [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter > with name `name` cannot be resolved. Did you mean one of the following? > [`name`, `name`, `age`, `height`].; > 'Sort ['name DESC NULLS LAST], true > +- Join FullOuter, (name#64 = name#78) >:- LocalRelation [name#64, age#65L] >+- LocalRelation [name#78, height#79L] > {code} > > On the other hand, this query failed in classic Spark Connect: > > {code:java} > df.join(df, df.name == df.name, "outer").select(df.name).show() {code} > {code:java} > pyspark.errors.exceptions.captured.AnalysisException: Column name#0 are > ambiguous... {code} > > but this query works with Spark Connect. > We need to investigate the behavior difference and fix it. 
[jira] [Created] (SPARK-45509) Investigate the behavior difference in self-join
Allison Wang created SPARK-45509: Summary: Investigate the behavior difference in self-join Key: SPARK-45509 URL: https://issues.apache.org/jira/browse/SPARK-45509 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.5.0, 4.0.0 Reporter: Allison Wang SAPRK-45220 discovers a behavior difference for a self-join scenario between class Spark and Spark Connect. For instance. here is the query that works without Spark Connect: {code:java} joined = df.join(df2, df.name == df2.name, "outer").sort(sf.desc(df.name)) joined.show(){code} But in Spark Connect, it throws this exception: {code:java} pyspark.errors.exceptions.connect.AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `name` cannot be resolved. Did you mean one of the following? [`name`, `name`, `age`, `height`].; 'Sort ['name DESC NULLS LAST], true +- Join FullOuter, (name#64 = name#78) :- LocalRelation [name#64, age#65L] +- LocalRelation [name#78, height#79L] {code} On the other hand, this query failed in classic Spark Connect: {code:java} df.join(df, df.name == df.name, "outer").select(df.name).show() {code} {code:java} pyspark.errors.exceptions.captured.AnalysisException: Column name#0 are ambiguous... {code} but this query works with Spark Connect. We need to investigate the behavior difference and fix it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
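The ambiguity at the heart of this ticket can be illustrated with a stdlib analogy: after `df.join(df2, ...)` the joined output carries two columns named `name`. Resolving `df.name` by name alone is ambiguous, while resolving it together with the id of the plan that produced it (roughly what classic Spark does for DataFrame column references) is unique; and in a true self-join, `df.join(df, ...)`, even the plan id no longer disambiguates, matching the classic-Spark error above. This is an illustration of the resolution problem, not Spark's analyzer.

```python
class Column:
    def __init__(self, name, plan_id):
        self.name = name
        self.plan_id = plan_id


def resolve_by_name(output, name):
    # Spark-Connect-style lookup in this analogy: the name alone.
    matches = [c for c in output if c.name == name]
    if len(matches) != 1:
        raise ValueError("column %r is ambiguous: %d matches" % (name, len(matches)))
    return matches[0]


def resolve_by_plan_id(output, name, plan_id):
    # Classic-Spark-style lookup in this analogy: name plus originating plan.
    matches = [c for c in output if c.name == name and c.plan_id == plan_id]
    if len(matches) != 1:
        raise ValueError("column %r from plan %d is ambiguous" % (name, plan_id))
    return matches[0]


# df (plan 1) joined with df2 (plan 2): "name" appears twice in the output.
joined = [Column("name", 1), Column("age", 1),
          Column("name", 2), Column("height", 2)]

# df.join(df, ...): both "name" columns come from the same plan.
self_joined = [Column("name", 1), Column("name", 1)]
```

`resolve_by_name(joined, "name")` fails the way the Connect error does, `resolve_by_plan_id(joined, "name", 1)` succeeds, and `resolve_by_plan_id(self_joined, "name", 1)` fails the way classic Spark's "Column name#0 are ambiguous" error does.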