[jira] [Updated] (SPARK-49145) Improve readability of log4j console log output

2024-08-07 Thread Amanda Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-49145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amanda Liu updated SPARK-49145:
---
Description: 
Prior to this update, the OSS Spark logs were difficult to interpret. The logs 
followed a JSON output format, which is not optimal for human consumption:

 
{code:java}
{"ts":"2024-07-26T18:45:17.712Z","level":"INFO","msg":"Running Spark version 4.0.0-SNAPSHOT","context":{"spark_version":"4.0.0-SNAPSHOT"},"logger":"SparkContext"}
{"ts":"2024-07-26T18:45:17.715Z","level":"INFO","msg":"OS info Mac OS X, 14.4.1, aarch64","context":{"os_arch":"aarch64","os_name":"Mac OS X","os_version":"14.4.1"},"logger":"SparkContext"}
{"ts":"2024-07-26T18:45:17.716Z","level":"INFO","msg":"Java version 17.0.11","context":{"java_version":"17.0.11"},"logger":"SparkContext"}
{"ts":"2024-07-26T18:45:17.761Z","level":"WARN","msg":"Unable to load native-hadoop library for your platform... using builtin-java classes where applicable","logger":"NativeCodeLoader"}
{"ts":"2024-07-26T18:45:17.783Z","level":"INFO","msg":"==","logger":"ResourceUtils"}
{"ts":"2024-07-26T18:45:17.783Z","level":"INFO","msg":"No custom resources configured for spark.driver.","logger":"ResourceUtils"}
{"ts":"2024-07-26T18:45:17.783Z","level":"INFO","msg":"==","logger":"ResourceUtils"}
{"ts":"2024-07-26T18:45:17.784Z","level":"INFO","msg":"Submitted application: Spark Pi","context":{"app_name":"Spark Pi"},"logger":"SparkContext"}
...
{"ts":"2024-07-26T18:45:18.036Z","level":"INFO","msg":"Start Jetty 0.0.0.0:4040 for SparkUI","context":{"host":"0.0.0.0","port":"4040","server_name":"SparkUI"},"logger":"JettyUtils"}
{"ts":"2024-07-26T18:45:18.044Z","level":"INFO","msg":"jetty-11.0.20; built: 2024-01-29T21:04:22.394Z; git: 922f8dc188f7011e60d0361de585fd4ac4d63064; jvm 17.0.11+9-LTS","logger":"Server"}
{"ts":"2024-07-26T18:45:18.054Z","level":"INFO","msg":"Started Server@22c75c01{STARTING}[11.0.20,sto=3] @1114ms","logger":"Server"}
{code}
 

 

This issue updates the default `log4j.properties.template` with the following 
improvements to the console logging format:
 * Use PatternLayout for improved human readability
 * Color-code log levels to simplify logging output
 * Visually partition the threadName and contextInfo for easy interpretation
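
The exact template changes are in the linked PR; as a rough, hypothetical sketch (the appender names and the pattern string here are invented, not the committed template), a Log4j 2 properties console appender combining these three points might look like:

```properties
# Human-readable console appender using PatternLayout (sketch only).
appender.console.type = Console
appender.console.name = console
appender.console.target = SYSTEM_ERR
appender.console.layout.type = PatternLayout
# %highlight color-codes the level; [%t] sets the thread name apart.
appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss} %highlight{%p}{FATAL=red blink, ERROR=red, WARN=yellow, INFO=green} %c{1}: [%t] %m%n%ex

rootLogger.level = info
rootLogger.appenderRef.console.ref = console
```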


> Improve readability of log4j console log output
> ---
>
> Key: SPARK-49145
> URL: https://issues.apache.org/jira/browse/SPARK-49145
> Project: Spark
>  Issue Type: Task
>  

[jira] [Created] (SPARK-49145) Improve readability of log4j console log output

2024-08-07 Thread Amanda Liu (Jira)
Amanda Liu created SPARK-49145:
--

 Summary: Improve readability of log4j console log output
 Key: SPARK-49145
 URL: https://issues.apache.org/jira/browse/SPARK-49145
 Project: Spark
  Issue Type: Task
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Amanda Liu




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-48759) Add migration doc for CREATE TABLE AS SELECT behavior change since Spark 3.4

2024-06-30 Thread Amanda Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amanda Liu updated SPARK-48759:
---
Summary: Add migration doc for CREATE TABLE AS SELECT behavior change since 
Spark 3.4  (was: Add migration doc for CREATE TABLE behavior change since 
Spark 3.4)

> Add migration doc for CREATE TABLE AS SELECT behavior change since Spark 3.4
> 
>
> Key: SPARK-48759
> URL: https://issues.apache.org/jira/browse/SPARK-48759
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Amanda Liu
>Priority: Minor
>







[jira] [Created] (SPARK-48759) Add migration doc for CREATE TABLE behavior change since Spark 3.4

2024-06-30 Thread Amanda Liu (Jira)
Amanda Liu created SPARK-48759:
--

 Summary: Add migration doc for CREATE TABLE behavior change since Spark 3.4
 Key: SPARK-48759
 URL: https://issues.apache.org/jira/browse/SPARK-48759
 Project: Spark
  Issue Type: Task
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Amanda Liu









[jira] [Created] (SPARK-48740) Catch missing window specification error early

2024-06-27 Thread Amanda Liu (Jira)
Amanda Liu created SPARK-48740:
--

 Summary: Catch missing window specification error early
 Key: SPARK-48740
 URL: https://issues.apache.org/jira/browse/SPARK-48740
 Project: Spark
  Issue Type: Task
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Amanda Liu


Previously, aggregate queries containing a window function without a window 
specification (e.g. `PARTITION BY`) returned a non-descriptive internal 
error message:

`org.apache.spark.sql.catalyst.analysis.UnresolvedException: [INTERNAL_ERROR] 
Invalid call to exprId on unresolved object SQLSTATE: XX000`

This PR catches the user error early and returns a more accurate description of 
the issue:

`Window specification  is not defined in the WINDOW clause.`
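
As a hypothetical illustration (the table, column, and window names here are invented), the kind of query that triggers this error is an aggregate that references a named window that no WINDOW clause defines:

```python
# Hypothetical query (names invented) that references window "w" without a
# "WINDOW w AS (...)" definition. Previously this surfaced the opaque
# UnresolvedException; with the fix, Spark reports the missing window
# specification directly during analysis.
bad_query = """
SELECT name, sum(salary) OVER w AS total_salary
FROM employees
GROUP BY name
"""

# Sanity check: the query uses a named window but never defines it.
assert " OVER w" in bad_query and "WINDOW w AS" not in bad_query
```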






[jira] [Created] (SPARK-48676) Structured Logging Framework Scala Style Migration [Part 2]

2024-06-20 Thread Amanda Liu (Jira)
Amanda Liu created SPARK-48676:
--

 Summary: Structured Logging Framework Scala Style Migration [Part 
2]
 Key: SPARK-48676
 URL: https://issues.apache.org/jira/browse/SPARK-48676
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Amanda Liu









[jira] [Resolved] (SPARK-48632) Remove unused LogKeys

2024-06-14 Thread Amanda Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amanda Liu resolved SPARK-48632.

Resolution: Not A Problem

> Remove unused LogKeys
> -
>
> Key: SPARK-48632
> URL: https://issues.apache.org/jira/browse/SPARK-48632
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Amanda Liu
>Priority: Major
>  Labels: pull-request-available
>
> Remove unused LogKey objects to clean up LogKey.scala






[jira] [Created] (SPARK-48632) Remove unused LogKeys

2024-06-14 Thread Amanda Liu (Jira)
Amanda Liu created SPARK-48632:
--

 Summary: Remove unused LogKeys
 Key: SPARK-48632
 URL: https://issues.apache.org/jira/browse/SPARK-48632
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Amanda Liu


Remove unused LogKey objects to clean up LogKey.scala






[jira] [Created] (SPARK-48623) Structured Logging Framework Scala Style Migration

2024-06-13 Thread Amanda Liu (Jira)
Amanda Liu created SPARK-48623:
--

 Summary: Structured Logging Framework Scala Style Migration
 Key: SPARK-48623
 URL: https://issues.apache.org/jira/browse/SPARK-48623
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Amanda Liu









[jira] [Created] (SPARK-48592) Add scala style check for logging message inline variables

2024-06-11 Thread Amanda Liu (Jira)
Amanda Liu created SPARK-48592:
--

 Summary: Add scala style check for logging message inline variables
 Key: SPARK-48592
 URL: https://issues.apache.org/jira/browse/SPARK-48592
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Amanda Liu


Ban log messages built with logInfo, logWarning, or logError that interpolate 
variables without wrapping them in {{MDC}}
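
A toy sketch in Python (not the actual Scalastyle rule added to Spark) of the kind of check this describes, flagging log calls whose message interpolates a `${...}` variable that is not wrapped in `MDC(...)`:

```python
import re

# Toy regex (not Spark's real style rule): match logInfo/logWarning/logError
# calls whose interpolated string contains ${...} not starting with MDC(.
BANNED = re.compile(r'log(?:Info|Warning|Error)\([a-z]*"[^"]*\$\{(?!MDC\()')

ok = 'logInfo(log"Task ${MDC(TASK_ID, taskId)} finished")'
bad = 'logInfo(s"Task ${taskId} finished")'

assert BANNED.search(bad) is not None  # interpolates taskId directly
assert BANNED.search(ok) is None       # variable is wrapped in MDC
```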






[jira] [Created] (SPARK-46910) Eliminate JDK Requirement in PySpark Installation

2024-01-29 Thread Amanda Liu (Jira)
Amanda Liu created SPARK-46910:
--

 Summary: Eliminate JDK Requirement in PySpark Installation
 Key: SPARK-46910
 URL: https://issues.apache.org/jira/browse/SPARK-46910
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Amanda Liu


PySpark requires users to have the correct JDK version (JDK 8+ for Spark<4; JDK 
17+ for Spark>=4) installed locally.

We can make the Spark installation script install the JDK, so users don’t need 
to do this step manually.
h1. Details
 # When the entry point for a Spark class is invoked, the spark-class script 
checks if Java is installed in the user environment.

 # If Java is not installed, the user is prompted to select whether they want 
to install JDK 17.

 # If the user selects yes, JDK 17 is installed (using the [install-jdk 
library|https://pypi.org/project/install-jdk/]) and the JAVA_HOME and RUNNER 
variables are set appropriately. The Spark build will then work.

 # If the user selects no, we provide them a brief description of how to 
install JDK manually.
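
A minimal sketch of step 1 in Python (the real check lives in the spark-class shell script; the function name here is invented):

```python
import os
import shutil

def java_available() -> bool:
    # Rough model of the spark-class check: prefer JAVA_HOME if it is set,
    # otherwise look for a `java` executable on the PATH.
    java_home = os.environ.get("JAVA_HOME")
    if java_home:
        return os.path.exists(os.path.join(java_home, "bin", "java"))
    return shutil.which("java") is not None

# Steps 2-4 (prompting the user, installing JDK 17 via install-jdk, and
# exporting JAVA_HOME and RUNNER) would branch on this result.
```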






[jira] [Created] (SPARK-45729) Fix PySpark testing guide links

2023-10-30 Thread Amanda Liu (Jira)
Amanda Liu created SPARK-45729:
--

 Summary: Fix PySpark testing guide links
 Key: SPARK-45729
 URL: https://issues.apache.org/jira/browse/SPARK-45729
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Amanda Liu









[jira] [Updated] (SPARK-44712) Migrate test_timedelta_ops assert_eq to use assertDataFrameEqual

2023-08-07 Thread Amanda Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amanda Liu updated SPARK-44712:
---
Description: Migrate assert_eq to assertDataFrameEqual in this file: 
[python/pyspark/pandas/tests/data_type_ops/test_timedelta_ops.py|https://github.com/databricks/runtime/blob/f579860299b16f6614f70c7cf2509cd89816d363/python/pyspark/pandas/tests/data_type_ops/test_timedelta_ops.py#L176]
  (was: Migrate assert_eq to assertDataFrameEqual in this file: 
[python/pyspark/pandas/tests/indexes/test_reset_index.py|https://github.com/apache/spark/blob/42e5daddf3ba16ff7d08e82e51cd8924cc56e180/python/pyspark/pandas/tests/indexes/test_reset_index.py#L46])

> Migrate test_timedelta_ops assert_eq to use assertDataFrameEqual
> -
>
> Key: SPARK-44712
> URL: https://issues.apache.org/jira/browse/SPARK-44712
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Amanda Liu
>Priority: Major
>
> Migrate assert_eq to assertDataFrameEqual in this file: 
> [python/pyspark/pandas/tests/data_type_ops/test_timedelta_ops.py|https://github.com/databricks/runtime/blob/f579860299b16f6614f70c7cf2509cd89816d363/python/pyspark/pandas/tests/data_type_ops/test_timedelta_ops.py#L176]






[jira] [Created] (SPARK-44712) Migrate test_timedelta_ops assert_eq to use assertDataFrameEqual

2023-08-07 Thread Amanda Liu (Jira)
Amanda Liu created SPARK-44712:
--

 Summary: Migrate test_timedelta_ops assert_eq to use assertDataFrameEqual
 Key: SPARK-44712
 URL: https://issues.apache.org/jira/browse/SPARK-44712
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Amanda Liu


Migrate assert_eq to assertDataFrameEqual in this file: 
[python/pyspark/pandas/tests/indexes/test_reset_index.py|https://github.com/apache/spark/blob/42e5daddf3ba16ff7d08e82e51cd8924cc56e180/python/pyspark/pandas/tests/indexes/test_reset_index.py#L46]






[jira] [Updated] (SPARK-44711) Migrate test_series_conversion assert_eq to use assertDataFrameEqual

2023-08-07 Thread Amanda Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amanda Liu updated SPARK-44711:
---
Description: 
Migrate assert_eq to assertDataFrameEqual in this file: 
[python/pyspark/pandas/tests/test_series_conversion.py|https://github.com/databricks/runtime/blob/d162bd182de8bcea180d874027edb86ae8fc60e5/python/pyspark/pandas/tests/test_series_conversion.py#L63]

  was:Migrate assert_eq to assertDataFrameEqual in this file: 
[python/pyspark/pandas/tests/indexes/test_reset_index.py|https://github.com/apache/spark/blob/42e5daddf3ba16ff7d08e82e51cd8924cc56e180/python/pyspark/pandas/tests/indexes/test_reset_index.py#L46]


> Migrate test_series_conversion assert_eq to use assertDataFrameEqual
> 
>
> Key: SPARK-44711
> URL: https://issues.apache.org/jira/browse/SPARK-44711
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Amanda Liu
>Priority: Major
>
> Migrate assert_eq to assertDataFrameEqual in this file: 
> [python/pyspark/pandas/tests/test_series_conversion.py|https://github.com/databricks/runtime/blob/d162bd182de8bcea180d874027edb86ae8fc60e5/python/pyspark/pandas/tests/test_series_conversion.py#L63]






[jira] [Created] (SPARK-44711) Migrate test_series_conversion assert_eq to use assertDataFrameEqual

2023-08-07 Thread Amanda Liu (Jira)
Amanda Liu created SPARK-44711:
--

 Summary: Migrate test_series_conversion assert_eq to use 
assertDataFrameEqual
 Key: SPARK-44711
 URL: https://issues.apache.org/jira/browse/SPARK-44711
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Amanda Liu


Migrate assert_eq to assertDataFrameEqual in this file: 
[python/pyspark/pandas/tests/indexes/test_reset_index.py|https://github.com/apache/spark/blob/42e5daddf3ba16ff7d08e82e51cd8924cc56e180/python/pyspark/pandas/tests/indexes/test_reset_index.py#L46]






[jira] [Created] (SPARK-44708) Migrate test_reset_index assert_eq to use assertDataFrameEqual

2023-08-07 Thread Amanda Liu (Jira)
Amanda Liu created SPARK-44708:
--

 Summary: Migrate test_reset_index assert_eq to use 
assertDataFrameEqual
 Key: SPARK-44708
 URL: https://issues.apache.org/jira/browse/SPARK-44708
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Amanda Liu


Migrate assert_eq to assertDataFrameEqual in this file: 
[python/pyspark/pandas/tests/indexes/test_reset_index.py|https://github.com/apache/spark/blob/42e5daddf3ba16ff7d08e82e51cd8924cc56e180/python/pyspark/pandas/tests/indexes/test_reset_index.py#L46]






[jira] [Updated] (SPARK-44597) Migrate test_sql assert_eq to use assertDataFrameEqual

2023-08-07 Thread Amanda Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amanda Liu updated SPARK-44597:
---
Description: Migrate tests to new test utils in this file: 
python/pyspark/pandas/tests/test_sql.py  (was: The Jira ticket [SPARK-44042] 
SPIP: PySpark Test Framework  introduces a new PySpark test framework. Some of 
the user-facing testing util APIs include assertDataFrameEqual, 
assertSchemaEqual, and assertPandasOnSparkEqual.

With the new testing framework, we should migrate old tests in the Spark 
codebase to use the new testing utils.)

> Migrate test_sql assert_eq to use assertDataFrameEqual
> --
>
> Key: SPARK-44597
> URL: https://issues.apache.org/jira/browse/SPARK-44597
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Amanda Liu
>Assignee: Amanda Liu
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>
> Migrate tests to new test utils in this file: 
> python/pyspark/pandas/tests/test_sql.py






[jira] [Updated] (SPARK-44589) Migrate PySpark tests to use PySpark built-in test utils

2023-08-07 Thread Amanda Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amanda Liu updated SPARK-44589:
---
Description: 
The Jira ticket SPARK-44042 (SPIP: PySpark Test Framework) introduces a new 
PySpark test framework. Some of the user-facing testing util APIs include 
assertDataFrameEqual, assertSchemaEqual, and assertPandasOnSparkEqual.

With the new testing framework, we should migrate old tests in the Spark 
codebase to use the new testing utils.

  was:Migrate existing tests in the PySpark codebase to use the new PySpark 
test utils, outlined here: https://issues.apache.org/jira/browse/SPARK-44042


> Migrate PySpark tests to use PySpark built-in test utils
> 
>
> Key: SPARK-44589
> URL: https://issues.apache.org/jira/browse/SPARK-44589
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Amanda Liu
>Priority: Major
>
> The Jira ticket SPARK-44042 (SPIP: PySpark Test Framework) introduces a new 
> PySpark test framework. Some of the user-facing testing util APIs include 
> assertDataFrameEqual, assertSchemaEqual, and assertPandasOnSparkEqual.
> With the new testing framework, we should migrate old tests in the Spark 
> codebase to use the new testing utils.






[jira] [Created] (SPARK-44682) Make pandas error class message_parameters strings

2023-08-04 Thread Amanda Liu (Jira)
Amanda Liu created SPARK-44682:
--

 Summary: Make pandas error class message_parameters strings
 Key: SPARK-44682
 URL: https://issues.apache.org/jira/browse/SPARK-44682
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Amanda Liu









[jira] [Updated] (SPARK-44548) Add support for pandas-on-Spark DataFrame assertDataFrameEqual

2023-08-03 Thread Amanda Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amanda Liu updated SPARK-44548:
---
Summary: Add support for pandas-on-Spark DataFrame assertDataFrameEqual  
(was: Add support for pandas DataFrame assertDataFrameEqual)

> Add support for pandas-on-Spark DataFrame assertDataFrameEqual
> --
>
> Key: SPARK-44548
> URL: https://issues.apache.org/jira/browse/SPARK-44548
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Amanda Liu
>Assignee: Amanda Liu
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>
> SPIP: 
> https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v






[jira] [Created] (SPARK-44665) Add support for pandas DataFrame assertDataFrameEqual

2023-08-03 Thread Amanda Liu (Jira)
Amanda Liu created SPARK-44665:
--

 Summary: Add support for pandas DataFrame assertDataFrameEqual
 Key: SPARK-44665
 URL: https://issues.apache.org/jira/browse/SPARK-44665
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Amanda Liu









[jira] [Created] (SPARK-44652) Raise error when only one df is None

2023-08-02 Thread Amanda Liu (Jira)
Amanda Liu created SPARK-44652:
--

 Summary: Raise error when only one df is None
 Key: SPARK-44652
 URL: https://issues.apache.org/jira/browse/SPARK-44652
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Amanda Liu









[jira] [Updated] (SPARK-44645) Update assertDataFrameEqual docs error example output

2023-08-02 Thread Amanda Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amanda Liu updated SPARK-44645:
---
Summary: Update assertDataFrameEqual docs error example output  (was: 
Update assertDataFrame docs error example output)

> Update assertDataFrameEqual docs error example output
> -
>
> Key: SPARK-44645
> URL: https://issues.apache.org/jira/browse/SPARK-44645
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Amanda Liu
>Priority: Major
>







[jira] [Created] (SPARK-44645) Update assertDataFrame docs error example output

2023-08-02 Thread Amanda Liu (Jira)
Amanda Liu created SPARK-44645:
--

 Summary: Update assertDataFrame docs error example output
 Key: SPARK-44645
 URL: https://issues.apache.org/jira/browse/SPARK-44645
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Amanda Liu









[jira] [Created] (SPARK-44629) Publish PySpark Test Guidelines webpage

2023-08-01 Thread Amanda Liu (Jira)
Amanda Liu created SPARK-44629:
--

 Summary: Publish PySpark Test Guidelines webpage
 Key: SPARK-44629
 URL: https://issues.apache.org/jira/browse/SPARK-44629
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Amanda Liu









[jira] [Created] (SPARK-44617) Support comparison between list of Rows

2023-07-31 Thread Amanda Liu (Jira)
Amanda Liu created SPARK-44617:
--

 Summary: Support comparison between list of Rows
 Key: SPARK-44617
 URL: https://issues.apache.org/jira/browse/SPARK-44617
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Amanda Liu









[jira] [Updated] (SPARK-44617) Support comparison between lists of Rows

2023-07-31 Thread Amanda Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amanda Liu updated SPARK-44617:
---
Summary: Support comparison between lists of Rows  (was: Support comparison 
between list of Rows)

> Support comparison between lists of Rows
> 
>
> Key: SPARK-44617
> URL: https://issues.apache.org/jira/browse/SPARK-44617
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Amanda Liu
>Priority: Major
>







[jira] [Updated] (SPARK-44597) Migrate test_sql assert_eq to use assertDataFrameEqual

2023-07-31 Thread Amanda Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amanda Liu updated SPARK-44597:
---
Description: 
The Jira ticket [[SPARK-44042] SPIP: PySpark Test Framework 
|https://issues.apache.org/jira/browse/SPARK-44042] introduces a new PySpark 
test framework. Some of the user-facing testing util APIs include 
assertDataFrameEqual, assertSchemaEqual, and assertPandasOnSparkEqual.

 

With the new testing framework, we should migrate old tests in the Spark 
codebase to use the new testing utils.
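The tolerance-based row comparison these utils perform can be sketched in plain Python, with no Spark required. This is only an illustrative model of the semantics, not the real `pyspark.testing` implementation; the helper names `rows_equal` and `assert_rows_equal` are invented for this sketch:

```python
import math


def rows_equal(r1, r2, rtol=1e-5, atol=1e-8):
    """Approximate row comparison, mirroring (not reproducing) the
    tolerance-based semantics of assertDataFrameEqual."""
    if len(r1) != len(r2):
        return False
    for a, b in zip(r1, r2):
        if isinstance(a, float) and isinstance(b, float):
            # Treat NaN as equal to NaN, so NaN-producing paths compare cleanly.
            if math.isnan(a) and math.isnan(b):
                continue
            if not math.isclose(a, b, rel_tol=rtol, abs_tol=atol):
                return False
        elif a != b:
            return False
    return True


def assert_rows_equal(actual, expected):
    """Raise with the list of mismatched row indices, akin to the
    diff-style error message the testing utils produce."""
    mismatches = [(i, a, e) for i, (a, e) in enumerate(zip(actual, expected))
                  if not rows_equal(a, e)]
    if len(actual) != len(expected) or mismatches:
        raise AssertionError(f"rows differ: {mismatches}")


# Passes: values within relative tolerance, NaN matches NaN.
assert_rows_equal([(1, 1.0), (2, float("nan"))],
                  [(1, 1.0000001), (2, float("nan"))])
```

A migration from a hand-rolled `assert_eq` to the real util mostly amounts to swapping in this kind of tolerant, diff-reporting comparison.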

> Migrate test_sql assert_eq to use assertDataFrameEqual
> --
>
> Key: SPARK-44597
> URL: https://issues.apache.org/jira/browse/SPARK-44597
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Amanda Liu
>Assignee: Amanda Liu
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>
> The Jira ticket [[SPARK-44042] SPIP: PySpark Test Framework 
> |https://issues.apache.org/jira/browse/SPARK-44042] introduces a new PySpark 
> test framework. Some of the user-facing testing util APIs include 
> assertDataFrameEqual, assertSchemaEqual, and assertPandasOnSparkEqual.
>  
> With the new testing framework, we should migrate old tests in the Spark 
> codebase to use the new testing utils.






[jira] [Updated] (SPARK-44597) Migrate test_sql assert_eq to use assertDataFrameEqual

2023-07-31 Thread Amanda Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amanda Liu updated SPARK-44597:
---
Description: 
The Jira ticket [SPARK-44042] SPIP: PySpark Test Framework  introduces a new 
PySpark test framework. Some of the user-facing testing util APIs include 
assertDataFrameEqual, assertSchemaEqual, and assertPandasOnSparkEqual.

With the new testing framework, we should migrate old tests in the Spark 
codebase to use the new testing utils.

  was:
The Jira ticket [[SPARK-44042] SPIP: PySpark Test Framework 
|https://issues.apache.org/jira/browse/SPARK-44042] introduces a new PySpark 
test framework. Some of the user-facing testing util APIs include 
assertDataFrameEqual, assertSchemaEqual, and assertPandasOnSparkEqual.

 

With the new testing framework, we should migrate old tests in the Spark 
codebase to use the new testing utils.


> Migrate test_sql assert_eq to use assertDataFrameEqual
> --
>
> Key: SPARK-44597
> URL: https://issues.apache.org/jira/browse/SPARK-44597
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Amanda Liu
>Assignee: Amanda Liu
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>
> The Jira ticket [SPARK-44042] SPIP: PySpark Test Framework  introduces a new 
> PySpark test framework. Some of the user-facing testing util APIs include 
> assertDataFrameEqual, assertSchemaEqual, and assertPandasOnSparkEqual.
> With the new testing framework, we should migrate old tests in the Spark 
> codebase to use the new testing utils.






[jira] [Updated] (SPARK-44603) Add pyspark.testing to setup.py

2023-07-30 Thread Amanda Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amanda Liu updated SPARK-44603:
---
Summary: Add pyspark.testing to setup.py  (was: Add pyspark.testing.utils 
to setup.py)

> Add pyspark.testing to setup.py
> ---
>
> Key: SPARK-44603
> URL: https://issues.apache.org/jira/browse/SPARK-44603
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Amanda Liu
>Priority: Major
>







[jira] [Created] (SPARK-44603) Add pyspark.testing.utils to Python Setup

2023-07-30 Thread Amanda Liu (Jira)
Amanda Liu created SPARK-44603:
--

 Summary: Add pyspark.testing.utils to Python Setup
 Key: SPARK-44603
 URL: https://issues.apache.org/jira/browse/SPARK-44603
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Amanda Liu









[jira] [Updated] (SPARK-44603) Add pyspark.testing.utils to setup.py

2023-07-30 Thread Amanda Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amanda Liu updated SPARK-44603:
---
Summary: Add pyspark.testing.utils to setup.py  (was: Add 
pyspark.testing.utils to Python Setup)

> Add pyspark.testing.utils to setup.py
> -
>
> Key: SPARK-44603
> URL: https://issues.apache.org/jira/browse/SPARK-44603
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Amanda Liu
>Priority: Major
>







[jira] [Created] (SPARK-44597) Migrate test_sql assert_eq to use assertDataFrameEqual

2023-07-29 Thread Amanda Liu (Jira)
Amanda Liu created SPARK-44597:
--

 Summary: Migrate test_sql assert_eq to use assertDataFrameEqual
 Key: SPARK-44597
 URL: https://issues.apache.org/jira/browse/SPARK-44597
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Amanda Liu









[jira] [Created] (SPARK-44596) Fix pandas-on-Spark type checks for assertDataFrameEqual

2023-07-29 Thread Amanda Liu (Jira)
Amanda Liu created SPARK-44596:
--

 Summary: Fix pandas-on-Spark type checks for assertDataFrameEqual
 Key: SPARK-44596
 URL: https://issues.apache.org/jira/browse/SPARK-44596
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Amanda Liu
Assignee: Amanda Liu
 Fix For: 3.5.0, 4.0.0


SPIP: 
https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v






[jira] [Created] (SPARK-44589) Migrate PySpark tests to use PySpark built-in test utils

2023-07-28 Thread Amanda Liu (Jira)
Amanda Liu created SPARK-44589:
--

 Summary: Migrate PySpark tests to use PySpark built-in test utils
 Key: SPARK-44589
 URL: https://issues.apache.org/jira/browse/SPARK-44589
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Amanda Liu


Migrate existing tests in the PySpark codebase to use the new PySpark test 
utils, outlined here: https://issues.apache.org/jira/browse/SPARK-44042






[jira] [Updated] (SPARK-44218) Customize diff log in assertDataFrameEqual error message format

2023-07-28 Thread Amanda Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amanda Liu updated SPARK-44218:
---
Summary: Customize diff log in assertDataFrameEqual error message format  
(was: Customize context_diff in assertDataFrameEqual error message format)

> Customize diff log in assertDataFrameEqual error message format
> ---
>
> Key: SPARK-44218
> URL: https://issues.apache.org/jira/browse/SPARK-44218
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Amanda Liu
>Priority: Major
>
> SPIP: 
> https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v






[jira] [Updated] (SPARK-44218) Customize context_diff in assertDataFrameEqual error message format

2023-07-28 Thread Amanda Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amanda Liu updated SPARK-44218:
---
Summary: Customize context_diff in assertDataFrameEqual error message 
format  (was: Add improved error message formatting for 
assert_approx_df_equality)

> Customize context_diff in assertDataFrameEqual error message format
> ---
>
> Key: SPARK-44218
> URL: https://issues.apache.org/jira/browse/SPARK-44218
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Amanda Liu
>Priority: Major
>
> SPIP: 
> https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v






[jira] [Updated] (SPARK-44546) Add a dev utility to generate PySpark tests with LLM

2023-07-25 Thread Amanda Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amanda Liu updated SPARK-44546:
---
Description: 
h2. Summary

This ticket adds a dev utility script to help generate PySpark tests using LLM 
response. The purpose of this experimental script is to encourage PySpark 
developers to test their code thoroughly, to avoid introducing regressions in 
the codebase. Historically, PySpark has had code regressions due to 
insufficient testing of public APIs.

Below, we outline some common edge case scenarios for PySpark DataFrame APIs, 
from the perspective of arguments. Many of these edge cases are passed into the 
LLM through the script's base prompt.

Please note that this list is not exhaustive, but rather a starting point. Some 
of these cases may not apply, depending on the situation. We encourage all 
PySpark developers to carefully consider edge case scenarios when writing tests.
h2. Table of Contents
 # None
 # Ints
 # Floats
 # Strings
 # Single column / column name
 # Multi column / column names
 # DataFrame argument

h3. 1. None
 * Empty input
 * None type

h3. 2. Ints
 * Negatives
 * 0
 * value > Int.MaxValue
 * value < Int.MinValue

h3. 3. Floats
 * Negatives
 * 0.0
 * Float("nan")
 * Float("inf")
 * Float("-inf")
 * decimal.Decimal
 * numpy.float16

h3. 4. Strings
 * Special characters
 * Spaces
 * Empty strings

h3. 5. Single column / column name
 * Non-existent column
 * Empty column name
 * Column name with special characters, e.g. dots
 * Multiple columns with the same name
 * Nested column vs. quoted column name, e.g. 'a.b.c' vs '`a.b.c`'
 * Column of special types, e.g. nested type
 * Column containing special values, e.g. null

h3. 6. Multi column / column names
 * Empty input, e.g. DataFrame.drop()
 * Special cases for each single column
 * Mix of columns and column names, e.g. DataFrame.drop("col1", df.col2, "col3")
 * Duplicated columns, e.g. DataFrame.drop("col1", col("col1"))

h3. 7. DataFrame argument
 * DataFrame argument
 * Empty DataFrame, e.g. spark.range(5).limit(0)
 * DataFrame with 0 columns, e.g. spark.range(5).drop('id')
 * Dataset with repeated arguments
 * Local dataset (pd.DataFrame) containing an unsupported datatype
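Several of the float and int cases above trip up naive equality checks at the Python level before Spark is even involved. A minimal stdlib-only demonstration (no PySpark required; variable names are illustrative):

```python
import math

# Values from the "Floats" checklist, at the plain-Python level.
nan = float("nan")
pos_inf = float("inf")
neg_inf = float("-inf")

# NaN is never == to anything, including itself, so a generated test
# that compares results with == will misreport NaN-producing paths.
assert nan != nan
assert math.isnan(nan)

# Infinities behave consistently under comparison and negation.
assert pos_inf > 1.7e308
assert neg_inf == -pos_inf

# JVM Int bounds matter even though Python ints are unbounded:
# Spark's IntegerType overflows where a Python int would not.
INT_MAX = 2**31 - 1
assert INT_MAX + 1 > INT_MAX  # fine in Python; out of range for a 32-bit column
```

This is why the checklist asks for explicit NaN, infinity, and out-of-range test values rather than relying on plain equality.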

  was:
h2. Summary

This ticket adds a dev utility script to help generate PySpark tests using LLM 
response. The purpose of this experimental script is to encourage PySpark 
developers to test their code thoroughly, to avoid introducing regressions in 
the codebase. Historically, PySpark has had code regressions due to 
insufficient testing of public APIs (see 
[https://databricks.atlassian.net/browse/ES-705815]).

Below, we outline some common edge case scenarios for PySpark DataFrame APIs, 
from the perspective of arguments. Many of these edge cases are passed into the 
LLM through the script's base prompt.

Please note that this list is not exhaustive, but rather a starting point. Some 
of these cases may not apply, depending on the situation. We encourage all 
PySpark developers to carefully consider edge case scenarios when writing tests.
h2. Table of Contents
 # None
 # Ints
 # Floats
 # Strings
 # Single column / column name
 # Multi column / column names
 # DataFrame argument

h3. 1. None
 * Empty input
 * None type

h3. 2. Ints
 * Negatives
 * 0
 * value > Int.MaxValue
 * value < Int.MinValue

h3. 3. Floats
 * Negatives
 * 0.0
 * Float(“nan”)
 * Float("inf")
 * Float("-inf")
 * decimal.Decimal
 * numpy.float16

h3. 4. Strings
 * Special characters
 * Spaces
 * Empty strings

h3. 5. Single column / column name
 * Non-existent column
 * Empty column name
 * Column name with special characters, e.g. dots
 * Multi columns with the same name
 * Nested column vs. quoted column name, e.g. ‘a.b.c’ vs ‘`a.b.c`’
 * Column of special types, e.g. nested type;
 * Column containing special values, e.g. Null;

h3. 6. Multi column / column names
 * Empty input; e.g DataFrame.drop()
 * Special cases for each single column
 * Mix column with column names; e.g. DataFrame.drop(“col1”, df.col2, “col3”)
 * Duplicated columns; e.g. DataFrame.drop(“col1”, col(“col1”))

h3. 7. DataFrame argument
 * DataFrame argument
 * Empty dataframe; e.g. spark.range(5).limit(0)
 * Dataframe with 0 columns, e.g. spark.range(5).drop('id')
 * Dataset with repeated arguments
 * Local dataset (pd.DataFrame) containing unsupported datatype


> Add a dev utility to generate PySpark tests with LLM
> 
>
> Key: SPARK-44546
> URL: https://issues.apache.org/jira/browse/SPARK-44546
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Amanda Liu
>Priority: Major
>
> h2. Summary
> This ticket adds a dev utility script to help generate PySpark tests using 
> LLM response. The purpose of this experimental script is 

[jira] [Created] (SPARK-44548) Add support for pandas DataFrame assertDataFrameEqual

2023-07-25 Thread Amanda Liu (Jira)
Amanda Liu created SPARK-44548:
--

 Summary: Add support for pandas DataFrame assertDataFrameEqual
 Key: SPARK-44548
 URL: https://issues.apache.org/jira/browse/SPARK-44548
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Amanda Liu
Assignee: Amanda Liu
 Fix For: 3.5.0, 4.0.0


SPIP: 
https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v






[jira] [Updated] (SPARK-44546) Add a dev utility to generate PySpark tests with LLM

2023-07-25 Thread Amanda Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amanda Liu updated SPARK-44546:
---
Description: 
h2. Summary

This ticket adds a dev utility script to help generate PySpark tests using LLM 
response. The purpose of this experimental script is to encourage PySpark 
developers to test their code thoroughly, to avoid introducing regressions in 
the codebase. Historically, PySpark has had code regressions due to 
insufficient testing of public APIs (see 
[https://databricks.atlassian.net/browse/ES-705815]).

Below, we outline some common edge case scenarios for PySpark DataFrame APIs, 
from the perspective of arguments. Many of these edge cases are passed into the 
LLM through the script's base prompt.

Please note that this list is not exhaustive, but rather a starting point. Some 
of these cases may not apply, depending on the situation. We encourage all 
PySpark developers to carefully consider edge case scenarios when writing tests.
h2. Table of Contents
 # None
 # Ints
 # Floats
 # Strings
 # Single column / column name
 # Multi column / column names
 # DataFrame argument

h3. 1. None
 * Empty input
 * None type

h3. 2. Ints
 * Negatives
 * 0
 * value > Int.MaxValue
 * value < Int.MinValue

h3. 3. Floats
 * Negatives
 * 0.0
 * Float("nan")
 * Float("inf")
 * Float("-inf")
 * decimal.Decimal
 * numpy.float16

h3. 4. Strings
 * Special characters
 * Spaces
 * Empty strings

h3. 5. Single column / column name
 * Non-existent column
 * Empty column name
 * Column name with special characters, e.g. dots
 * Multiple columns with the same name
 * Nested column vs. quoted column name, e.g. 'a.b.c' vs '`a.b.c`'
 * Column of special types, e.g. nested type
 * Column containing special values, e.g. null

h3. 6. Multi column / column names
 * Empty input, e.g. DataFrame.drop()
 * Special cases for each single column
 * Mix of columns and column names, e.g. DataFrame.drop("col1", df.col2, "col3")
 * Duplicated columns, e.g. DataFrame.drop("col1", col("col1"))

h3. 7. DataFrame argument
 * DataFrame argument
 * Empty DataFrame, e.g. spark.range(5).limit(0)
 * DataFrame with 0 columns, e.g. spark.range(5).drop('id')
 * Dataset with repeated arguments
 * Local dataset (pd.DataFrame) containing an unsupported datatype

  was:
h2. Summary

This ticket adds a dev utility script to help generate PySpark tests using LLM 
response. The purpose of this experimental script is to encourage PySpark 
developers to test their code thoroughly, to avoid introducing regressions in 
the codebase. 

Below, we outline some common edge case scenarios for PySpark DataFrame APIs, 
from the perspective of arguments. Many of these edge cases are passed into the 
LLM through the script's base prompt.

Please note that this list is not exhaustive, but rather a starting point. Some 
of these cases may not apply, depending on the situation. We encourage all 
PySpark developers to carefully consider edge case scenarios when writing tests.
h2. Table of Contents
 # None
 # Ints
 # Floats
 # Strings
 # Single column / column name
 # Multi column / column names
 # DataFrame argument

h3. 1. None
 * Empty input
 * None type

h3. 2. Ints
 * Negatives
 * 0
 * value > Int.MaxValue
 * value < Int.MinValue

h3. 3. Floats
 * Negatives
 * 0.0
 * Float(“nan”)
 * Float("inf")
 * Float("-inf")
 * decimal.Decimal
 * numpy.float16

h3. 4. Strings
 * Special characters
 * Spaces
 * Empty strings

h3. 5. Single column / column name
 * Non-existent column
 * Empty column name
 * Column name with special characters, e.g. dots
 * Multi columns with the same name
 * Nested column vs. quoted column name, e.g. ‘a.b.c’ vs ‘`a.b.c`’
 * Column of special types, e.g. nested type;
 * Column containing special values, e.g. Null;

h3. 6. Multi column / column names
 * Empty input; e.g DataFrame.drop()
 * Special cases for each single column
 * Mix column with column names; e.g. DataFrame.drop(“col1”, df.col2, “col3”)
 * Duplicated columns; e.g. DataFrame.drop(“col1”, col(“col1”))

h3. 7. DataFrame argument
 * DataFrame argument
 * Empty dataframe; e.g. spark.range(5).limit(0)
 * Dataframe with 0 columns, e.g. spark.range(5).drop('id')
 * Dataset with repeated arguments
 * Local dataset (pd.DataFrame) containing unsupported datatype


> Add a dev utility to generate PySpark tests with LLM
> 
>
> Key: SPARK-44546
> URL: https://issues.apache.org/jira/browse/SPARK-44546
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Amanda Liu
>Priority: Major
>
> h2. Summary
> This ticket adds a dev utility script to help generate PySpark tests using 
> LLM response. The purpose of this experimental script is to encourage PySpark 
> developers to test their code thoroughly, to avoid introducing 

[jira] [Updated] (SPARK-44546) Add a dev utility to generate PySpark tests with LLM

2023-07-25 Thread Amanda Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amanda Liu updated SPARK-44546:
---
Description: 
h2. Summary

This ticket adds a dev utility script to help generate PySpark tests using LLM 
response. The purpose of this experimental script is to encourage PySpark 
developers to test their code thoroughly, to avoid introducing regressions in 
the codebase. 

Below, we outline some common edge case scenarios for PySpark DataFrame APIs, 
from the perspective of arguments. Many of these edge cases are passed into the 
LLM through the script's base prompt.

Please note that this list is not exhaustive, but rather a starting point. Some 
of these cases may not apply, depending on the situation. We encourage all 
PySpark developers to carefully consider edge case scenarios when writing tests.
h2. Table of Contents
 # None
 # Ints
 # Floats
 # Strings
 # Single column / column name
 # Multi column / column names
 # DataFrame argument

h3. 1. None
 * Empty input
 * None type

h3. 2. Ints
 * Negatives
 * 0
 * value > Int.MaxValue
 * value < Int.MinValue

h3. 3. Floats
 * Negatives
 * 0.0
 * Float("nan")
 * Float("inf")
 * Float("-inf")
 * decimal.Decimal
 * numpy.float16

h3. 4. Strings
 * Special characters
 * Spaces
 * Empty strings

h3. 5. Single column / column name
 * Non-existent column
 * Empty column name
 * Column name with special characters, e.g. dots
 * Multiple columns with the same name
 * Nested column vs. quoted column name, e.g. 'a.b.c' vs '`a.b.c`'
 * Column of special types, e.g. nested type
 * Column containing special values, e.g. null

h3. 6. Multi column / column names
 * Empty input, e.g. DataFrame.drop()
 * Special cases for each single column
 * Mix of columns and column names, e.g. DataFrame.drop("col1", df.col2, "col3")
 * Duplicated columns, e.g. DataFrame.drop("col1", col("col1"))

h3. 7. DataFrame argument
 * DataFrame argument
 * Empty DataFrame, e.g. spark.range(5).limit(0)
 * DataFrame with 0 columns, e.g. spark.range(5).drop('id')
 * Dataset with repeated arguments
 * Local dataset (pd.DataFrame) containing an unsupported datatype

  was:
h2. Summary

This ticket adds a dev utility script to help generate PySpark tests using LLM 
response. The purpose of this experimental script is to encourage PySpark 
developers to test their code thoroughly, to avoid introducing regressions in 
the codebase. 

Below, we outline some common edge case scenarios for PySpark DataFrame APIs, 
from the perspective of arguments. Many of these edge cases are passed into the 
LLM through the script's base prompt.

Please note that this list is not exhaustive, but rather a starting point. Some 
of these cases may not apply, depending on the situation. We encourage all 
PySpark developers to carefully consider edge case scenarios when writing tests.
h2. Table of Contents
 # None
 # Ints
 # Floats
 # String
 # Single column / column name
 # Multi column / column names
 # DataFrame argument

h3. 1. None
 * Empty input
 * None type

h3. 2. Ints
 * Negatives
 * 0
 * value > Int.MaxValue
 * value < Int.MinValue

h3. 3. Floats
 * Negatives
 * 0.0
 * Float(“nan”)
 * Float("inf")
 * Float("-inf")
 * decimal.Decimal
 * numpy.float16

h3. 4. String
 * Special characters
 * Spaces
 * Empty strings

h3. 5. Single column / column name
 * Non-existent column
 * Empty column name
 * Column name with special characters, e.g. dots
 * Multi columns with the same name
 * Nested column vs. quoted column name, e.g. ‘a.b.c’ vs ‘`a.b.c`’
 * Column of special types, e.g. nested type;
 * Column containing special values, e.g. Null;

h3. 6. Multi column / column names
 * Empty input; e.g DataFrame.drop()
 * Special cases for each single column
 * Mix column with column names; e.g. DataFrame.drop(“col1”, df.col2, “col3”)
 * Duplicated columns; e.g. DataFrame.drop(“col1”, col(“col1”))

h3. 7. DataFrame argument
 * DataFrame argument
 * Empty dataframe; e.g. spark.range(5).limit(0)
 * Dataframe with 0 columns, e.g. spark.range(5).drop('id')
 * Dataset with repeated arguments
 * Local dataset (pd.DataFrame) containing unsupported datatype


> Add a dev utility to generate PySpark tests with LLM
> 
>
> Key: SPARK-44546
> URL: https://issues.apache.org/jira/browse/SPARK-44546
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Amanda Liu
>Priority: Major
>
> h2. Summary
> This ticket adds a dev utility script to help generate PySpark tests using 
> LLM response. The purpose of this experimental script is to encourage PySpark 
> developers to test their code thoroughly, to avoid introducing regressions in 
> the codebase. 
> Below, we outline some common edge case scenarios for PySpark DataFrame APIs, 
> from the perspective of arguments. Many 

[jira] [Updated] (SPARK-44546) Add a dev utility to generate PySpark tests with LLM

2023-07-25 Thread Amanda Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amanda Liu updated SPARK-44546:
---
Description: 
h2. Summary

This ticket adds a dev utility script to help generate PySpark tests using LLM 
response. The purpose of this experimental script is to encourage PySpark 
developers to test their code thoroughly, to avoid introducing regressions in 
the codebase. 

Below, we outline some common edge case scenarios for PySpark DataFrame APIs, 
from the perspective of arguments. Many of these edge cases are passed into the 
LLM through the script's base prompt.

Please note that this list is not exhaustive, but rather a starting point. Some 
of these cases may not apply, depending on the situation. We encourage all 
PySpark developers to carefully consider edge case scenarios when writing tests.
h2. Table of Contents
 # None
 # Ints
 # Floats
 # String
 # Single column / column name
 # Multi column / column names
 # DataFrame argument

h3. 1. None
 * Empty input
 * None type

h3. 2. Ints
 * Negatives
 * 0
 * value > Int.MaxValue
 * value < Int.MinValue

h3. 3. Floats
 * Negatives
 * 0.0
 * Float("nan")
 * Float("inf")
 * Float("-inf")
 * decimal.Decimal
 * numpy.float16

h3. 4. String
 * Special characters
 * Spaces
 * Empty strings

h3. 5. Single column / column name
 * Non-existent column
 * Empty column name
 * Column name with special characters, e.g. dots
 * Multiple columns with the same name
 * Nested column vs. quoted column name, e.g. 'a.b.c' vs '`a.b.c`'
 * Column of special types, e.g. nested type
 * Column containing special values, e.g. null

h3. 6. Multi column / column names
 * Empty input, e.g. DataFrame.drop()
 * Special cases for each single column
 * Mix of columns and column names, e.g. DataFrame.drop("col1", df.col2, "col3")
 * Duplicated columns, e.g. DataFrame.drop("col1", col("col1"))

h3. 7. DataFrame argument
 * DataFrame argument
 * Empty DataFrame, e.g. spark.range(5).limit(0)
 * DataFrame with 0 columns, e.g. spark.range(5).drop('id')
 * Dataset with repeated arguments
 * Local dataset (pd.DataFrame) containing an unsupported datatype

  was:
h2. Summary

This ticket adds a dev utility script to help generate PySpark tests using LLM 
response. The purpose of this experimental script is to encourage PySpark 
developers to test their code thoroughly, to avoid introducing regressions in 
the codebase. 

Below, we outline some common edge case scenarios for PySpark DataFrame APIs, 
from the perspective of arguments. Many of these edge cases are passed into the 
LLM through the script's base prompt.

Please note that this list is not exhaustive, but rather a starting point. Some 
of these cases may not apply, depending on the situation. We encourage all 
PySpark developers to carefully consider edge case scenarios when writing tests.
h2. Table of Contents
 # None
 # Ints
 # Floats
 # Single column / column name
 # Multi column / column names
 # DataFrame argument

h3. 1. None
 * Empty input
 * None type

h3. 2. Ints
 * Negatives
 * 0
 * value > Int.MaxValue
 * value < Int.MinValue

h3. 3. Floats
 * Negatives
 * 0.0
 * Float(“nan”)
 * Float("inf")
 * Float("-inf")
 * decimal.Decimal
 * numpy.float16

h3. 4. String
 * Special characters
 * Spaces
 * Empty strings

h3. 5. Single column / column name
 * Non-existent column
 * Empty column name
 * Column names with special characters, e.g. dots
 * Multiple columns with the same name
 * Nested column vs. quoted column name, e.g. 'a.b.c' vs. '`a.b.c`'
 * Columns of special types, e.g. nested types
 * Columns containing special values, e.g. null

h3. 6. Multi column / column names
 * Empty input, e.g. DataFrame.drop()
 * Special cases for each single column
 * Mixed Column objects and column names, e.g. DataFrame.drop("col1", df.col2, "col3")
 * Duplicated columns, e.g. DataFrame.drop("col1", col("col1"))

h3. 7. DataFrame argument
 * DataFrame argument
 * Empty DataFrame, e.g. spark.range(5).limit(0)
 * DataFrame with 0 columns, e.g. spark.range(5).drop('id')
 * Dataset with repeated arguments
 * Local dataset (pd.DataFrame) containing an unsupported datatype


> Add a dev utility to generate PySpark tests with LLM
> 
>
> Key: SPARK-44546
> URL: https://issues.apache.org/jira/browse/SPARK-44546
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Amanda Liu
>Priority: Major
>
> h2. Summary
> This ticket adds a dev utility script to help generate PySpark tests using 
> LLM response. The purpose of this experimental script is to encourage PySpark 
> developers to test their code thoroughly, to avoid introducing regressions in 
> the codebase. 
> Below, we outline some common edge case scenarios for PySpark DataFrame APIs, 
> from the perspective of arguments. Many of these edge cases are passed into 
> the LLM through the script's base prompt.

[jira] [Updated] (SPARK-44546) Add a dev utility to generate PySpark tests with LLM

2023-07-25 Thread Amanda Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amanda Liu updated SPARK-44546:
---
Description: 
h2. Summary

This ticket adds a dev utility script to help generate PySpark tests using LLM 
response. The purpose of this experimental script is to encourage PySpark 
developers to test their code thoroughly, to avoid introducing regressions in 
the codebase. 

Below, we outline some common edge case scenarios for PySpark DataFrame APIs, 
from the perspective of arguments. Many of these edge cases are passed into the 
LLM through the script's base prompt.

Please note that this list is not exhaustive, but rather a starting point. Some 
of these cases may not apply, depending on the situation. We encourage all 
PySpark developers to carefully consider edge case scenarios when writing tests.
h2. Table of Contents
 # None
 # Ints
 # Floats
 # Single column / column name
 # Multi column / column names
 # DataFrame argument

h3. 1. None
 * Empty input
 * None type

h3. 2. Ints
 * Negatives
 * 0
 * value > Int.MaxValue
 * value < Int.MinValue

h3. 3. Floats
 * Negatives
 * 0.0
 * Float("nan")
 * Float("inf")
 * Float("-inf")
 * decimal.Decimal
 * numpy.float16

h3. 4. String
 * Special characters
 * Spaces
 * Empty strings

h3. 5. Single column / column name
 * Non-existent column
 * Empty column name
 * Column names with special characters, e.g. dots
 * Multiple columns with the same name
 * Nested column vs. quoted column name, e.g. 'a.b.c' vs. '`a.b.c`'
 * Columns of special types, e.g. nested types
 * Columns containing special values, e.g. null

h3. 6. Multi column / column names
 * Empty input, e.g. DataFrame.drop()
 * Special cases for each single column
 * Mixed Column objects and column names, e.g. DataFrame.drop("col1", df.col2, "col3")
 * Duplicated columns, e.g. DataFrame.drop("col1", col("col1"))

h3. 7. DataFrame argument
 * DataFrame argument
 * Empty DataFrame, e.g. spark.range(5).limit(0)
 * DataFrame with 0 columns, e.g. spark.range(5).drop('id')
 * Dataset with repeated arguments
 * Local dataset (pd.DataFrame) containing an unsupported datatype

> Add a dev utility to generate PySpark tests with LLM
> 
>
> Key: SPARK-44546
> URL: https://issues.apache.org/jira/browse/SPARK-44546
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Amanda Liu
>Priority: Major
>
> h2. Summary
> This ticket adds a dev utility script to help generate PySpark tests using 
> LLM response. The purpose of this experimental script is to encourage PySpark 
> developers to test their code thoroughly, to avoid introducing regressions in 
> the codebase. 
> Below, we outline some common edge case scenarios for PySpark DataFrame APIs, 
> from the perspective of arguments. Many of these edge cases are passed into 
> the LLM through the script's base prompt.
> Please note that this list is not exhaustive, but rather a starting point. 
> Some of these cases may not apply, depending on the situation. We encourage 
> all PySpark developers to carefully consider edge case scenarios when writing 
> tests.
> h2. Table of Contents
>  # None
>  # Ints
>  # Floats
>  # Single column / column name
>  # Multi column / column names
>  # DataFrame argument
> h3. 1. None
>  * Empty input
>  * None type
> h3. 2. Ints
>  * Negatives
>  * 0
>  * value > Int.MaxValue
>  * value < Int.MinValue
> h3. 3. Floats
>  * Negatives
>  * 0.0
>  * Float("nan")
>  * Float("inf")
>  * Float("-inf")
>  * decimal.Decimal
>  * numpy.float16
> h3. 4. String
>  * Special characters
>  * Spaces
>  * Empty strings
> h3. 5. Single column / column name
>  * Non-existent column
>  * Empty column name
>  * Column names with special characters, e.g. dots
>  * Multiple columns with the same name
>  * Nested column vs. quoted column name, e.g. 'a.b.c' vs. '`a.b.c`'
>  * Columns of special types, e.g. nested types
>  * Columns containing special values, e.g. null
> h3. 6. Multi column / column names
>  * Empty input, e.g. DataFrame.drop()
>  * Special cases for each single column
>  * Mixed Column objects and column names, e.g. DataFrame.drop("col1", df.col2, "col3")
>  * Duplicated columns, e.g. DataFrame.drop("col1", col("col1"))
> h3. 7. DataFrame argument
>  * DataFrame argument
>  * Empty DataFrame, e.g. spark.range(5).limit(0)
>  * DataFrame with 0 columns, e.g. spark.range(5).drop('id')
>  * Dataset with repeated arguments
>  * Local dataset (pd.DataFrame) containing an unsupported datatype



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44546) Add a dev utility to generate PySpark tests with LLM

2023-07-25 Thread Amanda Liu (Jira)
Amanda Liu created SPARK-44546:
--

 Summary: Add a dev utility to generate PySpark tests with LLM
 Key: SPARK-44546
 URL: https://issues.apache.org/jira/browse/SPARK-44546
 Project: Spark
  Issue Type: New Feature
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Amanda Liu









[jira] [Updated] (SPARK-44061) Add assertDataFrameEqual util function

2023-07-17 Thread Amanda Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amanda Liu updated SPARK-44061:
---
Summary: Add assertDataFrameEqual util function  (was: Add 
assertDataFrameEquality util function)

> Add assertDataFrameEqual util function
> --
>
> Key: SPARK-44061
> URL: https://issues.apache.org/jira/browse/SPARK-44061
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Amanda Liu
>Assignee: Amanda Liu
>Priority: Major
> Fix For: 3.5.0
>
>
> SPIP: 
> https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v






[jira] [Created] (SPARK-44453) Use difflib to display errors in assertDataFrameEqual

2023-07-16 Thread Amanda Liu (Jira)
Amanda Liu created SPARK-44453:
--

 Summary: Use difflib to display errors in assertDataFrameEqual
 Key: SPARK-44453
 URL: https://issues.apache.org/jira/browse/SPARK-44453
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Amanda Liu


SPIP: 
https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v






[jira] [Created] (SPARK-44446) Add checks for expected list type special cases

2023-07-16 Thread Amanda Liu (Jira)
Amanda Liu created SPARK-44446:
--

 Summary: Add checks for expected list type special cases
 Key: SPARK-44446
 URL: https://issues.apache.org/jira/browse/SPARK-44446
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Amanda Liu


SPIP: 
https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v






[jira] [Created] (SPARK-44413) Clarify error for unsupported arg data type in assertDataFrameEqual

2023-07-13 Thread Amanda Liu (Jira)
Amanda Liu created SPARK-44413:
--

 Summary: Clarify error for unsupported arg data type in 
assertDataFrameEqual
 Key: SPARK-44413
 URL: https://issues.apache.org/jira/browse/SPARK-44413
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Amanda Liu


SPIP: 
https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v






[jira] [Updated] (SPARK-44216) Make assertSchemaEqual API public

2023-07-13 Thread Amanda Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amanda Liu updated SPARK-44216:
---
Summary: Make assertSchemaEqual API public  (was: Make assertSchemaEqual 
API with ignore_nullable optional flag)

> Make assertSchemaEqual API public
> -
>
> Key: SPARK-44216
> URL: https://issues.apache.org/jira/browse/SPARK-44216
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Amanda Liu
>Priority: Major
>
> SPIP: 
> https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v






[jira] [Created] (SPARK-44397) Expose assertDataFrameEqual in pyspark.testing.utils

2023-07-12 Thread Amanda Liu (Jira)
Amanda Liu created SPARK-44397:
--

 Summary: Expose assertDataFrameEqual in pyspark.testing.utils
 Key: SPARK-44397
 URL: https://issues.apache.org/jira/browse/SPARK-44397
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Amanda Liu


SPIP: 
https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v






[jira] [Updated] (SPARK-44217) Allow custom precision for fp approx equality

2023-07-11 Thread Amanda Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amanda Liu updated SPARK-44217:
---
Summary: Allow custom precision for fp approx equality  (was: Add 
assert_approx_df_equality util function)

> Allow custom precision for fp approx equality
> -
>
> Key: SPARK-44217
> URL: https://issues.apache.org/jira/browse/SPARK-44217
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Amanda Liu
>Priority: Major
>
> SPIP: 
> https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v






[jira] [Updated] (SPARK-44216) Add assertSchemaEqual API with ignore_nullable optional flag

2023-07-10 Thread Amanda Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amanda Liu updated SPARK-44216:
---
Summary: Add assertSchemaEqual API with ignore_nullable optional flag  
(was: Add improved error message formatting for assert_df_equality)

> Add assertSchemaEqual API with ignore_nullable optional flag
> 
>
> Key: SPARK-44216
> URL: https://issues.apache.org/jira/browse/SPARK-44216
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Amanda Liu
>Priority: Major
>
> SPIP: 
> https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v






[jira] [Updated] (SPARK-44216) Make assertSchemaEqual API with ignore_nullable optional flag

2023-07-10 Thread Amanda Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amanda Liu updated SPARK-44216:
---
Summary: Make assertSchemaEqual API with ignore_nullable optional flag  
(was: Add assertSchemaEqual API with ignore_nullable optional flag)

> Make assertSchemaEqual API with ignore_nullable optional flag
> -
>
> Key: SPARK-44216
> URL: https://issues.apache.org/jira/browse/SPARK-44216
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Amanda Liu
>Priority: Major
>
> SPIP: 
> https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v






[jira] [Updated] (SPARK-44363) Display percent of unequal rows in DataFrame comparison

2023-07-10 Thread Amanda Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amanda Liu updated SPARK-44363:
---
Summary: Display percent of unequal rows in DataFrame comparison  (was: 
Display percent of unequal rows in dataframe comparison)

> Display percent of unequal rows in DataFrame comparison
> ---
>
> Key: SPARK-44363
> URL: https://issues.apache.org/jira/browse/SPARK-44363
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Amanda Liu
>Priority: Major
>
> SPIP: 
> https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v






[jira] [Updated] (SPARK-44061) Add assertDataFrameEquality util function

2023-07-10 Thread Amanda Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amanda Liu updated SPARK-44061:
---
Summary: Add assertDataFrameEquality util function  (was: Add 
assert_df_equality util function)

> Add assertDataFrameEquality util function
> -
>
> Key: SPARK-44061
> URL: https://issues.apache.org/jira/browse/SPARK-44061
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Amanda Liu
>Assignee: Amanda Liu
>Priority: Major
> Fix For: 3.5.0
>
>
> SPIP: 
> https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v






[jira] [Created] (SPARK-44364) Support List[Row] data type for expected DataFrame argument

2023-07-10 Thread Amanda Liu (Jira)
Amanda Liu created SPARK-44364:
--

 Summary: Support List[Row] data type for expected DataFrame 
argument
 Key: SPARK-44364
 URL: https://issues.apache.org/jira/browse/SPARK-44364
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Amanda Liu


SPIP: 
https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v






[jira] [Created] (SPARK-44363) Display percent of unequal rows in dataframe comparison

2023-07-10 Thread Amanda Liu (Jira)
Amanda Liu created SPARK-44363:
--

 Summary: Display percent of unequal rows in dataframe comparison
 Key: SPARK-44363
 URL: https://issues.apache.org/jira/browse/SPARK-44363
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Amanda Liu


SPIP: 
https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v






[jira] [Created] (SPARK-44357) Add pyspark_testing module for GHA tests

2023-07-10 Thread Amanda Liu (Jira)
Amanda Liu created SPARK-44357:
--

 Summary: Add pyspark_testing module for GHA tests
 Key: SPARK-44357
 URL: https://issues.apache.org/jira/browse/SPARK-44357
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Amanda Liu


SPIP: 
https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v






[jira] [Created] (SPARK-44218) Add improved error message formatting for assert_approx_df_equality

2023-06-27 Thread Amanda Liu (Jira)
Amanda Liu created SPARK-44218:
--

 Summary: Add improved error message formatting for 
assert_approx_df_equality
 Key: SPARK-44218
 URL: https://issues.apache.org/jira/browse/SPARK-44218
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Amanda Liu


SPIP: 
https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v






[jira] [Created] (SPARK-44217) Add assert_approx_df_equality util function

2023-06-27 Thread Amanda Liu (Jira)
Amanda Liu created SPARK-44217:
--

 Summary: Add assert_approx_df_equality util function
 Key: SPARK-44217
 URL: https://issues.apache.org/jira/browse/SPARK-44217
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Amanda Liu


SPIP: 
https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v






[jira] [Created] (SPARK-44216) Add improved error message formatting for assert_df_equality

2023-06-27 Thread Amanda Liu (Jira)
Amanda Liu created SPARK-44216:
--

 Summary: Add improved error message formatting for 
assert_df_equality
 Key: SPARK-44216
 URL: https://issues.apache.org/jira/browse/SPARK-44216
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Amanda Liu


SPIP: 
https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v






[jira] [Commented] (SPARK-44042) SPIP: PySpark Test Framework

2023-06-21 Thread Amanda Liu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17735747#comment-17735747
 ] 

Amanda Liu commented on SPARK-44042:


[~ste...@apache.org] Thank you for the comment! I agree that test output 
messages are critical here. Thanks also for the hadoop-api-shim example; that's 
helpful to look at.

> SPIP: PySpark Test Framework
> 
>
> Key: SPARK-44042
> URL: https://issues.apache.org/jira/browse/SPARK-44042
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.5.0
>Reporter: Amanda Liu
>Priority: Major
>
> Currently, there's no official PySpark test framework, but only various 
> open-source repos and blog posts. Many of these open-source resources are 
> very popular, which demonstrates user-demand for PySpark testing 
> capabilities. 
> [spark-testing-base|https://github.com/holdenk/spark-testing-base] has 1.4k 
> stars, and [chispa|https://github.com/MrPowers/chispa] has 532k 
> downloads/month. However, it can be confusing for users to piece together 
> disparate resources to write their own PySpark tests (see [The Elephant in 
> the Room: How to Write PySpark 
> Tests|https://towardsdatascience.com/the-elephant-in-the-room-how-to-write-pyspark-unit-tests-a5073acabc34]).
>  We can streamline and simplify the testing process by incorporating test 
> features, such as a PySpark Test Base class (which allows tests to share 
> Spark sessions) and test util functions (for example, asserting dataframe and 
> schema equality). Please see the full SPIP document attached: 
> [https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v].






[jira] [Created] (SPARK-44062) Add PySparkTestBase unit test class

2023-06-14 Thread Amanda Liu (Jira)
Amanda Liu created SPARK-44062:
--

 Summary: Add PySparkTestBase unit test class
 Key: SPARK-44062
 URL: https://issues.apache.org/jira/browse/SPARK-44062
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Amanda Liu


SPIP: 
https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v






[jira] [Created] (SPARK-44061) Add assert_df_equality util function

2023-06-14 Thread Amanda Liu (Jira)
Amanda Liu created SPARK-44061:
--

 Summary: Add assert_df_equality util function
 Key: SPARK-44061
 URL: https://issues.apache.org/jira/browse/SPARK-44061
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Amanda Liu


SPIP: 
https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v






[jira] [Created] (SPARK-44042) SPIP: PySpark Test Framework

2023-06-13 Thread Amanda Liu (Jira)
Amanda Liu created SPARK-44042:
--

 Summary: SPIP: PySpark Test Framework
 Key: SPARK-44042
 URL: https://issues.apache.org/jira/browse/SPARK-44042
 Project: Spark
  Issue Type: New Feature
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Amanda Liu


Currently, there's no official PySpark test framework, but only various 
open-source repos and blog posts. Many of these open-source resources are very 
popular, which demonstrates user-demand for PySpark testing capabilities. 
[spark-testing-base|https://github.com/holdenk/spark-testing-base] has 1.4k 
stars, and [chispa|https://github.com/MrPowers/chispa] has 532k 
downloads/month. However, it can be confusing for users to piece together 
disparate resources to write their own PySpark tests (see [The Elephant in the 
Room: How to Write PySpark 
Tests|https://towardsdatascience.com/the-elephant-in-the-room-how-to-write-pyspark-unit-tests-a5073acabc34]).
 We can streamline and simplify the testing process by incorporating test 
features, such as a PySpark Test Base class (which allows tests to share Spark 
sessions) and test util functions (for example, asserting dataframe and schema 
equality). Please see the full SPIP document attached: 
[https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v].






[jira] [Created] (SPARK-43940) Write test for CANNOT_FIND_BATCH (Prev _LEGACY_ERROR_TEMP_2132)

2023-06-01 Thread Amanda Liu (Jira)
Amanda Liu created SPARK-43940:
--

 Summary: Write test for CANNOT_FIND_BATCH (Prev 
_LEGACY_ERROR_TEMP_2132)
 Key: SPARK-43940
 URL: https://issues.apache.org/jira/browse/SPARK-43940
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Amanda Liu





