[jira] [Updated] (SPARK-49145) Improve readability of log4j console log output
[ https://issues.apache.org/jira/browse/SPARK-49145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amanda Liu updated SPARK-49145: --- Description: Prior to this update, the OSS Spark logs were difficult to interpret. The logs followed a JSON output format which is not optimal for human consumption: {code:java} {"ts":"2024-07-26T18:45:17.712Z","level":"INFO","msg":"Running Spark version 4.0.0-SNAPSHOT","context":{"spark_version":"4.0.0-SNAPSHOT"},"logger":"SparkContext"}{"ts":"2024-07-26T18:45:17.715Z","level":"INFO","msg":"OS info Mac OS X, 14.4.1, aarch64","context":{"os_arch":"aarch64","os_name":"Mac OS X","os_version":"14.4.1"},"logger":"SparkContext"}{"ts":"2024-07-26T18:45:17.716Z","level":"INFO","msg":"Java version 17.0.11","context":{"java_version":"17.0.11"},"logger":"SparkContext"}{"ts":"2024-07-26T18:45:17.761Z","level":"WARN","msg":"Unable to load native-hadoop library for your platform... using builtin-java classes where applicable","logger":"NativeCodeLoader"}{"ts":"2024-07-26T18:45:17.783Z","level":"INFO","msg":"==","logger":"ResourceUtils"}{"ts":"2024-07-26T18:45:17.783Z","level":"INFO","msg":"No custom resources configured for spark.driver.","logger":"ResourceUtils"}{"ts":"2024-07-26T18:45:17.783Z","level":"INFO","msg":"==","logger":"ResourceUtils"}{"ts":"2024-07-26T18:45:17.784Z","level":"INFO","msg":"Submitted application: Spark Pi","context":{"app_name":"Spark Pi"},"logger":"SparkContext"}...{"ts":"2024-07-26T18:45:18.036Z","level":"INFO","msg":"Start Jetty 0.0.0.0:4040 for SparkUI","context":{"host":"0.0.0.0","port":"4040","server_name":"SparkUI"},"logger":"JettyUtils"}{"ts":"2024-07-26T18:45:18.044Z","level":"INFO","msg":"jetty-11.0.20; built: 2024-01-29T21:04:22.394Z; git: 922f8dc188f7011e60d0361de585fd4ac4d63064; jvm 17.0.11+9-LTS","logger":"Server"}{"ts":"2024-07-26T18:45:18.054Z","level":"INFO","msg":"Started Server@22c75c01{STARTING}[11.0.20,sto=3] @1114ms","logger":"Server"} {code} This issue updates the default 
`log4j.properties.template` with the following improvements for console logging format: * Use PatternLayout for improved human readability * Color-code log levels to simplify logging output * Visually partition the threadName and contextInfo for easy interpretation was: Prior to this update, the OSS Spark logs were difficult to interpret. The logs followed a JSON output format which is not optimal for human consumption: {code:java} {"ts":"2024-07-26T18:45:17.712Z","level":"INFO","msg":"Running Spark version 4.0.0-SNAPSHOT","context":{"spark_version":"4.0.0-SNAPSHOT"},"logger":"SparkContext"}{"ts":"2024-07-26T18:45:17.715Z","level":"INFO","msg":"OS info Mac OS X, 14.4.1, aarch64","context":{"os_arch":"aarch64","os_name":"Mac OS X","os_version":"14.4.1"},"logger":"SparkContext"}{"ts":"2024-07-26T18:45:17.716Z","level":"INFO","msg":"Java version 17.0.11","context":{"java_version":"17.0.11"},"logger":"SparkContext"}{"ts":"2024-07-26T18:45:17.761Z","level":"WARN","msg":"Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable","logger":"NativeCodeLoader"}{"ts":"2024-07-26T18:45:17.783Z","level":"INFO","msg":"==","logger":"ResourceUtils"}{"ts":"2024-07-26T18:45:17.783Z","level":"INFO","msg":"No custom resources configured for spark.driver.","logger":"ResourceUtils"}{"ts":"2024-07-26T18:45:17.783Z","level":"INFO","msg":"==","logger":"ResourceUtils"}{"ts":"2024-07-26T18:45:17.784Z","level":"INFO","msg":"Submitted application: Spark Pi","context":{"app_name":"Spark Pi"},"logger":"SparkContext"}...{"ts":"2024-07-26T18:45:18.036Z","level":"INFO","msg":"Start Jetty 0.0.0.0:4040 for SparkUI","context":{"host":"0.0.0.0","port":"4040","server_name":"SparkUI"},"logger":"JettyUtils"}{"ts":"2024-07-26T18:45:18.044Z","level":"INFO","msg":"jetty-11.0.20; built: 2024-01-29T21:04:22.394Z; git: 922f8dc188f7011e60d0361de585fd4ac4d63064; jvm 17.0.11+9-LTS","logger":"Server"}{"ts":"2024-07-26T18:45:18.054Z","level":"INFO","msg":"Started Server@22c75c01{STARTING}[11.0.20,sto=3] @1114ms","logger":"Server"} {code} This effort updates the default `log4j.properties.template` with the following improvements for console logging format: * Use PatternLayout for improved human readability * Color-code log levels to simplify logging output * Visually partition the threadName and contextInfo for easy interpretation > Improve readability of log4j console log output > --- > > Key: SPARK-49145 > URL: https://issues.apache.org/jira/browse/SPARK-49145 > Project: Spark > Issue Type: Task >
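A console appender along the lines the ticket describes could be sketched as the following log4j2 properties fragment. This is an illustrative sketch only: the appender name, the exact conversion pattern, and the color handling are assumptions, not the actual contents of the updated `log4j.properties.template`.

```properties
# Sketch of a human-readable console appender using PatternLayout.
# %highlight{...} color-codes the log level; the thread name [%t] and
# the MDC context map %X are visually set off from the message body.
appender.console.type = Console
appender.console.name = console
appender.console.target = SYSTEM_ERR
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss} %highlight{%p} %c: [%t] %m %X%n

rootLogger.level = info
rootLogger.appenderRef.stdout.ref = console
```

With a pattern like this, the JSON records above would render as one colorized line per event instead of a stream of JSON objects.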
[jira] [Created] (SPARK-49145) Improve readability of log4j console log output
Amanda Liu created SPARK-49145: -- Summary: Improve readability of log4j console log output Key: SPARK-49145 URL: https://issues.apache.org/jira/browse/SPARK-49145 Project: Spark Issue Type: Task Components: Spark Core Affects Versions: 4.0.0 Reporter: Amanda Liu Prior to this update, the OSS Spark logs were difficult to interpret. The logs followed a JSON output format which is not optimal for human consumption: {code:java} {"ts":"2024-07-26T18:45:17.712Z","level":"INFO","msg":"Running Spark version 4.0.0-SNAPSHOT","context":{"spark_version":"4.0.0-SNAPSHOT"},"logger":"SparkContext"}{"ts":"2024-07-26T18:45:17.715Z","level":"INFO","msg":"OS info Mac OS X, 14.4.1, aarch64","context":{"os_arch":"aarch64","os_name":"Mac OS X","os_version":"14.4.1"},"logger":"SparkContext"}{"ts":"2024-07-26T18:45:17.716Z","level":"INFO","msg":"Java version 17.0.11","context":{"java_version":"17.0.11"},"logger":"SparkContext"}{"ts":"2024-07-26T18:45:17.761Z","level":"WARN","msg":"Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable","logger":"NativeCodeLoader"}{"ts":"2024-07-26T18:45:17.783Z","level":"INFO","msg":"==","logger":"ResourceUtils"}{"ts":"2024-07-26T18:45:17.783Z","level":"INFO","msg":"No custom resources configured for spark.driver.","logger":"ResourceUtils"}{"ts":"2024-07-26T18:45:17.783Z","level":"INFO","msg":"==","logger":"ResourceUtils"}{"ts":"2024-07-26T18:45:17.784Z","level":"INFO","msg":"Submitted application: Spark Pi","context":{"app_name":"Spark Pi"},"logger":"SparkContext"}...{"ts":"2024-07-26T18:45:18.036Z","level":"INFO","msg":"Start Jetty 0.0.0.0:4040 for SparkUI","context":{"host":"0.0.0.0","port":"4040","server_name":"SparkUI"},"logger":"JettyUtils"}{"ts":"2024-07-26T18:45:18.044Z","level":"INFO","msg":"jetty-11.0.20; built: 2024-01-29T21:04:22.394Z; git: 922f8dc188f7011e60d0361de585fd4ac4d63064; jvm 17.0.11+9-LTS","logger":"Server"}{"ts":"2024-07-26T18:45:18.054Z","level":"INFO","msg":"Started Server@22c75c01{STARTING}[11.0.20,sto=3] @1114ms","logger":"Server"} {code} This PR updates the default `log4j.properties.template` with the following improvements for console logging format: * Use PatternLayout for improved human readability * Color-code log levels to simplify logging output * Visually partition the threadName and contextInfo for easy interpretation -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48759) Add migration doc for CREATE TABLE AS SELECT behavior change since Spark 3.4
[ https://issues.apache.org/jira/browse/SPARK-48759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amanda Liu updated SPARK-48759: --- Summary: Add migration doc for CREATE TABLE AS SELECT behavior change since Spark 3.4 (was: Add migration doc for CREATE TABLE behavior change since Spark 3.4) > Add migration doc for CREATE TABLE AS SELECT behavior change > since Spark 3.4 > > > Key: SPARK-48759 > URL: https://issues.apache.org/jira/browse/SPARK-48759 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Amanda Liu >Priority: Minor >
[jira] [Created] (SPARK-48759) Add migration doc for CREATE TABLE behavior change since Spark 3.4
Amanda Liu created SPARK-48759: -- Summary: Add migration doc for CREATE TABLE behavior change since Spark 3.4 Key: SPARK-48759 URL: https://issues.apache.org/jira/browse/SPARK-48759 Project: Spark Issue Type: Task Components: Spark Core Affects Versions: 3.4.0 Reporter: Amanda Liu
[jira] [Created] (SPARK-48740) Catch missing window specification error early
Amanda Liu created SPARK-48740: -- Summary: Catch missing window specification error early Key: SPARK-48740 URL: https://issues.apache.org/jira/browse/SPARK-48740 Project: Spark Issue Type: Task Components: Spark Core Affects Versions: 4.0.0 Reporter: Amanda Liu Before, aggregate queries containing a window function without a window specification (e.g. `PARTITION BY`) would return a non-descriptive internal error message: `org.apache.spark.sql.catalyst.analysis.UnresolvedException: [INTERNAL_ERROR] Invalid call to exprId on unresolved object SQLSTATE: XX000` This PR catches the user error early and returns a more accurate description of the issue: `Window specification is not defined in the WINDOW clause.`
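As an illustration of the failing case, a query along the following lines (the table and window names are hypothetical, not taken from the ticket) references a named window that is never defined:

```sql
-- `w` is referenced in the OVER clause but never defined in a WINDOW clause,
-- so the window specification is missing.
SELECT id, SUM(value) OVER w AS running_total
FROM events;
-- Before this change: [INTERNAL_ERROR] Invalid call to exprId on unresolved object
-- After this change:  Window specification is not defined in the WINDOW clause.
```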
[jira] [Created] (SPARK-48676) Structured Logging Framework Scala Style Migration [Part 2]
Amanda Liu created SPARK-48676: -- Summary: Structured Logging Framework Scala Style Migration [Part 2] Key: SPARK-48676 URL: https://issues.apache.org/jira/browse/SPARK-48676 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 4.0.0 Reporter: Amanda Liu
[jira] [Resolved] (SPARK-48632) Remove unused LogKeys
[ https://issues.apache.org/jira/browse/SPARK-48632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amanda Liu resolved SPARK-48632. Resolution: Not A Problem > Remove unused LogKeys > - > > Key: SPARK-48632 > URL: https://issues.apache.org/jira/browse/SPARK-48632 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Amanda Liu >Priority: Major > Labels: pull-request-available > > Remove unused LogKey objects to clean up LogKey.scala
[jira] [Created] (SPARK-48632) Remove unused LogKeys
Amanda Liu created SPARK-48632: -- Summary: Remove unused LogKeys Key: SPARK-48632 URL: https://issues.apache.org/jira/browse/SPARK-48632 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 4.0.0 Reporter: Amanda Liu Remove unused LogKey objects to clean up LogKey.scala
[jira] [Created] (SPARK-48623) Structured Logging Framework Scala Style Migration
Amanda Liu created SPARK-48623: -- Summary: Structured Logging Framework Scala Style Migration Key: SPARK-48623 URL: https://issues.apache.org/jira/browse/SPARK-48623 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 4.0.0 Reporter: Amanda Liu
[jira] [Created] (SPARK-48592) Add scala style check for logging message inline variables
Amanda Liu created SPARK-48592: -- Summary: Add scala style check for logging message inline variables Key: SPARK-48592 URL: https://issues.apache.org/jira/browse/SPARK-48592 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 4.0.0 Reporter: Amanda Liu Ban logging messages using logInfo, logWarning, logError containing variables without {{MDC}}
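One way such a ban can be implemented is a regex rule in scalastyle-config.xml. The rule id, regex, and message below are illustrative assumptions, not the exact rule Spark adopted:

```xml
<!-- Illustrative scalastyle rule: flag logInfo/logWarning/logError calls that
     interpolate variables directly via s"..." instead of wrapping them in MDC.
     The regex is a rough sketch of the pattern to ban. -->
<check customId="logmessageinlinevariable" level="error"
       class="org.scalastyle.file.RegexChecker" enabled="true">
  <parameters>
    <parameter name="regex">log(Info|Warning|Error)\(s"</parameter>
  </parameters>
  <customMessage>
    Use the structured logging API: wrap variables in MDC, e.g.
    log"Lost executor ${MDC(EXECUTOR_ID, executorId)}"
    instead of s"Lost executor $executorId".
  </customMessage>
</check>
```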
[jira] [Created] (SPARK-46910) Eliminate JDK Requirement in PySpark Installation
Amanda Liu created SPARK-46910: -- Summary: Eliminate JDK Requirement in PySpark Installation Key: SPARK-46910 URL: https://issues.apache.org/jira/browse/SPARK-46910 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.5.0 Reporter: Amanda Liu PySpark requires users to have the correct JDK version (JDK 8+ for Spark<4; JDK 17+ for Spark>=4) installed locally. We can make the Spark installation script install the JDK, so users don’t need to do this step manually.
h1. Details
# When the entry point for a Spark class is invoked, the spark-class script checks if Java is installed in the user environment.
# If Java is not installed, the user is prompted to select whether they want to install JDK 17.
# If the user selects yes, JDK 17 is installed (using the [install-jdk library|https://pypi.org/project/install-jdk/]) and the JAVA_HOME and RUNNER variables are set appropriately. The Spark build will now work!
# If the user selects no, we provide a brief description of how to install the JDK manually.
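Step 1 of the flow above (detecting whether Java is already present before prompting) can be sketched in Python with only the standard library. The function name is illustrative; the real check lives in the spark-class launcher script:

```python
import os
import shutil
from typing import Optional


def find_java() -> Optional[str]:
    """Return a path to a `java` executable, or None when no JDK is detected.

    Mirrors step 1 of the proposal: honor JAVA_HOME when it is set,
    otherwise fall back to whatever `java` is first on PATH.
    """
    java_home = os.environ.get("JAVA_HOME")
    if java_home:
        candidate = os.path.join(java_home, "bin", "java")
        if os.path.isfile(candidate):
            return candidate
    return shutil.which("java")
```

If this returns None, the proposal would prompt the user and, on confirmation, install JDK 17 via the install-jdk package and set JAVA_HOME to the returned install location.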
[jira] [Created] (SPARK-45729) Fix PySpark testing guide links
Amanda Liu created SPARK-45729: -- Summary: Fix PySpark testing guide links Key: SPARK-45729 URL: https://issues.apache.org/jira/browse/SPARK-45729 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.5.0 Reporter: Amanda Liu
[jira] [Updated] (SPARK-44712) Migrate test_timedelta_ops assert_eq to use assertDataFrameEqual
[ https://issues.apache.org/jira/browse/SPARK-44712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amanda Liu updated SPARK-44712: --- Description: Migrate assert_eq to assertDataFrameEqual in this file: [python/pyspark/pandas/tests/data_type_ops/test_timedelta_ops.py|https://github.com/databricks/runtime/blob/f579860299b16f6614f70c7cf2509cd89816d363/python/pyspark/pandas/tests/data_type_ops/test_timedelta_ops.py#L176] (was: Migrate assert_eq to assertDataFrameEqual in this file: [python/pyspark/pandas/tests/indexes/test_reset_index.py|https://github.com/apache/spark/blob/42e5daddf3ba16ff7d08e82e51cd8924cc56e180/python/pyspark/pandas/tests/indexes/test_reset_index.py#L46]) > Migrate test_timedelta_ops assert_eq to use assertDataFrameEqual > - > > Key: SPARK-44712 > URL: https://issues.apache.org/jira/browse/SPARK-44712 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Amanda Liu >Priority: Major > > Migrate assert_eq to assertDataFrameEqual in this file: > [python/pyspark/pandas/tests/data_type_ops/test_timedelta_ops.py|https://github.com/databricks/runtime/blob/f579860299b16f6614f70c7cf2509cd89816d363/python/pyspark/pandas/tests/data_type_ops/test_timedelta_ops.py#L176]
[jira] [Created] (SPARK-44712) Migrate test_timedelta_ops assert_eq to use assertDataFrameEqual
Amanda Liu created SPARK-44712: -- Summary: Migrate test_timedelta_ops assert_eq to use assertDataFrameEqual Key: SPARK-44712 URL: https://issues.apache.org/jira/browse/SPARK-44712 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.5.0 Reporter: Amanda Liu Migrate assert_eq to assertDataFrameEqual in this file: [python/pyspark/pandas/tests/indexes/test_reset_index.py|https://github.com/apache/spark/blob/42e5daddf3ba16ff7d08e82e51cd8924cc56e180/python/pyspark/pandas/tests/indexes/test_reset_index.py#L46]
[jira] [Updated] (SPARK-44711) Migrate test_series_conversion assert_eq to use assertDataFrameEqual
[ https://issues.apache.org/jira/browse/SPARK-44711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amanda Liu updated SPARK-44711: --- Description: Migrate assert_eq to assertDataFrameEqual in this file: [python/pyspark/pandas/tests/test_series_conversion.py|https://github.com/databricks/runtime/blob/d162bd182de8bcea180d874027edb86ae8fc60e5/python/pyspark/pandas/tests/test_series_conversion.py#L63] was:Migrate assert_eq to assertDataFrameEqual in this file: [python/pyspark/pandas/tests/indexes/test_reset_index.py|https://github.com/apache/spark/blob/42e5daddf3ba16ff7d08e82e51cd8924cc56e180/python/pyspark/pandas/tests/indexes/test_reset_index.py#L46] > Migrate test_series_conversion assert_eq to use assertDataFrameEqual > > > Key: SPARK-44711 > URL: https://issues.apache.org/jira/browse/SPARK-44711 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Amanda Liu >Priority: Major > > Migrate assert_eq to assertDataFrameEqual in this file: > [python/pyspark/pandas/tests/test_series_conversion.py|https://github.com/databricks/runtime/blob/d162bd182de8bcea180d874027edb86ae8fc60e5/python/pyspark/pandas/tests/test_series_conversion.py#L63]
[jira] [Created] (SPARK-44711) Migrate test_series_conversion assert_eq to use assertDataFrameEqual
Amanda Liu created SPARK-44711: -- Summary: Migrate test_series_conversion assert_eq to use assertDataFrameEqual Key: SPARK-44711 URL: https://issues.apache.org/jira/browse/SPARK-44711 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.5.0 Reporter: Amanda Liu Migrate assert_eq to assertDataFrameEqual in this file: [python/pyspark/pandas/tests/indexes/test_reset_index.py|https://github.com/apache/spark/blob/42e5daddf3ba16ff7d08e82e51cd8924cc56e180/python/pyspark/pandas/tests/indexes/test_reset_index.py#L46]
[jira] [Created] (SPARK-44708) Migrate test_reset_index assert_eq to use assertDataFrameEqual
Amanda Liu created SPARK-44708: -- Summary: Migrate test_reset_index assert_eq to use assertDataFrameEqual Key: SPARK-44708 URL: https://issues.apache.org/jira/browse/SPARK-44708 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.5.0 Reporter: Amanda Liu Migrate assert_eq to assertDataFrameEqual in this file: [python/pyspark/pandas/tests/indexes/test_reset_index.py|https://github.com/apache/spark/blob/42e5daddf3ba16ff7d08e82e51cd8924cc56e180/python/pyspark/pandas/tests/indexes/test_reset_index.py#L46]
[jira] [Updated] (SPARK-44597) Migrate test_sql assert_eq to use assertDataFrameEqual
[ https://issues.apache.org/jira/browse/SPARK-44597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amanda Liu updated SPARK-44597: --- Description: Migrate tests to new test utils in this file: python/pyspark/pandas/tests/test_sql.py (was: The Jira ticket [SPARK-44042] SPIP: PySpark Test Framework introduces a new PySpark test framework. Some of the user-facing testing util APIs include assertDataFrameEqual, assertSchemaEqual, and assertPandasOnSparkEqual. With the new testing framework, we should migrate old tests in the Spark codebase to use the new testing utils.) > Migrate test_sql assert_eq to use assertDataFrameEqual > -- > > Key: SPARK-44597 > URL: https://issues.apache.org/jira/browse/SPARK-44597 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Amanda Liu >Assignee: Amanda Liu >Priority: Major > Fix For: 3.5.0, 4.0.0 > > > Migrate tests to new test utils in this file: > python/pyspark/pandas/tests/test_sql.py
[jira] [Updated] (SPARK-44589) Migrate PySpark tests to use PySpark built-in test utils
[ https://issues.apache.org/jira/browse/SPARK-44589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amanda Liu updated SPARK-44589: --- Description: The Jira ticket SPARK-44042 SPIP: PySpark Test Framework introduces a new PySpark test framework. Some of the user-facing testing util APIs include assertDataFrameEqual, assertSchemaEqual, and assertPandasOnSparkEqual. With the new testing framework, we should migrate old tests in the Spark codebase to use the new testing utils. was:Migrate existing tests in the PySpark codebase to use the new PySpark test utils, outlined here: https://issues.apache.org/jira/browse/SPARK-44042 > Migrate PySpark tests to use PySpark built-in test utils > > > Key: SPARK-44589 > URL: https://issues.apache.org/jira/browse/SPARK-44589 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Amanda Liu >Priority: Major > > The Jira ticket SPARK-44042 SPIP: PySpark Test Framework introduces a new > PySpark test framework. Some of the user-facing testing util APIs include > assertDataFrameEqual, assertSchemaEqual, and assertPandasOnSparkEqual. > With the new testing framework, we should migrate old tests in the Spark > codebase to use the new testing utils.
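The approximate-equality idea behind assertDataFrameEqual can be illustrated without Spark installed: conceptually, two rows are considered equal when every float pair is within a relative tolerance and all other values match exactly. The sketch below is a pure-Python illustration of that idea, not the pyspark implementation, and the default tolerance value is an assumption:

```python
import math


def rows_approx_equal(row1, row2, rtol=1e-5):
    """Element-wise row comparison: floats are compared with a relative
    tolerance, all other values with exact equality."""
    if len(row1) != len(row2):
        return False
    for a, b in zip(row1, row2):
        if isinstance(a, float) and isinstance(b, float):
            # Floats match when within the relative tolerance.
            if not math.isclose(a, b, rel_tol=rtol):
                return False
        elif a != b:
            return False
    return True
```

This is why the migrated tests can compare DataFrames produced by slightly different float arithmetic without spurious failures.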
[jira] [Created] (SPARK-44682) Make pandas error class message_parameters strings
Amanda Liu created SPARK-44682: -- Summary: Make pandas error class message_parameters strings Key: SPARK-44682 URL: https://issues.apache.org/jira/browse/SPARK-44682 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.5.0 Reporter: Amanda Liu
[jira] [Updated] (SPARK-44548) Add support for pandas-on-Spark DataFrame assertDataFrameEqual
[ https://issues.apache.org/jira/browse/SPARK-44548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amanda Liu updated SPARK-44548: --- Summary: Add support for pandas-on-Spark DataFrame assertDataFrameEqual (was: Add support for pandas DataFrame assertDataFrameEqual) > Add support for pandas-on-Spark DataFrame assertDataFrameEqual > -- > > Key: SPARK-44548 > URL: https://issues.apache.org/jira/browse/SPARK-44548 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Amanda Liu >Assignee: Amanda Liu >Priority: Major > Fix For: 3.5.0, 4.0.0 > > > SPIP: > https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v
[jira] [Created] (SPARK-44665) Add support for pandas DataFrame assertDataFrameEqual
Amanda Liu created SPARK-44665: -- Summary: Add support for pandas DataFrame assertDataFrameEqual Key: SPARK-44665 URL: https://issues.apache.org/jira/browse/SPARK-44665 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.5.0 Reporter: Amanda Liu
[jira] [Created] (SPARK-44652) Raise error when only one df is None
Amanda Liu created SPARK-44652: -- Summary: Raise error when only one df is None Key: SPARK-44652 URL: https://issues.apache.org/jira/browse/SPARK-44652 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.5.0 Reporter: Amanda Liu
[jira] [Updated] (SPARK-44645) Update assertDataFrameEqual docs error example output
[ https://issues.apache.org/jira/browse/SPARK-44645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amanda Liu updated SPARK-44645: --- Summary: Update assertDataFrameEqual docs error example output (was: Update assertDataFrame docs error example output) > Update assertDataFrameEqual docs error example output > - > > Key: SPARK-44645 > URL: https://issues.apache.org/jira/browse/SPARK-44645 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Amanda Liu >Priority: Major >
[jira] [Created] (SPARK-44645) Update assertDataFrame docs error example output
Amanda Liu created SPARK-44645: -- Summary: Update assertDataFrame docs error example output Key: SPARK-44645 URL: https://issues.apache.org/jira/browse/SPARK-44645 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.5.0 Reporter: Amanda Liu
[jira] [Created] (SPARK-44629) Publish PySpark Test Guidelines webpage
Amanda Liu created SPARK-44629: -- Summary: Publish PySpark Test Guidelines webpage Key: SPARK-44629 URL: https://issues.apache.org/jira/browse/SPARK-44629 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.5.0 Reporter: Amanda Liu
[jira] [Created] (SPARK-44617) Support comparison between list of Rows
Amanda Liu created SPARK-44617: -- Summary: Support comparison between list of Rows Key: SPARK-44617 URL: https://issues.apache.org/jira/browse/SPARK-44617 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.5.0 Reporter: Amanda Liu
[jira] [Updated] (SPARK-44617) Support comparison between lists of Rows
[ https://issues.apache.org/jira/browse/SPARK-44617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amanda Liu updated SPARK-44617: --- Summary: Support comparison between lists of Rows (was: Support comparison between list of Rows) > Support comparison between lists of Rows > > > Key: SPARK-44617 > URL: https://issues.apache.org/jira/browse/SPARK-44617 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Amanda Liu >Priority: Major >
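A minimal sketch of the unordered list-of-Rows comparison SPARK-44617 asks for, with Rows modeled here as plain dicts (a hypothetical stand-in; the real util accepts `pyspark.sql.Row` objects and is more involved):

```python
def rows_equal(actual, expected):
    """Illustrative sketch: compare two lists of rows ignoring row order,
    as assertDataFrameEqual does by default. Rows modeled as dicts."""
    # Sort both sides by their (field, value) pairs so row order is irrelevant,
    # then compare element-wise.
    key = lambda row: sorted(row.items())
    return sorted(actual, key=key) == sorted(expected, key=key)
```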
[jira] [Updated] (SPARK-44597) Migrate test_sql assert_eq to use assertDataFrameEqual
[ https://issues.apache.org/jira/browse/SPARK-44597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amanda Liu updated SPARK-44597: --- Description: The Jira ticket [[SPARK-44042] SPIP: PySpark Test Framework |https://issues.apache.org/jira/browse/SPARK-44042] introduces a new PySpark test framework. Some of the user-facing testing util APIs include assertDataFrameEqual, assertSchemaEqual, and assertPandasOnSparkEqual. With the new testing framework, we should migrate old tests in the Spark codebase to use the new testing utils. > Migrate test_sql assert_eq to use assertDataFrameEqual > -- > > Key: SPARK-44597 > URL: https://issues.apache.org/jira/browse/SPARK-44597 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Amanda Liu >Assignee: Amanda Liu >Priority: Major > Fix For: 3.5.0, 4.0.0 > > > The Jira ticket [[SPARK-44042] SPIP: PySpark Test Framework > |https://issues.apache.org/jira/browse/SPARK-44042] introduces a new PySpark > test framework. Some of the user-facing testing util APIs include > assertDataFrameEqual, assertSchemaEqual, and assertPandasOnSparkEqual. > > With the new testing framework, we should migrate old tests in the Spark > codebase to use the new testing utils.
[jira] [Updated] (SPARK-44597) Migrate test_sql assert_eq to use assertDataFrameEqual
[ https://issues.apache.org/jira/browse/SPARK-44597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amanda Liu updated SPARK-44597: --- Description: The Jira ticket [SPARK-44042] SPIP: PySpark Test Framework introduces a new PySpark test framework. Some of the user-facing testing util APIs include assertDataFrameEqual, assertSchemaEqual, and assertPandasOnSparkEqual. With the new testing framework, we should migrate old tests in the Spark codebase to use the new testing utils. was: The Jira ticket [[SPARK-44042] SPIP: PySpark Test Framework |https://issues.apache.org/jira/browse/SPARK-44042] introduces a new PySpark test framework. Some of the user-facing testing util APIs include assertDataFrameEqual, assertSchemaEqual, and assertPandasOnSparkEqual. With the new testing framework, we should migrate old tests in the Spark codebase to use the new testing utils. > Migrate test_sql assert_eq to use assertDataFrameEqual > -- > > Key: SPARK-44597 > URL: https://issues.apache.org/jira/browse/SPARK-44597 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Amanda Liu >Assignee: Amanda Liu >Priority: Major > Fix For: 3.5.0, 4.0.0 > > > The Jira ticket [SPARK-44042] SPIP: PySpark Test Framework introduces a new > PySpark test framework. Some of the user-facing testing util APIs include > assertDataFrameEqual, assertSchemaEqual, and assertPandasOnSparkEqual. > With the new testing framework, we should migrate old tests in the Spark > codebase to use the new testing utils.
[jira] [Updated] (SPARK-44603) Add pyspark.testing to setup.py
[ https://issues.apache.org/jira/browse/SPARK-44603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amanda Liu updated SPARK-44603: --- Summary: Add pyspark.testing to setup.py (was: Add pyspark.testing.utils to setup.py) > Add pyspark.testing to setup.py > --- > > Key: SPARK-44603 > URL: https://issues.apache.org/jira/browse/SPARK-44603 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Amanda Liu >Priority: Major >
[jira] [Created] (SPARK-44603) Add pyspark.testing.utils to Python Setup
Amanda Liu created SPARK-44603: -- Summary: Add pyspark.testing.utils to Python Setup Key: SPARK-44603 URL: https://issues.apache.org/jira/browse/SPARK-44603 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.5.0 Reporter: Amanda Liu
[jira] [Updated] (SPARK-44603) Add pyspark.testing.utils to setup.py
[ https://issues.apache.org/jira/browse/SPARK-44603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amanda Liu updated SPARK-44603: --- Summary: Add pyspark.testing.utils to setup.py (was: Add pyspark.testing.utils to Python Setup) > Add pyspark.testing.utils to setup.py > - > > Key: SPARK-44603 > URL: https://issues.apache.org/jira/browse/SPARK-44603 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Amanda Liu >Priority: Major >
[jira] [Created] (SPARK-44597) Migrate test_sql assert_eq to use assertDataFrameEqual
Amanda Liu created SPARK-44597: -- Summary: Migrate test_sql assert_eq to use assertDataFrameEqual Key: SPARK-44597 URL: https://issues.apache.org/jira/browse/SPARK-44597 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.5.0 Reporter: Amanda Liu
[jira] [Created] (SPARK-44596) Fix pandas-on-Spark type checks for assertDataFrameEqual
Amanda Liu created SPARK-44596: -- Summary: Fix pandas-on-Spark type checks for assertDataFrameEqual Key: SPARK-44596 URL: https://issues.apache.org/jira/browse/SPARK-44596 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.5.0 Reporter: Amanda Liu Assignee: Amanda Liu Fix For: 3.5.0, 4.0.0 SPIP: https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v
[jira] [Created] (SPARK-44589) Migrate PySpark tests to use PySpark built-in test utils
Amanda Liu created SPARK-44589: -- Summary: Migrate PySpark tests to use PySpark built-in test utils Key: SPARK-44589 URL: https://issues.apache.org/jira/browse/SPARK-44589 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.5.0 Reporter: Amanda Liu Migrate existing tests in the PySpark codebase to use the new PySpark test utils, outlined here: https://issues.apache.org/jira/browse/SPARK-44042
[jira] [Updated] (SPARK-44218) Customize diff log in assertDataFrameEqual error message format
[ https://issues.apache.org/jira/browse/SPARK-44218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amanda Liu updated SPARK-44218: --- Summary: Customize diff log in assertDataFrameEqual error message format (was: Customize context_diff in assertDataFrameEqual error message format) > Customize diff log in assertDataFrameEqual error message format > --- > > Key: SPARK-44218 > URL: https://issues.apache.org/jira/browse/SPARK-44218 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Amanda Liu >Priority: Major > > SPIP: > https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v
[jira] [Updated] (SPARK-44218) Customize context_diff in assertDataFrameEqual error message format
[ https://issues.apache.org/jira/browse/SPARK-44218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amanda Liu updated SPARK-44218: --- Summary: Customize context_diff in assertDataFrameEqual error message format (was: Add improved error message formatting for assert_approx_df_equality) > Customize context_diff in assertDataFrameEqual error message format > --- > > Key: SPARK-44218 > URL: https://issues.apache.org/jira/browse/SPARK-44218 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Amanda Liu >Priority: Major > > SPIP: > https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v
[jira] [Updated] (SPARK-44546) Add a dev utility to generate PySpark tests with LLM
[ https://issues.apache.org/jira/browse/SPARK-44546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amanda Liu updated SPARK-44546: --- Description: h2. Summary This ticket adds a dev utility script to help generate PySpark tests using LLM response. The purpose of this experimental script is to encourage PySpark developers to test their code thoroughly, to avoid introducing regressions in the codebase. Historically, PySpark has had code regressions due to insufficient testing of public APIs. Below, we outline some common edge case scenarios for PySpark DataFrame APIs, from the perspective of arguments. Many of these edge cases are passed into the LLM through the script's base prompt. Please note that this list is not exhaustive, but rather a starting point. Some of these cases may not apply, depending on the situation. We encourage all PySpark developers to carefully consider edge case scenarios when writing tests. h2. Table of Contents # None # Ints # Floats # Strings # Single column / column name # Multi column / column names # DataFrame argument h3. 1. None * Empty input * None type h3. 2. Ints * Negatives * 0 * value > Int.MaxValue * value < Int.MinValue h3. 3. Floats * Negatives * 0.0 * Float(“nan”) * Float("inf") * Float("-inf") * decimal.Decimal * numpy.float16 h3. 4. Strings * Special characters * Spaces * Empty strings h3. 5. Single column / column name * Non-existent column * Empty column name * Column name with special characters, e.g. dots * Multi columns with the same name * Nested column vs. quoted column name, e.g. ‘a.b.c’ vs ‘`a.b.c`’ * Column of special types, e.g. nested type; * Column containing special values, e.g. Null; h3. 6. Multi column / column names * Empty input; e.g DataFrame.drop() * Special cases for each single column * Mix column with column names; e.g. DataFrame.drop(“col1”, df.col2, “col3”) * Duplicated columns; e.g. DataFrame.drop(“col1”, col(“col1”)) h3. 7. 
DataFrame argument * DataFrame argument * Empty dataframe; e.g. spark.range(5).limit(0) * Dataframe with 0 columns, e.g. spark.range(5).drop('id') * Dataset with repeated arguments * Local dataset (pd.DataFrame) containing unsupported datatype was: h2. Summary This ticket adds a dev utility script to help generate PySpark tests using LLM response. The purpose of this experimental script is to encourage PySpark developers to test their code thoroughly, to avoid introducing regressions in the codebase. Historically, PySpark has had code regressions due to insufficient testing of public APIs (see [https://databricks.atlassian.net/browse/ES-705815]). Below, we outline some common edge case scenarios for PySpark DataFrame APIs, from the perspective of arguments. Many of these edge cases are passed into the LLM through the script's base prompt. Please note that this list is not exhaustive, but rather a starting point. Some of these cases may not apply, depending on the situation. We encourage all PySpark developers to carefully consider edge case scenarios when writing tests. h2. Table of Contents # None # Ints # Floats # Strings # Single column / column name # Multi column / column names # DataFrame argument h3. 1. None * Empty input * None type h3. 2. Ints * Negatives * 0 * value > Int.MaxValue * value < Int.MinValue h3. 3. Floats * Negatives * 0.0 * Float(“nan”) * Float("inf") * Float("-inf") * decimal.Decimal * numpy.float16 h3. 4. Strings * Special characters * Spaces * Empty strings h3. 5. Single column / column name * Non-existent column * Empty column name * Column name with special characters, e.g. dots * Multi columns with the same name * Nested column vs. quoted column name, e.g. ‘a.b.c’ vs ‘`a.b.c`’ * Column of special types, e.g. nested type; * Column containing special values, e.g. Null; h3. 6. Multi column / column names * Empty input; e.g DataFrame.drop() * Special cases for each single column * Mix column with column names; e.g. 
DataFrame.drop(“col1”, df.col2, “col3”) * Duplicated columns; e.g. DataFrame.drop(“col1”, col(“col1”)) h3. 7. DataFrame argument * DataFrame argument * Empty dataframe; e.g. spark.range(5).limit(0) * Dataframe with 0 columns, e.g. spark.range(5).drop('id') * Dataset with repeated arguments * Local dataset (pd.DataFrame) containing unsupported datatype > Add a dev utility to generate PySpark tests with LLM > > > Key: SPARK-44546 > URL: https://issues.apache.org/jira/browse/SPARK-44546 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Amanda Liu >Priority: Major > > h2. Summary > This ticket adds a dev utility script to help generate PySpark tests using > LLM response. The purpose of this experimental script is
[jira] [Created] (SPARK-44548) Add support for pandas DataFrame assertDataFrameEqual
Amanda Liu created SPARK-44548: -- Summary: Add support for pandas DataFrame assertDataFrameEqual Key: SPARK-44548 URL: https://issues.apache.org/jira/browse/SPARK-44548 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.5.0 Reporter: Amanda Liu Assignee: Amanda Liu Fix For: 3.5.0, 4.0.0 SPIP: https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v
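One way to picture the pandas support this ticket adds is a type-dispatch step ahead of the row comparison. This is a loose sketch under the assumption that pandas inputs are routed to a pandas-level equality check; `dispatch_compare` is hypothetical and far simpler than the real util, which also handles tolerances and schema checks:

```python
def dispatch_compare(actual, expected):
    """Illustrative dispatch: pandas DataFrames go to pandas' own
    comparison; other list-like inputs fall back to element-wise equality."""
    try:
        import pandas as pd
    except ImportError:
        pd = None  # pandas optional; fall back to the generic path
    if pd is not None and isinstance(actual, pd.DataFrame) \
            and isinstance(expected, pd.DataFrame):
        return actual.equals(expected)
    return list(actual) == list(expected)
```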
[jira] [Updated] (SPARK-44546) Add a dev utility to generate PySpark tests with LLM
[ https://issues.apache.org/jira/browse/SPARK-44546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amanda Liu updated SPARK-44546: --- Description: h2. Summary This ticket adds a dev utility script to help generate PySpark tests using LLM response. The purpose of this experimental script is to encourage PySpark developers to test their code thoroughly, to avoid introducing regressions in the codebase. Historically, PySpark has had code regressions due to insufficient testing of public APIs (see [https://databricks.atlassian.net/browse/ES-705815]). Below, we outline some common edge case scenarios for PySpark DataFrame APIs, from the perspective of arguments. Many of these edge cases are passed into the LLM through the script's base prompt. Please note that this list is not exhaustive, but rather a starting point. Some of these cases may not apply, depending on the situation. We encourage all PySpark developers to carefully consider edge case scenarios when writing tests. h2. Table of Contents # None # Ints # Floats # Strings # Single column / column name # Multi column / column names # DataFrame argument h3. 1. None * Empty input * None type h3. 2. Ints * Negatives * 0 * value > Int.MaxValue * value < Int.MinValue h3. 3. Floats * Negatives * 0.0 * Float(“nan”) * Float("inf") * Float("-inf") * decimal.Decimal * numpy.float16 h3. 4. Strings * Special characters * Spaces * Empty strings h3. 5. Single column / column name * Non-existent column * Empty column name * Column name with special characters, e.g. dots * Multi columns with the same name * Nested column vs. quoted column name, e.g. ‘a.b.c’ vs ‘`a.b.c`’ * Column of special types, e.g. nested type; * Column containing special values, e.g. Null; h3. 6. Multi column / column names * Empty input; e.g DataFrame.drop() * Special cases for each single column * Mix column with column names; e.g. DataFrame.drop(“col1”, df.col2, “col3”) * Duplicated columns; e.g. 
DataFrame.drop(“col1”, col(“col1”)) h3. 7. DataFrame argument * DataFrame argument * Empty dataframe; e.g. spark.range(5).limit(0) * Dataframe with 0 columns, e.g. spark.range(5).drop('id') * Dataset with repeated arguments * Local dataset (pd.DataFrame) containing unsupported datatype was: h2. Summary This ticket adds a dev utility script to help generate PySpark tests using LLM response. The purpose of this experimental script is to encourage PySpark developers to test their code thoroughly, to avoid introducing regressions in the codebase. Below, we outline some common edge case scenarios for PySpark DataFrame APIs, from the perspective of arguments. Many of these edge cases are passed into the LLM through the script's base prompt. Please note that this list is not exhaustive, but rather a starting point. Some of these cases may not apply, depending on the situation. We encourage all PySpark developers to carefully consider edge case scenarios when writing tests. h2. Table of Contents # None # Ints # Floats # Strings # Single column / column name # Multi column / column names # DataFrame argument h3. 1. None * Empty input * None type h3. 2. Ints * Negatives * 0 * value > Int.MaxValue * value < Int.MinValue h3. 3. Floats * Negatives * 0.0 * Float(“nan”) * Float("inf") * Float("-inf") * decimal.Decimal * numpy.float16 h3. 4. Strings * Special characters * Spaces * Empty strings h3. 5. Single column / column name * Non-existent column * Empty column name * Column name with special characters, e.g. dots * Multi columns with the same name * Nested column vs. quoted column name, e.g. ‘a.b.c’ vs ‘`a.b.c`’ * Column of special types, e.g. nested type; * Column containing special values, e.g. Null; h3. 6. Multi column / column names * Empty input; e.g DataFrame.drop() * Special cases for each single column * Mix column with column names; e.g. DataFrame.drop(“col1”, df.col2, “col3”) * Duplicated columns; e.g. DataFrame.drop(“col1”, col(“col1”)) h3. 7. 
DataFrame argument * DataFrame argument * Empty dataframe; e.g. spark.range(5).limit(0) * Dataframe with 0 columns, e.g. spark.range(5).drop('id') * Dataset with repeated arguments * Local dataset (pd.DataFrame) containing unsupported datatype > Add a dev utility to generate PySpark tests with LLM > > > Key: SPARK-44546 > URL: https://issues.apache.org/jira/browse/SPARK-44546 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Amanda Liu >Priority: Major > > h2. Summary > This ticket adds a dev utility script to help generate PySpark tests using > LLM response. The purpose of this experimental script is to encourage PySpark > developers to test their code thoroughly, to avoid introducing
[jira] [Updated] (SPARK-44546) Add a dev utility to generate PySpark tests with LLM
[ https://issues.apache.org/jira/browse/SPARK-44546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amanda Liu updated SPARK-44546: --- Description: h2. Summary This ticket adds a dev utility script to help generate PySpark tests using LLM response. The purpose of this experimental script is to encourage PySpark developers to test their code thoroughly, to avoid introducing regressions in the codebase. Below, we outline some common edge case scenarios for PySpark DataFrame APIs, from the perspective of arguments. Many of these edge cases are passed into the LLM through the script's base prompt. Please note that this list is not exhaustive, but rather a starting point. Some of these cases may not apply, depending on the situation. We encourage all PySpark developers to carefully consider edge case scenarios when writing tests. h2. Table of Contents # None # Ints # Floats # Strings # Single column / column name # Multi column / column names # DataFrame argument h3. 1. None * Empty input * None type h3. 2. Ints * Negatives * 0 * value > Int.MaxValue * value < Int.MinValue h3. 3. Floats * Negatives * 0.0 * Float(“nan”) * Float("inf") * Float("-inf") * decimal.Decimal * numpy.float16 h3. 4. Strings * Special characters * Spaces * Empty strings h3. 5. Single column / column name * Non-existent column * Empty column name * Column name with special characters, e.g. dots * Multi columns with the same name * Nested column vs. quoted column name, e.g. ‘a.b.c’ vs ‘`a.b.c`’ * Column of special types, e.g. nested type; * Column containing special values, e.g. Null; h3. 6. Multi column / column names * Empty input; e.g DataFrame.drop() * Special cases for each single column * Mix column with column names; e.g. DataFrame.drop(“col1”, df.col2, “col3”) * Duplicated columns; e.g. DataFrame.drop(“col1”, col(“col1”)) h3. 7. DataFrame argument * DataFrame argument * Empty dataframe; e.g. spark.range(5).limit(0) * Dataframe with 0 columns, e.g. 
spark.range(5).drop('id') * Dataset with repeated arguments * Local dataset (pd.DataFrame) containing unsupported datatype was: h2. Summary This ticket adds a dev utility script to help generate PySpark tests using LLM response. The purpose of this experimental script is to encourage PySpark developers to test their code thoroughly, to avoid introducing regressions in the codebase. Below, we outline some common edge case scenarios for PySpark DataFrame APIs, from the perspective of arguments. Many of these edge cases are passed into the LLM through the script's base prompt. Please note that this list is not exhaustive, but rather a starting point. Some of these cases may not apply, depending on the situation. We encourage all PySpark developers to carefully consider edge case scenarios when writing tests. h2. Table of Contents # None # Ints # Floats # String # Single column / column name # Multi column / column names # DataFrame argument h3. 1. None * Empty input * None type h3. 2. Ints * Negatives * 0 * value > Int.MaxValue * value < Int.MinValue h3. 3. Floats * Negatives * 0.0 * Float(“nan”) * Float("inf") * Float("-inf") * decimal.Decimal * numpy.float16 h3. 4. String * Special characters * Spaces * Empty strings h3. 5. Single column / column name * Non-existent column * Empty column name * Column name with special characters, e.g. dots * Multi columns with the same name * Nested column vs. quoted column name, e.g. ‘a.b.c’ vs ‘`a.b.c`’ * Column of special types, e.g. nested type; * Column containing special values, e.g. Null; h3. 6. Multi column / column names * Empty input; e.g DataFrame.drop() * Special cases for each single column * Mix column with column names; e.g. DataFrame.drop(“col1”, df.col2, “col3”) * Duplicated columns; e.g. DataFrame.drop(“col1”, col(“col1”)) h3. 7. DataFrame argument * DataFrame argument * Empty dataframe; e.g. spark.range(5).limit(0) * Dataframe with 0 columns, e.g. 
spark.range(5).drop('id') * Dataset with repeated arguments * Local dataset (pd.DataFrame) containing unsupported datatype > Add a dev utility to generate PySpark tests with LLM > > > Key: SPARK-44546 > URL: https://issues.apache.org/jira/browse/SPARK-44546 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Amanda Liu >Priority: Major > > h2. Summary > This ticket adds a dev utility script to help generate PySpark tests using > LLM response. The purpose of this experimental script is to encourage PySpark > developers to test their code thoroughly, to avoid introducing regressions in > the codebase. > Below, we outline some common edge case scenarios for PySpark DataFrame APIs, > from the perspective of arguments. Many
[jira] [Updated] (SPARK-44546) Add a dev utility to generate PySpark tests with LLM
[ https://issues.apache.org/jira/browse/SPARK-44546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amanda Liu updated SPARK-44546: --- Description: h2. Summary This ticket adds a dev utility script to help generate PySpark tests using LLM response. The purpose of this experimental script is to encourage PySpark developers to test their code thoroughly, to avoid introducing regressions in the codebase. Below, we outline some common edge case scenarios for PySpark DataFrame APIs, from the perspective of arguments. Many of these edge cases are passed into the LLM through the script's base prompt. Please note that this list is not exhaustive, but rather a starting point. Some of these cases may not apply, depending on the situation. We encourage all PySpark developers to carefully consider edge case scenarios when writing tests. h2. Table of Contents # None # Ints # Floats # String # Single column / column name # Multi column / column names # DataFrame argument h3. 1. None * Empty input * None type h3. 2. Ints * Negatives * 0 * value > Int.MaxValue * value < Int.MinValue h3. 3. Floats * Negatives * 0.0 * Float(“nan”) * Float("inf") * Float("-inf") * decimal.Decimal * numpy.float16 h3. 4. String * Special characters * Spaces * Empty strings h3. 5. Single column / column name * Non-existent column * Empty column name * Column name with special characters, e.g. dots * Multi columns with the same name * Nested column vs. quoted column name, e.g. ‘a.b.c’ vs ‘`a.b.c`’ * Column of special types, e.g. nested type; * Column containing special values, e.g. Null; h3. 6. Multi column / column names * Empty input; e.g DataFrame.drop() * Special cases for each single column * Mix column with column names; e.g. DataFrame.drop(“col1”, df.col2, “col3”) * Duplicated columns; e.g. DataFrame.drop(“col1”, col(“col1”)) h3. 7. DataFrame argument * DataFrame argument * Empty dataframe; e.g. spark.range(5).limit(0) * Dataframe with 0 columns, e.g. 
spark.range(5).drop('id') * Dataset with repeated arguments * Local dataset (pd.DataFrame) containing unsupported datatype was: h2. Summary This ticket adds a dev utility script to help generate PySpark tests using LLM response. The purpose of this experimental script is to encourage PySpark developers to test their code thoroughly, to avoid introducing regressions in the codebase. Below, we outline some common edge case scenarios for PySpark DataFrame APIs, from the perspective of arguments. Many of these edge cases are passed into the LLM through the script's base prompt. Please note that this list is not exhaustive, but rather a starting point. Some of these cases may not apply, depending on the situation. We encourage all PySpark developers to carefully consider edge case scenarios when writing tests. h2. Table of Contents # None # Ints # Floats # Single column / column name # Multi column / column names # DataFrame argument h3. 1. None * Empty input * None type h3. 2. Ints * Negatives * 0 * value > Int.MaxValue * value < Int.MinValue h3. 3. Floats * Negatives * 0.0 * Float(“nan”) * Float("inf") * Float("-inf") * decimal.Decimal * numpy.float16 h3. 4. String * Special characters * Spaces * Empty strings h3. 5. Single column / column name * Non-existent column * Empty column name * Column name with special characters, e.g. dots * Multi columns with the same name * Nested column vs. quoted column name, e.g. ‘a.b.c’ vs ‘`a.b.c`’ * Column of special types, e.g. nested type; * Column containing special values, e.g. Null; h3. 6. Multi column / column names * Empty input; e.g DataFrame.drop() * Special cases for each single column * Mix column with column names; e.g. DataFrame.drop(“col1”, df.col2, “col3”) * Duplicated columns; e.g. DataFrame.drop(“col1”, col(“col1”)) h3. 7. DataFrame argument * DataFrame argument * Empty dataframe; e.g. spark.range(5).limit(0) * Dataframe with 0 columns, e.g. 
spark.range(5).drop('id') * Dataset with repeated arguments * Local dataset (pd.DataFrame) containing unsupported datatype > Add a dev utility to generate PySpark tests with LLM > > > Key: SPARK-44546 > URL: https://issues.apache.org/jira/browse/SPARK-44546 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Amanda Liu >Priority: Major > > h2. Summary > This ticket adds a dev utility script to help generate PySpark tests using > LLM response. The purpose of this experimental script is to encourage PySpark > developers to test their code thoroughly, to avoid introducing regressions in > the codebase. > Below, we outline some common edge case scenarios for PySpark DataFrame APIs, > from the perspective of arguments. Many of these edge
[jira] [Updated] (SPARK-44546) Add a dev utility to generate PySpark tests with LLM
[ https://issues.apache.org/jira/browse/SPARK-44546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amanda Liu updated SPARK-44546: --- Description: h2. Summary This ticket adds a dev utility script to help generate PySpark tests using LLM response. The purpose of this experimental script is to encourage PySpark developers to test their code thoroughly, to avoid introducing regressions in the codebase. Below, we outline some common edge case scenarios for PySpark DataFrame APIs, from the perspective of arguments. Many of these edge cases are passed into the LLM through the script's base prompt. Please note that this list is not exhaustive, but rather a starting point. Some of these cases may not apply, depending on the situation. We encourage all PySpark developers to carefully consider edge case scenarios when writing tests. h2. Table of Contents # None # Ints # Floats # Single column / column name # Multi column / column names # DataFrame argument h3. 1. None * Empty input * None type h3. 2. Ints * Negatives * 0 * value > Int.MaxValue * value < Int.MinValue h3. 3. Floats * Negatives * 0.0 * Float(“nan”) * Float("inf") * Float("-inf") * decimal.Decimal * numpy.float16 h3. 4. String * Special characters * Spaces * Empty strings h3. 5. Single column / column name * Non-existent column * Empty column name * Column name with special characters, e.g. dots * Multi columns with the same name * Nested column vs. quoted column name, e.g. ‘a.b.c’ vs ‘`a.b.c`’ * Column of special types, e.g. nested type; * Column containing special values, e.g. Null; h3. 6. Multi column / column names * Empty input; e.g DataFrame.drop() * Special cases for each single column * Mix column with column names; e.g. DataFrame.drop(“col1”, df.col2, “col3”) * Duplicated columns; e.g. DataFrame.drop(“col1”, col(“col1”)) h3. 7. DataFrame argument * DataFrame argument * Empty dataframe; e.g. spark.range(5).limit(0) * Dataframe with 0 columns, e.g. 
spark.range(5).drop('id') * Dataset with repeated arguments * Local dataset (pd.DataFrame) containing unsupported datatype > Add a dev utility to generate PySpark tests with LLM > > > Key: SPARK-44546 > URL: https://issues.apache.org/jira/browse/SPARK-44546 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Amanda Liu >Priority: Major -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
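The float edge cases in section 3 of the description can be enumerated in plain Python. A minimal sketch, independent of the actual utility script (the `FLOAT_EDGE_CASES` and `classify` names are illustrative only; `numpy.float16` is omitted so the sketch stays stdlib-only):

```python
import decimal
import math

# Edge-case float inputs drawn from the list above; any PySpark API
# taking a float argument could be exercised against each of these.
FLOAT_EDGE_CASES = [
    -1.5,                    # negative
    0.0,                     # zero
    float("nan"),            # not-a-number
    float("inf"),            # positive infinity
    float("-inf"),           # negative infinity
    decimal.Decimal("1.5"),  # decimal.Decimal instead of float
]

def classify(value):
    """Bucket a value so a test can assert on the category it exercised."""
    f = float(value)
    if math.isnan(f):
        return "nan"
    if math.isinf(f):
        return "inf"
    return "finite"
```

A generated test could then loop over `FLOAT_EDGE_CASES` and assert the API under test handles each category without raising.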
[jira] [Created] (SPARK-44546) Add a dev utility to generate PySpark tests with LLM
Amanda Liu created SPARK-44546: -- Summary: Add a dev utility to generate PySpark tests with LLM Key: SPARK-44546 URL: https://issues.apache.org/jira/browse/SPARK-44546 Project: Spark Issue Type: New Feature Components: PySpark Affects Versions: 3.5.0 Reporter: Amanda Liu
[jira] [Updated] (SPARK-44061) Add assertDataFrameEqual util function
[ https://issues.apache.org/jira/browse/SPARK-44061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amanda Liu updated SPARK-44061: --- Summary: Add assertDataFrameEqual util function (was: Add assertDataFrameEquality util function) > Add assertDataFrameEqual util function > -- > > Key: SPARK-44061 > URL: https://issues.apache.org/jira/browse/SPARK-44061 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Amanda Liu >Assignee: Amanda Liu >Priority: Major > Fix For: 3.5.0 > > > SPIP: > https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v
[jira] [Created] (SPARK-44453) Use difflib to display errors in assertDataFrameEqual
Amanda Liu created SPARK-44453: -- Summary: Use difflib to display errors in assertDataFrameEqual Key: SPARK-44453 URL: https://issues.apache.org/jira/browse/SPARK-44453 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.5.0 Reporter: Amanda Liu SPIP: https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v
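The standard library's `difflib` can produce the kind of line-by-line comparison this ticket describes. A rough stdlib-only sketch of the idea, not the actual PySpark implementation (`diff_rows` is a hypothetical helper):

```python
import difflib

def diff_rows(actual_rows, expected_rows):
    """Render a unified diff of two row lists, one row per line."""
    actual_lines = [repr(r) for r in actual_rows]
    expected_lines = [repr(r) for r in expected_rows]
    return "\n".join(
        difflib.unified_diff(
            actual_lines, expected_lines,
            fromfile="actual", tofile="expected", lineterm="",
        )
    )
```

On mismatch, the assertion error message would show only the rows that differ, prefixed with `-` and `+`, rather than dumping both DataFrames in full.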
[jira] [Created] (SPARK-44446) Add checks for expected list type special cases
Amanda Liu created SPARK-44446: -- Summary: Add checks for expected list type special cases Key: SPARK-44446 URL: https://issues.apache.org/jira/browse/SPARK-44446 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.5.0 Reporter: Amanda Liu SPIP: https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v
[jira] [Created] (SPARK-44413) Clarify error for unsupported arg data type in assertDataFrameEqual
Amanda Liu created SPARK-44413: -- Summary: Clarify error for unsupported arg data type in assertDataFrameEqual Key: SPARK-44413 URL: https://issues.apache.org/jira/browse/SPARK-44413 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.5.0 Reporter: Amanda Liu SPIP: https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v
[jira] [Updated] (SPARK-44216) Make assertSchemaEqual API public
[ https://issues.apache.org/jira/browse/SPARK-44216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amanda Liu updated SPARK-44216: --- Summary: Make assertSchemaEqual API public (was: Make assertSchemaEqual API with ignore_nullable optional flag) > Make assertSchemaEqual API public > - > > Key: SPARK-44216 > URL: https://issues.apache.org/jira/browse/SPARK-44216 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Amanda Liu >Priority: Major > > SPIP: > https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v
[jira] [Created] (SPARK-44397) Expose assertDataFrameEqual in pyspark.testing.utils
Amanda Liu created SPARK-44397: -- Summary: Expose assertDataFrameEqual in pyspark.testing.utils Key: SPARK-44397 URL: https://issues.apache.org/jira/browse/SPARK-44397 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.5.0 Reporter: Amanda Liu SPIP: https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v
[jira] [Updated] (SPARK-44217) Allow custom precision for fp approx equality
[ https://issues.apache.org/jira/browse/SPARK-44217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amanda Liu updated SPARK-44217: --- Summary: Allow custom precision for fp approx equality (was: Add assert_approx_df_equality util function) > Allow custom precision for fp approx equality > - > > Key: SPARK-44217 > URL: https://issues.apache.org/jira/browse/SPARK-44217 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Amanda Liu >Priority: Major > > SPIP: > https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v
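Custom precision for approximate float equality is conventionally a relative-tolerance check. A minimal sketch of that semantics, assuming a conventional `rtol` parameter (the actual parameter names and defaults are defined by the SPIP, not here):

```python
def approx_equal(actual, expected, rtol=1e-5):
    """Relative-tolerance comparison for floats, the usual basis for
    approximate DataFrame equality checks."""
    if actual == expected:  # handles exact matches, including 0.0 == 0.0
        return True
    return abs(actual - expected) <= rtol * max(abs(actual), abs(expected))
```

A user could then loosen or tighten `rtol` per test, rather than relying on one hard-coded precision.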
[jira] [Updated] (SPARK-44216) Add assertSchemaEqual API with ignore_nullable optional flag
[ https://issues.apache.org/jira/browse/SPARK-44216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amanda Liu updated SPARK-44216: --- Summary: Add assertSchemaEqual API with ignore_nullable optional flag (was: Add improved error message formatting for assert_df_equality) > Add assertSchemaEqual API with ignore_nullable optional flag > > > Key: SPARK-44216 > URL: https://issues.apache.org/jira/browse/SPARK-44216 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Amanda Liu >Priority: Major > > SPIP: > https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v
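The semantics of an `ignore_nullable` flag can be illustrated without Spark by modeling a schema as a list of `(name, datatype, nullable)` tuples. A sketch under that assumption (`schemas_equal` is a hypothetical helper, not the real `assertSchemaEqual`):

```python
def schemas_equal(actual, expected, ignore_nullable=False):
    """Compare schemas given as lists of (name, datatype, nullable) tuples.

    With ignore_nullable=True, only field names and datatypes must match,
    mirroring the optional flag this ticket describes.
    """
    if len(actual) != len(expected):
        return False
    for (a_name, a_type, a_null), (e_name, e_type, e_null) in zip(actual, expected):
        if a_name != e_name or a_type != e_type:
            return False
        if not ignore_nullable and a_null != e_null:
            return False
    return True
```

Ignoring nullability is useful because Spark often infers different nullable flags for an expected DataFrame built by hand than for one produced by a real pipeline.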
[jira] [Updated] (SPARK-44216) Make assertSchemaEqual API with ignore_nullable optional flag
[ https://issues.apache.org/jira/browse/SPARK-44216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amanda Liu updated SPARK-44216: --- Summary: Make assertSchemaEqual API with ignore_nullable optional flag (was: Add assertSchemaEqual API with ignore_nullable optional flag) > Make assertSchemaEqual API with ignore_nullable optional flag > - > > Key: SPARK-44216 > URL: https://issues.apache.org/jira/browse/SPARK-44216 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Amanda Liu >Priority: Major > > SPIP: > https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v
[jira] [Updated] (SPARK-44363) Display percent of unequal rows in DataFrame comparison
[ https://issues.apache.org/jira/browse/SPARK-44363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amanda Liu updated SPARK-44363: --- Summary: Display percent of unequal rows in DataFrame comparison (was: Display percent of unequal rows in dataframe comparison) > Display percent of unequal rows in DataFrame comparison > --- > > Key: SPARK-44363 > URL: https://issues.apache.org/jira/browse/SPARK-44363 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Amanda Liu >Priority: Major > > SPIP: > https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v
[jira] [Updated] (SPARK-44061) Add assertDataFrameEquality util function
[ https://issues.apache.org/jira/browse/SPARK-44061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amanda Liu updated SPARK-44061: --- Summary: Add assertDataFrameEquality util function (was: Add assert_df_equality util function) > Add assertDataFrameEquality util function > - > > Key: SPARK-44061 > URL: https://issues.apache.org/jira/browse/SPARK-44061 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Amanda Liu >Assignee: Amanda Liu >Priority: Major > Fix For: 3.5.0 > > > SPIP: > https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v
[jira] [Created] (SPARK-44364) Support List[Row] data type for expected DataFrame argument
Amanda Liu created SPARK-44364: -- Summary: Support List[Row] data type for expected DataFrame argument Key: SPARK-44364 URL: https://issues.apache.org/jira/browse/SPARK-44364 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.5.0 Reporter: Amanda Liu SPIP: https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v
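Accepting a `List[Row]` for the expected argument amounts to normalizing row-like inputs into one canonical form before comparison. A stdlib-only sketch of that idea, using dicts and tuples as stand-ins for `Row` objects (the `normalize_expected` helper is hypothetical, and sorting dict keys is a simplification; real `Row` objects carry their own field order):

```python
def normalize_expected(expected):
    """Coerce an expected argument -- a list of dicts or tuples standing in
    for Rows -- into a canonical list of tuples for comparison."""
    normalized = []
    for row in expected:
        if isinstance(row, dict):
            # Sort by field name so key order does not affect equality.
            normalized.append(tuple(v for _, v in sorted(row.items())))
        else:
            normalized.append(tuple(row))
    return normalized
```

With both the actual DataFrame's collected rows and the user-supplied expected list passed through the same normalizer, the comparison logic only ever sees lists of tuples.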
[jira] [Created] (SPARK-44363) Display percent of unequal rows in dataframe comparison
Amanda Liu created SPARK-44363: -- Summary: Display percent of unequal rows in dataframe comparison Key: SPARK-44363 URL: https://issues.apache.org/jira/browse/SPARK-44363 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.5.0 Reporter: Amanda Liu SPIP: https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v
[jira] [Created] (SPARK-44357) Add pyspark_testing module for GHA tests
Amanda Liu created SPARK-44357: -- Summary: Add pyspark_testing module for GHA tests Key: SPARK-44357 URL: https://issues.apache.org/jira/browse/SPARK-44357 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.5.0 Reporter: Amanda Liu SPIP: https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v
[jira] [Created] (SPARK-44218) Add improved error message formatting for assert_approx_df_equality
Amanda Liu created SPARK-44218: -- Summary: Add improved error message formatting for assert_approx_df_equality Key: SPARK-44218 URL: https://issues.apache.org/jira/browse/SPARK-44218 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.5.0 Reporter: Amanda Liu SPIP: https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v
[jira] [Created] (SPARK-44217) Add assert_approx_df_equality util function
Amanda Liu created SPARK-44217: -- Summary: Add assert_approx_df_equality util function Key: SPARK-44217 URL: https://issues.apache.org/jira/browse/SPARK-44217 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.5.0 Reporter: Amanda Liu SPIP: https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v
[jira] [Created] (SPARK-44216) Add improved error message formatting for assert_df_equality
Amanda Liu created SPARK-44216: -- Summary: Add improved error message formatting for assert_df_equality Key: SPARK-44216 URL: https://issues.apache.org/jira/browse/SPARK-44216 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.5.0 Reporter: Amanda Liu SPIP: https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v
[jira] [Commented] (SPARK-44042) SPIP: PySpark Test Framework
[ https://issues.apache.org/jira/browse/SPARK-44042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17735747#comment-17735747 ] Amanda Liu commented on SPARK-44042: [~ste...@apache.org] thank you for the comment! I agree that test output messages are critical here. Thanks also for the hadoop-api-shim example; that's helpful to look at. > SPIP: PySpark Test Framework > > > Key: SPARK-44042 > URL: https://issues.apache.org/jira/browse/SPARK-44042 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 3.5.0 >Reporter: Amanda Liu >Priority: Major > > Currently, there's no official PySpark test framework, but only various > open-source repos and blog posts. Many of these open-source resources are > very popular, which demonstrates user-demand for PySpark testing > capabilities. > [spark-testing-base|https://github.com/holdenk/spark-testing-base] has 1.4k > stars, and [chispa|https://github.com/MrPowers/chispa] has 532k > downloads/month. However, it can be confusing for users to piece together > disparate resources to write their own PySpark tests (see [The Elephant in > the Room: How to Write PySpark > Tests|https://towardsdatascience.com/the-elephant-in-the-room-how-to-write-pyspark-unit-tests-a5073acabc34]). > We can streamline and simplify the testing process by incorporating test > features, such as a PySpark Test Base class (which allows tests to share > Spark sessions) and test util functions (for example, asserting dataframe and > schema equality). Please see the full SPIP document attached: > [https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v].
[jira] [Created] (SPARK-44062) Add PySparkTestBase unit test class
Amanda Liu created SPARK-44062: -- Summary: Add PySparkTestBase unit test class Key: SPARK-44062 URL: https://issues.apache.org/jira/browse/SPARK-44062 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.5.0 Reporter: Amanda Liu SPIP: https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v
[jira] [Created] (SPARK-44061) Add assert_df_equality util function
Amanda Liu created SPARK-44061: -- Summary: Add assert_df_equality util function Key: SPARK-44061 URL: https://issues.apache.org/jira/browse/SPARK-44061 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.5.0 Reporter: Amanda Liu SPIP: https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v
[jira] [Created] (SPARK-44042) SPIP: PySpark Test Framework
Amanda Liu created SPARK-44042: -- Summary: SPIP: PySpark Test Framework Key: SPARK-44042 URL: https://issues.apache.org/jira/browse/SPARK-44042 Project: Spark Issue Type: New Feature Components: PySpark Affects Versions: 3.5.0 Reporter: Amanda Liu Currently, there's no official PySpark test framework, but only various open-source repos and blog posts. Many of these open-source resources are very popular, which demonstrates user-demand for PySpark testing capabilities. [spark-testing-base|https://github.com/holdenk/spark-testing-base] has 1.4k stars, and [chispa|https://github.com/MrPowers/chispa] has 532k downloads/month. However, it can be confusing for users to piece together disparate resources to write their own PySpark tests (see [The Elephant in the Room: How to Write PySpark Tests|https://towardsdatascience.com/the-elephant-in-the-room-how-to-write-pyspark-unit-tests-a5073acabc34]). We can streamline and simplify the testing process by incorporating test features, such as a PySpark Test Base class (which allows tests to share Spark sessions) and test util functions (for example, asserting dataframe and schema equality). Please see the full SPIP document attached: [https://docs.google.com/document/d/1OkyBn3JbEHkkQgSQ45Lq82esXjr9rm2Vj7Ih_4zycRc/edit#heading=h.f5f0u2riv07v].
[jira] [Created] (SPARK-43940) Write test for CANNOT_FIND_BATCH (Prev _LEGACY_ERROR_TEMP_2132)
Amanda Liu created SPARK-43940: -- Summary: Write test for CANNOT_FIND_BATCH (Prev _LEGACY_ERROR_TEMP_2132) Key: SPARK-43940 URL: https://issues.apache.org/jira/browse/SPARK-43940 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Amanda Liu