[
https://issues.apache.org/jira/browse/SPARK-44546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Amanda Liu updated SPARK-44546:
-------------------------------
Description:
h2. Summary
This ticket adds a dev utility script to help generate PySpark tests using LLM
responses. The purpose of this experimental script is to encourage PySpark
developers to test their code thoroughly and avoid introducing regressions in
the codebase.
Below, we outline some common edge case scenarios for PySpark DataFrame APIs,
focusing on their arguments. Many of these edge cases are passed to the LLM
through the script's base prompt.
Please note that this list is not exhaustive, but rather a starting point. Some
of these cases may not apply, depending on the situation. We encourage all
PySpark developers to carefully consider edge case scenarios when writing tests.
h2. Table of Contents
# None
# Ints
# Floats
# String
# Single column / column name
# Multi column / column names
# DataFrame argument
h3. 1. None
* Empty input
* None type
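A minimal sketch of these cases, assuming an active SparkSession (the exact exception type varies by API, so the check only asserts that some error is raised):
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(5)

# Empty input: argument-less calls should be well-defined no-ops
assert df.drop().columns == ["id"]

# None type: invalid None arguments should fail with a clear error
raised = False
try:
    df.select(None)
except Exception:
    raised = True
assert raised
{code}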
h3. 2. Ints
* Negatives
* 0
* value > Int.MaxValue
* value < Int.MinValue
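For example (a sketch assuming an active SparkSession; 2**31 - 1 is Int.MaxValue on the JVM side):
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(5)

assert df.limit(0).count() == 0          # 0 is a valid boundary
assert df.limit(2**31 - 1).count() == 5  # Int.MaxValue is still a legal int
try:
    df.limit(-1)                         # negatives should be rejected
except Exception:
    pass
try:
    df.limit(2**31)                      # > Int.MaxValue cannot fit in a JVM int
except Exception:
    pass
{code}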
h3. 3. Floats
* Negatives
* 0.0
* Float("nan")
* Float("inf")
* Float("-inf")
* decimal.Decimal
* numpy.float16
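A sketch of checks over these values, assuming an active SparkSession:
{code:python}
import decimal

from pyspark.sql import SparkSession
from pyspark.sql import functions as sf

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(float("nan"),), (float("inf"),), (float("-inf"),), (-0.0,)], ["v"]
)

# NaN compares unequal to everything, including itself; use isnan()
assert df.filter(sf.isnan("v")).count() == 1
assert df.filter(sf.col("v") == float("inf")).count() == 1

# decimal.Decimal maps to DecimalType rather than DoubleType;
# numpy inputs (e.g. numpy.float16) may also need coverage
ddf = spark.createDataFrame([(decimal.Decimal("1.5"),)], ["d"])
assert ddf.schema["d"].dataType.typeName() == "decimal"
{code}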
h3. 4. String
* Special characters
* Spaces
* Empty strings
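For example, assuming an active SparkSession:
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as sf

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("",), ("   ",), ("a'b\"c",)], ["s"])

assert df.filter(sf.col("s") == "").count() == 1          # empty string
assert df.filter(sf.trim("s") == "").count() == 2         # whitespace-only
assert df.filter(sf.col("s").contains("'")).count() == 1  # special characters
{code}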
h3. 5. Single column / column name
* Non-existent column
* Empty column name
* Column name with special characters, e.g. dots
* Multiple columns with the same name
* Nested column vs. quoted column name, e.g. 'a.b.c' vs. '`a.b.c`'
* Column of special types, e.g. nested types
* Column containing special values, e.g. null
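The dotted-name case above is easy to get wrong; a sketch, assuming an active SparkSession:
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1).withColumnRenamed("id", "a.b.c")

# Unquoted, the dots are parsed as nested field access and fail to resolve
try:
    df.select("a.b.c")
except Exception:
    pass

# Backticks make the dotted name resolve as a single top-level column
assert df.select("`a.b.c`").columns == ["a.b.c"]
{code}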
h3. 6. Multi column / column names
* Empty input, e.g. DataFrame.drop()
* Special cases for each single column
* Mixing columns with column names, e.g. DataFrame.drop("col1", df.col2, "col3")
* Duplicated columns, e.g. DataFrame.drop("col1", col("col1"))
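Since Spark 3.4, DataFrame.drop accepts a mix of column names and Column objects, so both cases above can be exercised directly. A sketch, assuming an active SparkSession:
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as sf

spark = SparkSession.builder.getOrCreate()
df = spark.range(5).withColumn("col1", sf.lit(1)).withColumn("col2", sf.lit(2))

assert df.drop().columns == ["id", "col1", "col2"]  # empty input is a no-op
assert df.drop("col1", df.col2).columns == ["id"]   # mixed names and Columns
assert df.drop("col1", sf.col("col1")).columns == ["id", "col2"]  # duplicates
{code}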
h3. 7. DataFrame argument
* Empty DataFrame, e.g. spark.range(5).limit(0)
* DataFrame with 0 columns, e.g. spark.range(5).drop('id')
* Dataset with repeated arguments
* Local dataset (pd.DataFrame) containing unsupported data types
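A sketch of the empty and zero-column cases, assuming an active SparkSession:
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

empty = spark.range(5).limit(0)      # empty DataFrame, schema preserved
assert empty.count() == 0 and empty.columns == ["id"]
assert empty.union(spark.range(3)).count() == 3

no_cols = spark.range(5).drop("id")  # five rows, zero columns
assert no_cols.count() == 5 and no_cols.columns == []
{code}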
> Add a dev utility to generate PySpark tests with LLM
> ----------------------------------------------------
>
> Key: SPARK-44546
> URL: https://issues.apache.org/jira/browse/SPARK-44546
> Project: Spark
> Issue Type: New Feature
> Components: PySpark
> Affects Versions: 3.5.0
> Reporter: Amanda Liu
> Priority: Major