[ 
https://issues.apache.org/jira/browse/SPARK-27921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27921:
---------------------------------
    Description: 
This JIRA aims to improve Python test coverage, in particular for the
{{ExtractPythonUDFs}} rule.
 This rule has caused many regressions and issues, such as SPARK-27803,
SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.
 We should convert the *.sql test cases that can be affected by
{{ExtractPythonUDFs}} into UDF-integrated versions, like
[https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql]
 That is, most plan-related test cases may need to be converted.

*Here is the rough contribution guide to follow:*

Make sure you have Python with Pandas 0.23.2+ and PyArrow 0.12.1+ installed.
Check that the following works:
{code:java}
>>> import pandas
>>> pandas.__version__
'0.23.4'
>>> import pyarrow
>>> pyarrow.__version__
'0.13.0'
>>> pyarrow.Table.from_pandas(pandas.DataFrame({'a': [1,2,3]}))
pyarrow.Table
a: int64
metadata
--------
OrderedDict([(b'pandas',
              b'{"index_columns": [{"kind": "range", "name": null, "start": '
              b'0, "stop": 3, "step": 1}], "column_indexes": [{"name": null,'
              b' "field_name": null, "pandas_type": "unicode", "numpy_type":'
              b' "object", "metadata": {"encoding": "UTF-8"}}], "columns": ['
              b'{"name": "a", "field_name": "a", "pandas_type": "int64", "nu'
              b'mpy_type": "int64", "metadata": null}], "creator": {"library'
              b'": "pyarrow", "version": "0.13.0"}, "pandas_version": null}')])
{code}
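
The same check can be scripted instead of typed into the REPL. The following is only a minimal sketch: the helper names ({{version_tuple}}, {{meets_minimum}}, {{check_environment}}) are made up for illustration, and the minimum versions are the ones stated above.

```python
import importlib
import re

# Minimum versions taken from the guide above.
MINIMUM_VERSIONS = {"pandas": "0.23.2", "pyarrow": "0.12.1"}


def version_tuple(version):
    """Turn '0.23.4' into (0, 23, 4); non-numeric suffixes such as 'rc1' are ignored."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two dotted version strings numerically, component by component."""
    return version_tuple(installed) >= version_tuple(minimum)


def check_environment(minimums=MINIMUM_VERSIONS):
    """Return {package: (installed_version_or_None, ok)} for each required package."""
    report = {}
    for name, minimum in minimums.items():
        try:
            module = importlib.import_module(name)
        except ImportError:
            # Package is not installed at all.
            report[name] = (None, False)
            continue
        installed = getattr(module, "__version__", "0")
        report[name] = (installed, meets_minimum(installed, minimum))
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_environment().items():
        status = "OK" if ok else "missing or too old"
        print(f"{name}: {installed} -> {status} (need {MINIMUM_VERSIONS[name]}+)")
```

This only compares version numbers; it does not replace the {{pyarrow.Table.from_pandas}} round trip shown above, which also confirms that the two libraries work together.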
 
 1. Copy {{sql/core/src/test/resources/sql-tests/inputs/xxx.sql}} to
{{sql/core/src/test/resources/sql-tests/inputs/udf/udf-xxx.sql}}.

2. Keep the existing comments and, for now, add a comment stating that this
file was copied from {{sql/core/src/test/resources/sql-tests/inputs/xxx.sql}}.

3. Run the command below to generate the golden file, then stage the changes:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql"
git add .
{code}
4. Insert {{udf(...)}} into each statement. You do not need to add more
combinations, and there is no strict rule about where to insert it. Ideally,
place the UDF in a different position in each statement.

5. Run the command again and review the changes:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql"
git diff
# or git diff --no-index sql/core/src/test/resources/sql-tests/results/xxx.sql.out sql/core/src/test/resources/sql-tests/results/udf/udf-xxx.sql.out
{code}
6. Compare the results with the original golden file,
{{sql/core/src/test/resources/sql-tests/results/xxx.sql.out}}.

7. If there are differences, analyze them, file a new JIRA or find an existing
one, and skip the affected tests with a comment.

8. Run the test without generating golden files and check that it passes:
{code:java}
build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql"
{code}
9. When you open a PR, please attach the output of {{git diff --no-index
sql/core/src/test/resources/sql-tests/results/xxx.sql.out
sql/core/src/test/resources/sql-tests/results/udf/udf-xxx.sql.out}} to the PR
description, using the template below:
{code:java}
<details><summary>Diff comparing to 'xxx.sql'</summary>
<p>

```diff
...  # here you put 'git diff' results
```

</p>
</details>
{code}
10. You're ready. Please go for a PR! See 
https://github.com/apache/spark/pull/25069 as an example.
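
The mechanical parts of the steps above (copying the file, building the sbt and diff commands, and wrapping the diff in the PR template) could be scripted roughly as follows. This is only a sketch under stated assumptions: the function names are hypothetical, the wording of the notice comment is made up, and it assumes the golden file for {{udf/udf-xxx.sql}} is written to {{results/udf/udf-xxx.sql.out}}.

```python
from pathlib import Path

# Test resource directory used throughout the guide.
SQL_TESTS = Path("sql/core/src/test/resources/sql-tests")


def udf_paths(name, root=SQL_TESTS):
    """Map a test name such as 'xxx' to the four files the guide refers to."""
    return {
        "input": root / "inputs" / f"{name}.sql",
        "udf_input": root / "inputs" / "udf" / f"udf-{name}.sql",
        "golden": root / "results" / f"{name}.sql.out",
        # Assumption: the result file mirrors the input path under results/.
        "udf_golden": root / "results" / "udf" / f"udf-{name}.sql.out",
    }


def copy_with_notice(name, root=SQL_TESTS):
    """Steps 1-2: copy xxx.sql to udf/udf-xxx.sql and note where it came from."""
    paths = udf_paths(name, root)
    original = paths["input"].read_text()
    # Hypothetical wording; the guide only asks to state the origin of the file.
    notice = f"-- This test file was copied from {name}.sql.\n"
    paths["udf_input"].parent.mkdir(parents=True, exist_ok=True)
    paths["udf_input"].write_text(notice + original)
    return paths["udf_input"]


def sbt_command(name, generate_golden):
    """Steps 3, 5 and 8: the sbt invocation, with or without golden file generation."""
    prefix = "SPARK_GENERATE_GOLDEN_FILES=1 " if generate_golden else ""
    return f'{prefix}build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-{name}.sql"'


def diff_command(name, root=SQL_TESTS):
    """Steps 5 and 9: compare the original golden file with the UDF one."""
    paths = udf_paths(name, root)
    return f"git diff --no-index {paths['golden']} {paths['udf_golden']}"


def pr_details(name, diff_text):
    """Step 9: wrap the diff output in the collapsible template for the PR description."""
    fence = "`" * 3  # built at runtime so this example stays a single code block
    return (
        f"<details><summary>Diff comparing to '{name}.sql'</summary>\n"
        "<p>\n\n"
        f"{fence}diff\n{diff_text}\n{fence}\n\n"
        "</p>\n"
        "</details>\n"
    )


if __name__ == "__main__":
    print(sbt_command("xxx", generate_golden=True))
    print(diff_command("xxx"))
```

Inserting {{udf(...)}} into the statements (step 4) and analyzing any differences (step 7) still have to be done by hand.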

Note that the registered UDFs all return strings, so some differences are
expected.
Note that this JIRA targets plan-specific cases in general.
Note that one {{output.sql.out}} file is shared by three UDF test cases (Scala
UDF, Python UDF, and Pandas UDF). Keep this in mind when you fix the tests.
Note that this guide will be updated continuously as the work progresses.
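
To see why string-returning UDFs can change results, here is a plain-Python analogy. It is not Spark code: the {{udf}} function below is a hypothetical stand-in that converts its input to a string, and passing NULL through unchanged is an assumption made for this illustration.

```python
def udf(value):
    """Stand-in for a test UDF: pass None through, otherwise return a string."""
    return None if value is None else str(value)


values = [9, 10, 2]

# Ordering the raw integers is numeric.
numeric_order = sorted(values)

# Ordering the UDF output is lexicographic, because the values are now strings.
string_order = sorted(udf(v) for v in values)

print(numeric_order)  # [2, 9, 10]
print(string_order)   # ['10', '2', '9']
```

Differences of this kind (ordering, or output types changing to string) are expected and are not necessarily bugs in {{ExtractPythonUDFs}}.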



> Convert applicable *.sql tests into UDF integrated test base
> ------------------------------------------------------------
>
>                 Key: SPARK-27921
>                 URL: https://issues.apache.org/jira/browse/SPARK-27921
>             Project: Spark
>          Issue Type: Umbrella
>          Components: PySpark, SQL
>    Affects Versions: 3.0.0
>            Reporter: Hyukjin Kwon
>            Priority: Major
>


