[ 
https://issues.apache.org/jira/browse/SPARK-27921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27921:
---------------------------------
    Description: 
This JIRA targets to improve Python test coverage in particular about 
{{ExtractPythonUDFs}}.
 This rule has caused many regressions or issues such as SPARK-27803, 
SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.
 We should convert *.sql test cases that can be affected by this rule 
{{ExtractPythonUDFs}} like 
[https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql]
 Namely most of plan related test cases might have to be converted.

*Here is the rough contribution guide to follow:*

Make sure you have Python with Pandas 0.23.2+ and PyArrow 0.12.1+. Check if 
you're able to do this:
{code:java}
>>> import pandas
>>> pandas.__version__
'0.23.4'
>>> import pyarrow
>>> pyarrow.__version__
'0.13.0'
>>> pyarrow.Table.from_pandas(pandas.DataFrame({'a': [1,2,3]}))
pyarrow.Table
a: int64
metadata
--------
OrderedDict([(b'pandas',
              b'{"index_columns": [{"kind": "range", "name": null, "start": '
              b'0, "stop": 3, "step": 1}], "column_indexes": [{"name": null,'
              b' "field_name": null, "pandas_type": "unicode", "numpy_type":'
              b' "object", "metadata": {"encoding": "UTF-8"}}], "columns": ['
              b'{"name": "a", "field_name": "a", "pandas_type": "int64", "nu'
              b'mpy_type": "int64", "metadata": null}], "creator": {"library'
              b'": "pyarrow", "version": "0.13.0"}, "pandas_version": null}')])
{code}
 
 1. Copy and paste {{sql/core/src/test/resources/sql-tests/inputs/xxx.sql}} 
file into {{sql/core/src/test/resources/sql-tests/inputs/udf/udf-xxx.sql}}

2. Keep the comments and state that this file was copied from 
{{sql/core/src/test/resources/sql-tests/inputs/xxx.sql}}, for now.
 For instance, let's add a comment as below on the top:
{code:java}
-- This test file was converted from xxx.sql.
{code}
3. Run it below:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git add .
{code}
4. Insert one or multiple {{udf(...)}} into each statement. It is not required 
to add more combinations.
 And it is not strict about where to insert. Ideally, we should try to put udf 
differently for each statement.

5. Run it below again:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git diff
# or git diff --no-index 
sql/core/src/test/resources/sql-tests/results/xxx.sql.out 
sql/core/src/test/resources/sql-tests/results/udf/xxx.sql.out
{code}
6. Compare results with original file, 
{{sql/core/src/test/resources/sql-tests/results/xxx.sql.out}}

7. If there are diff, analyze it, file or find the JIRA, skip the tests with 
comments. Please see [this 
comment|https://github.com/apache/spark/pull/25090#discussion_r301880585] when 
you file a JIRA.

8. Run without generating golden files and check:
{code:java}
build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql"
{code}
9. When you open a PR. please attach {{git diff --no-index 
sql/core/src/test/resources/sql-tests/results/xxx.sql.out 
sql/core/src/test/resources/sql-tests/results/udf/xxx.sql.out}} in the PR 
description with the template below:
{code:java}
<details><summary>Diff comparing to 'xxx.sql'</summary>
<p>

```diff
...  # here you put 'git diff' results
```

</p>
</details>
{code}
10. You're ready. Please go for a PR! If the PR contains other minor fixes, use 
{{[SPARK-XXXXX][SQL][PYTHON]}} prefix in the PR title. If the PR is purely 
about tests, use {{[SPARK-XXXXX][SQL][PYTHON][TESTS]}}.
 See [https://github.com/apache/spark/pull/25069] as an example.

Note that registered UDFs all return strings - so there are some differences 
are expected.
 Note that this JIRA targets plan specific cases in general.
 Note that one {{output.sql.out}} file is shared for three UDF test cases 
(Scala UDF, Python UDF, and Pandas UDF). Beware of it when you fix the tests.
 Note that this guide is supposed to be updated continuously given how it goes.
 Note that this test case uses the integrated UDF test base. See 
[https://github.com/apache/spark/pull/24752] if you're interested in it or find 
an issue.

  was:
This JIRA targets to improve Python test coverage in particular about 
{{ExtractPythonUDFs}}.
 This rule has caused many regressions or issues such as SPARK-27803, 
SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.
 We should convert *.sql test cases that can be affected by this rule 
{{ExtractPythonUDFs}} like 
[https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql]
 Namely most of plan related test cases might have to be converted.

*Here is the rough contribution guide to follow:*

Make sure you have Python with Pandas 0.23.2+ and PyArrow 0.12.1+. Check if 
you're able to do this:
{code:java}
>>> import pandas
>>> pandas.__version__
'0.23.4'
>>> import pyarrow
>>> pyarrow.__version__
'0.13.0'
>>> pyarrow.Table.from_pandas(pandas.DataFrame({'a': [1,2,3]}))
pyarrow.Table
a: int64
metadata
--------
OrderedDict([(b'pandas',
              b'{"index_columns": [{"kind": "range", "name": null, "start": '
              b'0, "stop": 3, "step": 1}], "column_indexes": [{"name": null,'
              b' "field_name": null, "pandas_type": "unicode", "numpy_type":'
              b' "object", "metadata": {"encoding": "UTF-8"}}], "columns": ['
              b'{"name": "a", "field_name": "a", "pandas_type": "int64", "nu'
              b'mpy_type": "int64", "metadata": null}], "creator": {"library'
              b'": "pyarrow", "version": "0.13.0"}, "pandas_version": null}')])
{code}
 
 1. Copy and paste {{sql/core/src/test/resources/sql-tests/inputs/xxx.sql}} 
file into {{sql/core/src/test/resources/sql-tests/inputs/udf/udf-xxx.sql}}

2. Keep the comments and state that this file was copied from 
{{sql/core/src/test/resources/sql-tests/inputs/xxx.sql}}, for now.
 For instance, let's add a comment as below on the top:
{code:java}
-- This test file was converted from xxx.sql.
{code}
3. Run it below:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git add .
{code}
4. Insert one or multiple \{{udf(...)}} into each statement. It is not required 
to add more combinations.
 And it is not strict about where to insert. Ideally, we should try to put udf 
differently for each statement.

5. Run it below again:
{code:java}
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z 
udf/udf-xxx.sql"
git diff
# or git diff --no-index 
sql/core/src/test/resources/sql-tests/results/xxx.sql.out 
sql/core/src/test/resources/sql-tests/results/udf/xxx.sql.out
{code}
6. Compare results with original file, 
{{sql/core/src/test/resources/sql-tests/results/xxx.sql.out}}

7. If there are diff, analyze it, file or find the JIRA, skip the tests with 
comments.

8. Run without generating golden files and check:
{code:java}
build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql"
{code}
9. When you open a PR. please attach {{git diff --no-index 
sql/core/src/test/resources/sql-tests/results/xxx.sql.out 
sql/core/src/test/resources/sql-tests/results/udf/xxx.sql.out}} in the PR 
description with the template below:
{code:java}
<details><summary>Diff comparing to 'xxx.sql'</summary>
<p>

```diff
...  # here you put 'git diff' results
```

</p>
</details>
{code}
10. You're ready. Please go for a PR! If the PR contains other minor fixes, use 
{{[SPARK-XXXXX][SQL][PYTHON]}} prefix in the PR title. If the PR is purely 
about tests, use {{[SPARK-XXXXX][SQL][PYTHON][TESTS]}}.
 See [https://github.com/apache/spark/pull/25069] as an example.

Note that registered UDFs all return strings - so there are some differences 
are expected.
 Note that this JIRA targets plan specific cases in general.
 Note that one {{output.sql.out}} file is shared for three UDF test cases 
(Scala UDF, Python UDF, and Pandas UDF). Beware of it when you fix the tests.
 Note that this guide is supposed to be updated continuously given how it goes.
 Note that this test case uses the integrated UDF test base. See 
[https://github.com/apache/spark/pull/24752] if you're interested in it or find 
an issue.


> Convert applicable *.sql tests into UDF integrated test base
> ------------------------------------------------------------
>
>                 Key: SPARK-27921
>                 URL: https://issues.apache.org/jira/browse/SPARK-27921
>             Project: Spark
>          Issue Type: Umbrella
>          Components: PySpark, SQL, Tests
>    Affects Versions: 3.0.0
>            Reporter: Hyukjin Kwon
>            Priority: Major
>
> This JIRA targets to improve Python test coverage in particular about 
> {{ExtractPythonUDFs}}.
>  This rule has caused many regressions or issues such as SPARK-27803, 
> SPARK-26147, SPARK-26864, SPARK-26293, SPARK-25314 and SPARK-24721.
>  We should convert *.sql test cases that can be affected by this rule 
> {{ExtractPythonUDFs}} like 
> [https://github.com/apache/spark/blob/f5317f10b25bd193cf5026a8f4fd1cd1ded8f5b4/sql/core/src/test/resources/sql-tests/inputs/udf/udf-inner-join.sql]
>  Namely most of plan related test cases might have to be converted.
> *Here is the rough contribution guide to follow:*
> Make sure you have Python with Pandas 0.23.2+ and PyArrow 0.12.1+. Check if 
> you're able to do this:
> {code:java}
> >>> import pandas
> >>> pandas.__version__
> '0.23.4'
> >>> import pyarrow
> >>> pyarrow.__version__
> '0.13.0'
> >>> pyarrow.Table.from_pandas(pandas.DataFrame({'a': [1,2,3]}))
> pyarrow.Table
> a: int64
> metadata
> --------
> OrderedDict([(b'pandas',
>               b'{"index_columns": [{"kind": "range", "name": null, "start": '
>               b'0, "stop": 3, "step": 1}], "column_indexes": [{"name": null,'
>               b' "field_name": null, "pandas_type": "unicode", "numpy_type":'
>               b' "object", "metadata": {"encoding": "UTF-8"}}], "columns": ['
>               b'{"name": "a", "field_name": "a", "pandas_type": "int64", "nu'
>               b'mpy_type": "int64", "metadata": null}], "creator": {"library'
>               b'": "pyarrow", "version": "0.13.0"}, "pandas_version": 
> null}')])
> {code}
>  
>  1. Copy and paste {{sql/core/src/test/resources/sql-tests/inputs/xxx.sql}} 
> file into {{sql/core/src/test/resources/sql-tests/inputs/udf/udf-xxx.sql}}
> 2. Keep the comments and state that this file was copied from 
> {{sql/core/src/test/resources/sql-tests/inputs/xxx.sql}}, for now.
>  For instance, let's add a comment as below on the top:
> {code:java}
> -- This test file was converted from xxx.sql.
> {code}
> 3. Run it below:
> {code:java}
> SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- 
> -z udf/udf-xxx.sql"
> git add .
> {code}
> 4. Insert one or multiple {{udf(...)}} into each statement. It is not 
> required to add more combinations.
>  And it is not strict about where to insert. Ideally, we should try to put 
> udf differently for each statement.
> 5. Run it below again:
> {code:java}
> SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- 
> -z udf/udf-xxx.sql"
> git diff
> # or git diff --no-index 
> sql/core/src/test/resources/sql-tests/results/xxx.sql.out 
> sql/core/src/test/resources/sql-tests/results/udf/xxx.sql.out
> {code}
> 6. Compare results with original file, 
> {{sql/core/src/test/resources/sql-tests/results/xxx.sql.out}}
> 7. If there are diff, analyze it, file or find the JIRA, skip the tests with 
> comments. Please see [this 
> comment|https://github.com/apache/spark/pull/25090#discussion_r301880585] 
> when you file a JIRA.
> 8. Run without generating golden files and check:
> {code:java}
> build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/udf-xxx.sql"
> {code}
> 9. When you open a PR. please attach {{git diff --no-index 
> sql/core/src/test/resources/sql-tests/results/xxx.sql.out 
> sql/core/src/test/resources/sql-tests/results/udf/xxx.sql.out}} in the PR 
> description with the template below:
> {code:java}
> <details><summary>Diff comparing to 'xxx.sql'</summary>
> <p>
> ```diff
> ...  # here you put 'git diff' results
> ```
> </p>
> </details>
> {code}
> 10. You're ready. Please go for a PR! If the PR contains other minor fixes, 
> use {{[SPARK-XXXXX][SQL][PYTHON]}} prefix in the PR title. If the PR is 
> purely about tests, use {{[SPARK-XXXXX][SQL][PYTHON][TESTS]}}.
>  See [https://github.com/apache/spark/pull/25069] as an example.
> Note that registered UDFs all return strings - so there are some differences 
> are expected.
>  Note that this JIRA targets plan specific cases in general.
>  Note that one {{output.sql.out}} file is shared for three UDF test cases 
> (Scala UDF, Python UDF, and Pandas UDF). Beware of it when you fix the tests.
>  Note that this guide is supposed to be updated continuously given how it 
> goes.
>  Note that this test case uses the integrated UDF test base. See 
> [https://github.com/apache/spark/pull/24752] if you're interested in it or 
> find an issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to