[ https://issues.apache.org/jira/browse/SPARK-54936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ruifeng Zheng updated SPARK-54936:
----------------------------------
Description:
PySpark suffers a lot from behaviour changes in upstream libraries such as Pandas, PyArrow, and NumPy.
Add tests to monitor the behaviour of key functions/features (a minimal test sketch is given below the list), including:
* pa.array *
* pa.scalar *
* pa.Array.from_pandas *
* pa.Array.to_pandas *
* pa.Array.cast *
* pa.Table.from_pandas *
* pa.Table.from_batches
* pa.Table.from_arrays
* pa.Table.from_pydict
* pa.Table.to_pandas *
* pa.Table.cast *
* pa.RecordBatch.from_arrays
* pa.RecordBatch.from_struct_array
* pa.RecordBatch.from_pylist
* pd.Series(data=arrow_data) *
* pd.Series.astype *
* pd.api.types.is_xxx *
* time zone handling in pyarrow, pandas *
* zero copy in pandas<->pyarrow data conversions *
* ...
Items marked with * are more important for PySpark.
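As an illustration, a minimal sketch of such monitoring tests (the class and test names are hypothetical, and the expected results are assumptions about current Pandas/PyArrow behaviour rather than guarantees):
{code:python}
import unittest

import numpy as np
import pandas as pd
import pyarrow as pa


class UpstreamBehaviourTests(unittest.TestCase):
    """Hypothetical tests that pin down upstream behaviour PySpark relies on."""

    def test_pa_array_type_inference(self):
        # pa.array is expected to infer int64 for plain Python ints (assumed behaviour).
        arr = pa.array([1, 2, None])
        self.assertEqual(arr.type, pa.int64())
        self.assertEqual(arr.null_count, 1)

    def test_pa_array_to_pandas_nulls(self):
        # An int64 array with nulls is expected to come back as float64 with NaN
        # unless a nullable dtype is requested (assumed default behaviour).
        ser = pa.array([1, 2, None]).to_pandas()
        self.assertEqual(ser.dtype, np.dtype("float64"))
        self.assertTrue(pd.isna(ser.iloc[2]))


if __name__ == "__main__":
    unittest.main()
{code}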
The following data should be taken into account (example fixtures are sketched after this list):
1. Missing values, e.g. nullable data, NaN, pd.NaT, etc.;
2. Empty datasets (e.g. an empty list/dict, an empty pa.Table/RecordBatch);
3. Invalid values, to check what is supported and which errors are raised;
4. Python instances (list, tuple, array, etc.);
5. Pandas instances;
6. NumPy instances.
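For example, the input matrix for these tests could be built along the following lines (a sketch; the concrete fixtures are assumptions, not an agreed design):
{code:python}
import numpy as np
import pandas as pd
import pyarrow as pa

# 1. Missing values in different flavours.
MISSING = [None, float("nan"), np.nan, pd.NaT, pd.NA]

# 2. Empty datasets.
EMPTY = [[], {}, pa.table({"c": pa.array([], type=pa.int64())})]

# 4-6. Python, Pandas, and NumPy containers holding the same logical data.
EQUIVALENT_INPUTS = [
    [1, 2, 3],                            # Python list
    (1, 2, 3),                            # Python tuple
    pd.Series([1, 2, 3]),                 # Pandas Series
    np.array([1, 2, 3], dtype=np.int64),  # NumPy array
]


# 3. Invalid values: record which exception type (if any) upstream raises.
def invalid_value_error():
    try:
        pa.array([1, 2, 3], type=pa.string())  # ints are not valid string values
    except (pa.ArrowInvalid, pa.ArrowTypeError) as exc:
        return type(exc).__name__
    return None
{code}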
Arguments used in PySpark should be taken into account, such as
{code:python}
pandas_options = {
    "date_as_object": True,
    "coerce_temporal_nanoseconds": True,
}
{code}
as used in pa.Array.to_pandas.
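For example, a monitoring test around these options might look like the sketch below; the expected dtypes reflect assumed behaviour of recent PyArrow/Pandas versions, which is exactly what such a test would pin down:
{code:python}
import datetime

import pyarrow as pa

pandas_options = {
    "date_as_object": True,
    "coerce_temporal_nanoseconds": True,
}

# date32 values: with date_as_object=True they are expected to come back
# as datetime.date objects in an object-dtype Series (assumed behaviour).
dates = pa.array([datetime.date(2024, 1, 1), None], type=pa.date32())
date_series = dates.to_pandas(**pandas_options)
assert date_series.dtype == object
assert isinstance(date_series.iloc[0], datetime.date)

# Microsecond timestamps: coerce_temporal_nanoseconds=True is expected to
# force the nanosecond resolution that PySpark assumes (assumed behaviour).
ts = pa.array([datetime.datetime(2024, 1, 1, 12, 0)], type=pa.timestamp("us"))
ts_series = ts.to_pandas(**pandas_options)
assert ts_series.dtype == "datetime64[ns]"
{code}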
All datatypes should be tested (see the float example after this list), including:
1. All Spark datatypes;
2. All PyArrow datatypes;
3. All Pandas datatypes, e.g. for floats: np.float64, pd.Float64Dtype(), and pd.ArrowDtype(pa.float64()) / double[pyarrow].
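To illustrate the last point, one logical "double" type surfaces as several distinct dtypes on the Pandas side, and the tests would need to cover each representation (a sketch; the equivalences shown are assumptions that the tests themselves should verify):
{code:python}
import numpy as np
import pandas as pd
import pyarrow as pa

# Three different Pandas representations of double values.
float_series = [
    pd.Series([1.0, None], dtype=np.float64),                   # NumPy-backed, None becomes NaN
    pd.Series([1.0, None], dtype=pd.Float64Dtype()),            # nullable extension dtype, None becomes pd.NA
    pd.Series([1.0, None], dtype=pd.ArrowDtype(pa.float64())),  # Arrow-backed double[pyarrow]
]

for ser in float_series:
    # All three are expected to be recognised as float dtypes (assumed behaviour).
    assert pd.api.types.is_float_dtype(ser.dtype)
    # And all three are expected to convert to Arrow float64 via pa.Array.from_pandas.
    assert pa.Array.from_pandas(ser).type == pa.float64()
{code}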
The new tests should be dedicated to the upstream libraries; no Spark session should be launched.
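For instance, such a test module would import only the upstream libraries and could record their versions so that behaviour changes are easy to trace in CI logs (a sketch, not a required layout; the class name is hypothetical):
{code:python}
import unittest

import numpy as np
import pandas as pd
import pyarrow as pa
# Intentionally no pyspark imports: these tests target upstream libraries only
# and must not launch a Spark session.


class UpstreamVersionInfo(unittest.TestCase):
    def test_report_versions(self):
        # Recording the versions under test makes failures self-explanatory when
        # an upstream behaviour change breaks one of the monitoring tests.
        self.assertTrue(np.__version__ and pd.__version__ and pa.__version__)
        print(f"numpy={np.__version__} pandas={pd.__version__} pyarrow={pa.__version__}")


if __name__ == "__main__":
    unittest.main()
{code}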
> Monitor behaviour changes from upstream
> ----------------------------------------
>
> Key: SPARK-54936
> URL: https://issues.apache.org/jira/browse/SPARK-54936
> Project: Spark
> Issue Type: Umbrella
> Components: PySpark, Tests
> Affects Versions: 4.2.0
> Reporter: Ruifeng Zheng
> Priority: Major
>