[
https://issues.apache.org/jira/browse/SPARK-54936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ruifeng Zheng updated SPARK-54936:
----------------------------------
Description:
PySpark is frequently affected by behaviour changes in upstream libraries such as
pandas, PyArrow, and NumPy.
We should add tests that monitor the behaviour of key functions and features, such as:
* pa.Array.to_pandas
* pa.Array.from_pandas
* pa.Array.cast
* pa.array
* pa.scalar
* pd.Series(data=arrow_data)
* time zone handling in PyArrow and pandas
* zero-copy behaviour in pandas <-> PyArrow data conversions
* etc.
The new tests should exercise only the upstream libraries; no Spark code should be
involved.
The test data should include:
1. Missing values, such as nullable data, NaN, pd.NaT, etc.
2. Empty instances (e.g. an empty list)
3. Plain Python instances (list, tuple, array, etc.)
4. pandas instances
5. NumPy instances
was:
PySpark is frequently affected by behaviour changes in upstream libraries such as
pandas, PyArrow, and NumPy.
We should add tests that monitor the behaviour of key functions and features, such as:
* pa.Array.to_pandas
* pa.Array.from_pandas
* pa.Array.cast
* pa.array
* pa.scalar
* pd.Series(data=arrow_data)
* time zone handling in PyArrow and pandas
* zero-copy behaviour in pandas <-> PyArrow data conversions
* etc.
The new tests should exercise only the upstream libraries; no Spark code should be
involved.
The input data should include:
1. Missing values, such as nullable data, NaN, pd.NaT, etc.
2. Empty instances (e.g. an empty list)
3. Plain Python instances (list, tuple, array, etc.)
4. pandas instances
5. NumPy instances
> Monitor upstream behaviour changes
> ----------------------------------
>
> Key: SPARK-54936
> URL: https://issues.apache.org/jira/browse/SPARK-54936
> Project: Spark
> Issue Type: Umbrella
> Components: PySpark, Tests
> Affects Versions: 4.2.0
> Reporter: Ruifeng Zheng
> Priority: Major