[ 
https://issues.apache.org/jira/browse/SPARK-54936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-54936:
----------------------------------
    Description: 
PySpark suffers a lot from behaviour changes in upstream libraries such as Pandas, 
PyArrow, and NumPy.

Add tests to monitor the behaviour of key functions/features, including:
 * pa.array *
 * pa.scalar *
 * pa.Array.from_pandas *
 * pa.Array.to_pandas *
 * pa.Array.cast *
 * pa.Table.from_pandas *
 * pa.Table.from_batches
 * pa.Table.from_arrays
 * pa.Table.from_pydict
 * pa.Table.to_pandas *
 * pa.Table.cast *
 * pa.RecordBatch.from_arrays
 * pa.RecordBatch.from_struct_array
 * pa.RecordBatch.from_pylist
 * pd.Series(data=arrow_data) *
 * pd.Series.astype *
 * pd.api.types.is_xxx *
 * time zone handling in pyarrow, pandas
 * zero copy in pandas<->pyarrow data conversions *
 * ...

 

Items marked with * are more important for PySpark.
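
A minimal sketch of such a check for two of the items marked with * (pa.array and pa.scalar), written as plain script-style assertions; the expected values below reflect current PyArrow behaviour and are exactly the kind of assumption these tests should guard:
{code:python}
import pyarrow as pa

# pa.array: type inference for plain Python inputs
assert pa.array([1, 2, 3]).type == pa.int64()
assert pa.array([1.0, 2.0]).type == pa.float64()
assert pa.array(["a", None]).null_count == 1

# pa.scalar: None becomes a null scalar of the requested type
null_scalar = pa.scalar(None, type=pa.int32())
assert not null_scalar.is_valid
{code}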

 

 

The following data should be taken into account (a combined sketch follows the list):

1, Missing values, like nullable data (None), NaN, pd.NaT, etc.;

2, Empty instances (e.g. an empty list, an empty pa.Table/RecordBatch);

3, Invalid values, to check what is supported and what raises an error;

4, Python instances (list, tuple, array, etc.);

5, Pandas instances;

6, NumPy instances.
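
A combined sketch of these data variants, assuming recent Pandas/PyArrow/NumPy releases; it only illustrates the kind of inputs to cover, not the full test matrix:
{code:python}
import numpy as np
import pandas as pd
import pyarrow as pa

# 1, missing values: None, NaN, pd.NaT
arr = pa.array([1.0, None, float("nan")])
ts = pa.Array.from_pandas(pd.Series([pd.Timestamp("2025-01-01"), pd.NaT]))

# 2, empty instances
empty_arr = pa.array([], type=pa.int64())
empty_table = pa.Table.from_arrays([empty_arr], names=["a"])

# 3, invalid values: the conversion is expected to raise a PyArrow error
try:
    pa.array(["x"], type=pa.int64())
except (pa.ArrowInvalid, pa.ArrowTypeError):
    pass

# 4-6, Python / Pandas / NumPy inputs
from_list = pa.array([1, 2, 3])
from_tuple = pa.array((1, 2, 3))
from_series = pa.array(pd.Series([1, 2, 3]))
from_numpy = pa.array(np.array([1, 2, 3], dtype=np.int64))

print(arr.null_count, ts.null_count, empty_table.num_rows)
print(from_list.type, from_tuple.type, from_series.type, from_numpy.type)
{code}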

 

Arguments used in PySpark should be tested, for example the options
{code:python}
pandas_options = {
    "date_as_object": True,
    "coerce_temporal_nanoseconds": True,
}{code}
used in pa.Array.to_pandas.
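
For example, a minimal sketch of exercising these options through pa.Array.to_pandas, assuming PyArrow >= 13.0 (where coerce_temporal_nanoseconds was introduced):
{code:python}
import datetime

import pyarrow as pa

pandas_options = {
    "date_as_object": True,
    "coerce_temporal_nanoseconds": True,
}

arr = pa.array([datetime.date(2025, 1, 1), None])  # date32 array
series = arr.to_pandas(**pandas_options)

# with date_as_object=True the result holds plain datetime.date objects
assert series.dtype == object
assert isinstance(series[0], datetime.date)
{code}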

 

All data types should be tested, including:

1, all Spark data types;

2, all PyArrow data types;

3, all Pandas data types, e.g. for floats: np.float64, pd.Float64Dtype(), 
pd.ArrowDtype(pa.float64()) / double[pyarrow] (a short sketch follows).
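
A short sketch contrasting those three float representations, assuming pandas >= 2.0 with PyArrow installed (required for pd.ArrowDtype):
{code:python}
import numpy as np
import pandas as pd
import pyarrow as pa

s_numpy = pd.Series([1.0, None], dtype=np.float64)                   # NaN-backed
s_masked = pd.Series([1.0, None], dtype=pd.Float64Dtype())           # nullable extension dtype
s_arrow = pd.Series([1.0, None], dtype=pd.ArrowDtype(pa.float64()))  # double[pyarrow]

for s in (s_numpy, s_masked, s_arrow):
    print(s.dtype, pd.api.types.is_float_dtype(s.dtype), s.isna().tolist())
{code}
The three series print different dtypes but should agree on pd.api.types.is_float_dtype and on which positions are missing.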

 

 

The new tests should be dedicated to the upstream libraries; no Spark session should 
be launched.
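
A minimal sketch of such a standalone test module, assuming plain unittest and no pyspark import at all (the class name is illustrative); the zero-copy check mirrors the item marked with * above:
{code:python}
import unittest

import pyarrow as pa


class UpstreamBehaviourTests(unittest.TestCase):
    def test_zero_copy_to_pandas(self):
        # primitive arrays without nulls are expected to convert to pandas without copying
        arr = pa.array([1, 2, 3], type=pa.int64())
        series = arr.to_pandas(zero_copy_only=True)
        self.assertEqual(series.tolist(), [1, 2, 3])


if __name__ == "__main__":
    # no SparkSession is created anywhere in this module
    unittest.main()
{code}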

 


> Monitor behaviour changes from upstream 
> ----------------------------------------
>
>                 Key: SPARK-54936
>                 URL: https://issues.apache.org/jira/browse/SPARK-54936
>             Project: Spark
>          Issue Type: Umbrella
>          Components: PySpark, Tests
>    Affects Versions: 4.2.0
>            Reporter: Ruifeng Zheng
>            Priority: Major
>



