[ https://issues.apache.org/jira/browse/SPARK-48302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ian Cook updated SPARK-48302:
-----------------------------
    Description: 
Because of a limitation in PyArrow, when PyArrow Tables are passed to 
{{spark.createDataFrame()}}, null values in MapArray columns are replaced 
with empty lists.
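
For context, a minimal reproduction sketch (hypothetical; assumes a running SparkSession and Spark 4.0's support for passing PyArrow Tables to {{spark.createDataFrame()}}):

{code:python}
import pyarrow as pa
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Map column whose second row is null.
table = pa.table({
    "m": pa.array(
        [[("a", 1)], None],
        type=pa.map_(pa.string(), pa.int64()),
    )
})

spark.createDataFrame(table).show()
# Expected: row 2 is NULL.
# Actual (this issue): the null is replaced with an empty map.
{code}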

The PySpark function where this happens is 
{{pyspark.sql.pandas.types._check_arrow_array_timestamps_localize}}.

Also see [https://github.com/apache/arrow/issues/41684].

See the skipped tests and the TODO comments mentioning SPARK-48302 in the PySpark codebase.

A possible fix will involve adding a {{mask}} argument to 
{{pa.MapArray.from_arrays}}. But since older versions of PyArrow (which 
PySpark will continue to support for a while) won't have this argument, we 
will need a check like:

{code:python}
if LooseVersion(pa.__version__) >= LooseVersion("1X.0.0"):
{code}

or

{code:python}
from inspect import signature

"mask" in signature(pa.MapArray.from_arrays).parameters
{code}

and only pass {{mask}} if that's true.
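
For illustration, a sketch of how that guard might look (the helper name {{_map_array_with_mask}} is hypothetical, not actual PySpark code; it uses the signature-based check above):

{code:python}
from inspect import signature

import pyarrow as pa

# True only if this PyArrow build's MapArray.from_arrays accepts a mask argument.
_HAS_MASK = "mask" in signature(pa.MapArray.from_arrays).parameters


def _map_array_with_mask(offsets, keys, items, mask=None):
    # Hypothetical helper: pass mask only when the installed PyArrow supports it.
    if _HAS_MASK:
        return pa.MapArray.from_arrays(offsets, keys, items, mask=mask)
    # Older PyArrow: there is no way to pass a validity mask here, so nulls
    # in map columns are lost (the behavior this issue describes).
    return pa.MapArray.from_arrays(offsets, keys, items)
{code}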


> Null values in map columns of PyArrow tables are replaced with empty lists
> --------------------------------------------------------------------------
>
>                 Key: SPARK-48302
>                 URL: https://issues.apache.org/jira/browse/SPARK-48302
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 4.0.0
>            Reporter: Ian Cook
>            Priority: Major
>