[ https://issues.apache.org/jira/browse/SPARK-48302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ian Cook updated SPARK-48302:
-----------------------------
Description:

Because of a limitation in PyArrow, when PyArrow Tables are passed to {{spark.createDataFrame()}}, null values in MapArray columns are replaced with empty lists.

The PySpark function where this happens is {{pyspark.sql.pandas.types._check_arrow_array_timestamps_localize}}. Also see [https://github.com/apache/arrow/issues/41684], the skipped tests, and the TODO mentioning SPARK-48302.

A possible fix will involve passing a {{mask}} argument to {{pa.MapArray.from_arrays}}. But since older versions of PyArrow (which PySpark will still support for a while) won't have this argument, we will need a check such as:

{{LooseVersion(pa.__version__) >= LooseVersion("1X.0.0")}}

or

{{from inspect import signature}}
{{"mask" in signature(pa.MapArray.from_arrays).parameters}}

and only pass {{mask}} if that check is true.
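The signature-inspection approach above can be sketched as follows. This is a minimal illustration, not Spark's actual implementation; the commented PyArrow call is hypothetical until a PyArrow release whose {{MapArray.from_arrays}} accepts {{mask}} is available, and the helper name {{supports_kwarg}} is invented here for clarity:

```python
from inspect import signature


def supports_kwarg(func, name: str) -> bool:
    """Return True if `func` accepts a parameter called `name`."""
    try:
        return name in signature(func).parameters
    except (TypeError, ValueError):
        # Some C-implemented callables are not introspectable;
        # treat those as not supporting the argument.
        return False


# Hypothetical usage (assumes a PyArrow build where
# MapArray.from_arrays has gained a `mask` parameter):
#
# import pyarrow as pa
# kwargs = {}
# if supports_kwarg(pa.MapArray.from_arrays, "mask"):
#     kwargs["mask"] = null_mask  # null_mask: a boolean pa.Array
# result = pa.MapArray.from_arrays(offsets, keys, items, **kwargs)
```

Compared with the {{LooseVersion}} comparison, feature detection avoids hard-coding a future version string and keeps working if the argument is backported.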
> Null values in map columns of PyArrow tables are replaced with empty lists
> --------------------------------------------------------------------------
>
>                 Key: SPARK-48302
>                 URL: https://issues.apache.org/jira/browse/SPARK-48302
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 4.0.0
>            Reporter: Ian Cook
>            Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)