[ https://issues.apache.org/jira/browse/SPARK-30961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046337#comment-17046337 ]
Nicolas Renkamp commented on SPARK-30961:
-----------------------------------------

[~bryanc] Thanks for the quick reply. Actually, I was using pyarrow 0.15.1 in this example. I did not realize that Spark 2.4.x should be used with pyarrow 0.8.0, or at most 0.11.1. Thanks for the background information and the links to the other issues.

From a user's perspective it would be great if Spark 3.x were compatible with the latest pyarrow version. Are you aiming for that?

> Arrow enabled: to_pandas with date column fails
> -----------------------------------------------
>
>                 Key: SPARK-30961
>                 URL: https://issues.apache.org/jira/browse/SPARK-30961
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.4.5
>         Environment: Apache Spark 2.4.5
>            Reporter: Nicolas Renkamp
>            Priority: Major
>              Labels: ready-to-commit
>
> Hi,
> there seems to be a bug in the Arrow-enabled toPandas conversion from a Spark
> DataFrame to a pandas DataFrame when the DataFrame has a column of type
> DateType. Here is a minimal example to reproduce the issue:
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as F
>
> spark = SparkSession.builder.getOrCreate()
> is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled")
> print("Arrow optimization is enabled: " + is_arrow_enabled)
> spark_df = spark.createDataFrame(
>     [['2019-12-06']], 'created_at: string') \
>     .withColumn('created_at', F.to_date('created_at'))
>
> # works
> spark_df.toPandas()
>
> spark.conf.set("spark.sql.execution.arrow.enabled", 'true')
> is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled")
> print("Arrow optimization is enabled: " + is_arrow_enabled)
>
> # raises AttributeError: Can only use .dt accessor with datetimelike values
> # the series is still of type object, so .dt does not exist
> spark_df.toPandas(){code}
> A fix would be to modify the _check_series_convert_date function in
> pyspark.sql.types to:
> {code:java}
> def _check_series_convert_date(series, data_type):
>     """
>     Cast the series to datetime.date if it's a date type, otherwise returns
>     the original series.
>
>     :param series: pandas.Series
>     :param data_type: a Spark data type for the series
>     """
>     from pyspark.sql.utils import require_minimum_pandas_version
>     require_minimum_pandas_version()
>
>     from pandas import to_datetime
>     if type(data_type) == DateType:
>         return to_datetime(series).dt.date
>     else:
>         return series
> {code}
> Let me know if I should prepare a Pull Request for the 2.4.5 branch.
> I have not tested the behavior on the master branch.
>
> Thanks,
> Nicolas
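
For anyone hitting this before a fix lands, a possible user-side workaround (not part of the report above, and it assumes the failure is limited to DateType columns) is to cast date columns to timestamp before calling toPandas() and convert them back on the pandas side:

{code:java}
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DateType

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# Same example DataFrame as in the issue description
spark_df = spark.createDataFrame(
    [['2019-12-06']], 'created_at: string') \
    .withColumn('created_at', F.to_date('created_at'))

# Cast every DateType column to timestamp so the Arrow-enabled conversion
# never goes through the failing date code path
date_cols = [f.name for f in spark_df.schema.fields
             if isinstance(f.dataType, DateType)]
casted = spark_df
for c in date_cols:
    casted = casted.withColumn(c, F.col(c).cast('timestamp'))

pdf = casted.toPandas()

# Convert back to datetime.date on the pandas side; this should match the
# values the non-Arrow conversion returns
for c in date_cols:
    pdf[c] = pdf[c].dt.date
{code}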