[ 
https://issues.apache.org/jira/browse/SPARK-23290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16348223#comment-16348223
 ] 

Nick Pentreath commented on SPARK-23290:
----------------------------------------

cc [~bryanc]

> inadvertent change in handling of DateType when converting to pandas dataframe
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-23290
>                 URL: https://issues.apache.org/jira/browse/SPARK-23290
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.0
>            Reporter: Andre Menck
>            Priority: Major
>
> In [this PR|https://github.com/apache/spark/pull/18664/files#diff-6fc344560230bf0ef711bb9b5573f1faR1968]
> there was a change in how `DateType` is returned to users (line 1968 
> in dataframe.py). This can cause client code to fail, as in the following 
> example from a Python terminal:
> {code:python}
> >>> import datetime as dt
> >>> import pandas as pd
> >>> pdf = pd.DataFrame([['2015-01-01',1]], columns=['date', 'num'])
> >>> pdf.dtypes
> date    object
> num      int64
> dtype: object
> >>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date() )
> 0    2015-01-01
> Name: date, dtype: object
> >>> pdf = pd.DataFrame([['2015-01-01',1]], columns=['date', 'num'])
> >>> pdf.dtypes
> date    object
> num      int64
> dtype: object
> >>> pdf['date'] = pd.to_datetime(pdf['date'])
> >>> pdf.dtypes
> date    datetime64[ns]
> num              int64
> dtype: object
> >>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date() )
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/amenck/anaconda2/lib/python2.7/site-packages/pandas/core/series.py", line 2355, in apply
>     mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/_libs/src/inference.pyx", line 1574, in pandas._libs.lib.map_infer
>   File "<stdin>", line 1, in <lambda>
> TypeError: strptime() argument 1 must be string, not Timestamp
> >>> 
> {code}
> The example above shows both the old behavior (an "object" column is 
> returned, so the lambda succeeds) and the new behavior (a datetime64[ns] 
> column is returned, so the same lambda raises a TypeError). Since there may 
> be user code relying on the old behavior, I'd suggest reverting this specific 
> part of the change. Also note that the NOTE in the docstring for 
> `_to_corrected_pandas_type` seems to be off: it refers to the old behavior, 
> not the current one.
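
For client code that has to cope with both behaviors described in the issue, one possible workaround is to normalize the column before using it. The following is a minimal sketch in plain pandas, not a proposed fix for PySpark; the helper name `to_date_objects` is hypothetical and is not part of PySpark or pandas. It assumes the column holds either ISO-formatted date strings or datetime64[ns] values, as in the example from the issue.

```python
import datetime as dt

import pandas as pd


def to_date_objects(series):
    """Return a Series of datetime.date values.

    Hypothetical helper (an assumption for illustration, not a PySpark or
    pandas API). pd.to_datetime accepts both a column of date strings and
    a column that is already datetime64[ns], so the same call covers the
    old and the new return type.
    """
    return pd.to_datetime(series).dt.date


# Old-style column: plain strings, dtype "object".
old_style = pd.Series(["2015-01-01"], name="date")

# New-style column: dtype datetime64[ns].
new_style = pd.to_datetime(old_style)

old_dates = to_date_objects(old_style)
new_dates = to_date_objects(new_style)
```

Both `old_dates` and `new_dates` then hold `datetime.date` values, so downstream code no longer needs to call `strptime` on a value that may already be a `Timestamp`.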


