[ https://issues.apache.org/jira/browse/SPARK-23290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16349796#comment-16349796 ]
Takuya Ueshin commented on SPARK-23290:
---------------------------------------

Thanks for the report! I'm afraid I couldn't figure out what's going on, because there is something wrong with your example. In your first example, the dtype of {{pdf['date']}} is {{object}}, but the actual element type is {{str}}:

{code:python}
>>> pdf = pd.DataFrame([['2015-01-01', 1]], columns=['date', 'num'])
>>> pdf.dtypes
date    object
num      int64
dtype: object
>>> type(pdf['date'][0])
<type 'str'>
{code}

So the lambda works, because the function inside it expects a string:

{code:python}
>>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date())
0    2015-01-01
Name: date, dtype: object
{code}

Spark, on the other hand, returns {{datetime.date}} in 2.2 and {{pd.Timestamp}} in 2.3:

{code:python}
>>> df = spark.createDataFrame([('2015-01-01', 1)], ['date', 'num']).selectExpr("cast(date as date)", "num")
>>> df.printSchema()
root
 |-- date: date (nullable = true)
 |-- num: long (nullable = true)
>>> df.show()
+----------+---+
|      date|num|
+----------+---+
|2015-01-01|  1|
+----------+---+
{code}

in 2.2:

{code:python}
>>> pdf = df.toPandas()
>>> pdf.dtypes
date    object
num      int64
dtype: object
>>> type(pdf['date'][0])
<type 'datetime.date'>
{code}

in 2.3:

{code:python}
>>> pdf = df.toPandas()
>>> pdf.dtypes
date    datetime64[ns]
num              int64
dtype: object
>>> type(pdf['date'][0])
<class 'pandas._libs.tslib.Timestamp'>
{code}

In either case, the {{strptime}} lambda shouldn't work, since it is never handed a string. Could you provide another example that illustrates the problem? IIUC, {{datetime.date}} and {{pd.Timestamp}} are more or less compatible, so we can handle them in the same way.
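For example, a minimal sketch of handling them uniformly (the helper {{normalize_date}} is just for illustration, not anything in Spark; it assumes the {{pdf}} from the examples above):

{code:python}
import datetime as dt

import pandas as pd

def normalize_date(d):
    """Normalize a cell to datetime.date regardless of what toPandas() produced."""
    if isinstance(d, str):
        # raw string, as in the reporter's first example
        return dt.datetime.strptime(d, '%Y-%m-%d').date()
    if isinstance(d, pd.Timestamp):
        # 2.3-style datetime64[ns] column
        return d.date()
    # 2.2-style object column already holding datetime.date
    return d

pdf['date'] = pdf['date'].apply(normalize_date)
{code}

Since {{pd.Timestamp}} is a subclass of {{datetime.datetime}}, which in turn subclasses {{datetime.date}}, checking {{Timestamp}} before falling through to plain {{datetime.date}} keeps the three cases unambiguous.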
cc: [~bryanc] Thanks!

> inadvertent change in handling of DateType when converting to pandas dataframe
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-23290
>                 URL: https://issues.apache.org/jira/browse/SPARK-23290
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.0
>            Reporter: Andre Menck
>            Priority: Blocker
>
> In [this PR|https://github.com/apache/spark/pull/18664/files#diff-6fc344560230bf0ef711bb9b5573f1faR1968] there was a change in how `DateType` is returned to users (line 1968 in dataframe.py). This can break client code, as in the following example from a Python terminal:
> {code:python}
> >>> pdf = pd.DataFrame([['2015-01-01', 1]], columns=['date', 'num'])
> >>> pdf.dtypes
> date    object
> num      int64
> dtype: object
> >>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date())
> 0    2015-01-01
> Name: date, dtype: object
> >>> pdf = pd.DataFrame([['2015-01-01', 1]], columns=['date', 'num'])
> >>> pdf.dtypes
> date    object
> num      int64
> dtype: object
> >>> pdf['date'] = pd.to_datetime(pdf['date'])
> >>> pdf.dtypes
> date    datetime64[ns]
> num              int64
> dtype: object
> >>> pdf['date'].apply(lambda d: dt.datetime.strptime(d, '%Y-%m-%d').date())
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/amenck/anaconda2/lib/python2.7/site-packages/pandas/core/series.py", line 2355, in apply
>     mapped = lib.map_infer(values, f, convert=convert_dtype)
>   File "pandas/_libs/src/inference.pyx", line 1574, in pandas._libs.lib.map_infer
>   File "<stdin>", line 1, in <lambda>
> TypeError: strptime() argument 1 must be string, not Timestamp
> {code}
> The example above shows both the old behavior (returning an "object" column) and the new behavior (returning a datetime column). Since there may be user code relying on the old behavior, I'd suggest reverting this specific part of the change. Also note that the NOTE in the docstring of "_to_corrected_pandas_type" seems to be off, referring to the old behavior rather than the current one.
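For reference, a minimal workaround sketch for user code hitting this on 2.3 (not from the ticket; it assumes the {{df}} from the examples above): {{Series.dt.date}} converts the {{datetime64[ns]}} column back to a 2.2-style {{object}} column of {{datetime.date}}:

{code:python}
# Workaround sketch for Spark 2.3: df.toPandas() yields a datetime64[ns]
# column, so convert it back to plain datetime.date objects before handing
# it to code that expects the old behavior.
pdf = df.toPandas()
pdf['date'] = pdf['date'].dt.date
{code}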