Matthias Roels created SPARK-44946:
--------------------------------------
Summary: toPandas() gives FutureWarning when containing columns of
datatype timestamp
Key: SPARK-44946
URL: https://issues.apache.org/jira/browse/SPARK-44946
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 3.4.1
Reporter: Matthias Roels
When converting a Spark DataFrame into a pandas DataFrame, we get a
FutureWarning when the DataFrame contains columns of type {{timestamp}}.
Reproducible example (that you can run locally):
{code:python}
from datetime import datetime

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Build a pandas DataFrame with a timestamp column and round-trip it through Spark.
df = pd.DataFrame({"foo": [datetime(2023, 1, 1), datetime(2023, 1, 1)]})
df_sp = spark.createDataFrame(df)
test = df_sp.toPandas()  # this call emits the warning

# warning logs:
# /usr/local/lib/python3.10/site-packages/pyspark/sql/pandas/conversion.py:251:
# FutureWarning: Passing unit-less datetime64 dtype to .astype is deprecated and
# will raise in a future version. Pass 'datetime64[ns]' instead
{code}
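As a stopgap on pandas 1.5.x, the specific warning quoted above can be filtered out around the call while all other warnings stay visible. The following is only a sketch: the helper name call_quietly is hypothetical (it is not part of PySpark or pandas), and it hides the message without changing what conversion.py does, so it would not help once the deprecated call actually raises.

```python
import warnings

# Start of the message quoted in the log above; filterwarnings() treats it as
# a regular expression that must match the beginning of the warning text.
UNITLESS_DATETIME_MSG = "Passing unit-less datetime64 dtype"


def call_quietly(func, *args, **kwargs):
    """Call func, ignoring only the unit-less datetime64 FutureWarning.

    Hypothetical helper for illustration. Filters are restored when the
    context manager exits, so warnings elsewhere are unaffected.
    """
    with warnings.catch_warnings():
        warnings.filterwarnings(
            "ignore",
            message=UNITLESS_DATETIME_MSG,
            category=FutureWarning,
        )
        return func(*args, **kwargs)
```

With the names from the example above, the call would be test = call_quietly(df_sp.toPandas).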
Note that if we enable Arrow (by setting
{{config("spark.sql.execution.arrow.pyspark.enabled", "true")}}), this
warning is gone! I did see it pop up once even so, but I could not turn
that into a reproducible example.
Since the warning says this call will raise in a future version, this means
that I cannot use Spark with pandas 2.0 without Arrow enabled.
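For reference, a minimal sketch of how the Arrow setting mentioned above could be applied when the session is built. The helper names arrow_conf and build_session are hypothetical; only the configuration key comes from this report. Arrow-based conversion also requires pyarrow to be installed.

```python
# Configuration key quoted in the report.
ARROW_CONF_KEY = "spark.sql.execution.arrow.pyspark.enabled"


def arrow_conf(enabled=True):
    """Return the (key, value) pair to pass to SparkSession.builder.config()."""
    return ARROW_CONF_KEY, "true" if enabled else "false"


def build_session(arrow_enabled=True):
    """Create (or reuse) a SparkSession with the Arrow setting applied."""
    # Imported lazily so arrow_conf() can be used without PySpark installed.
    from pyspark.sql import SparkSession

    key, value = arrow_conf(arrow_enabled)
    return SparkSession.builder.config(key, value).getOrCreate()
```

On a session that already exists, the same key can be set at runtime with spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true").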
For my test, I ran it in a Docker container:
* Python: 3.10 (base image python:3.10-slim-bullseye)
* Java: openjdk-17-jre-headless
* Spark: 3.4.1
* pandas: 1.5.3