Stefaan Lippens created SPARK-40934:
---------------------------------------

             Summary: pyspark.pandas.read_csv parses dates, but docs state otherwise
                 Key: SPARK-40934
                 URL: https://issues.apache.org/jira/browse/SPARK-40934
             Project: Spark
          Issue Type: Bug
          Components: Pandas API on Spark
    Affects Versions: 3.3.1
            Reporter: Stefaan Lippens


from [https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.read_csv.html] :
{quote}parse_dates: boolean or list of ints or names or list of lists or dict, default False.
Currently only False is allowed.
{quote}
This documentation suggests that dates are never parsed, but apparently they are always parsed (and it can not be disabled):
{code:python}
import pyspark.pandas
df = pyspark.pandas.read_csv("data.csv", parse_dates=False)
print(df)
print(df.dtypes)
{code}
with this data
{code:java}
date,feature_index,band_0,band_1,band_2
2021-01-05T01:00:00.000+01:00,2,5.0,4.5,3.75
2021-01-05T01:00:00.000+01:00,0,5.0,1.0,2.25
2021-01-05T01:00:00.000+01:00,1,5.0,3.5,4.0
2021-01-15T01:00:00.000+01:00,2,15.0,4.5,3.75
2021-01-15T01:00:00.000+01:00,0,15.0,1.0,2.25
{code}
gives
{code:java}
                 date  feature_index  band_0  band_1  band_2
0 2021-01-05 01:00:00              2     5.0     4.5    3.75
1 2021-01-05 01:00:00              0     5.0     1.0    2.25
2 2021-01-05 01:00:00              1     5.0     3.5    4.00
3 2021-01-15 01:00:00              2    15.0     4.5    3.75
4 2021-01-15 01:00:00              0    15.0     1.0    2.25
date             datetime64[ns]
feature_index             int32
band_0                  float64
band_1                  float64
band_2                  float64
dtype: object
{code}
Notice how the dates are parsed (e.g. dtype {{datetime64[ns]}} for {{date}})
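
For convenience, the reproduction above could be made self-contained along the following lines. This is only a sketch: the helper names ({{write_sample_csv}}, {{raw_date_values}}, {{pandas_on_spark_dtypes}}) are hypothetical and not part of PySpark, and the last step assumes a working local PySpark 3.3.x installation. The first two helpers use only the standard library and show what the {{date}} column looks like when it is left unparsed.

```python
import csv
import os
import tempfile

# Sample rows copied from the report above.
SAMPLE_CSV = """\
date,feature_index,band_0,band_1,band_2
2021-01-05T01:00:00.000+01:00,2,5.0,4.5,3.75
2021-01-05T01:00:00.000+01:00,0,5.0,1.0,2.25
2021-01-05T01:00:00.000+01:00,1,5.0,3.5,4.0
2021-01-15T01:00:00.000+01:00,2,15.0,4.5,3.75
2021-01-15T01:00:00.000+01:00,0,15.0,1.0,2.25
"""


def write_sample_csv(directory):
    """Write the sample data to <directory>/data.csv and return the path."""
    path = os.path.join(directory, "data.csv")
    with open(path, "w", newline="") as f:
        f.write(SAMPLE_CSV)
    return path


def raw_date_values(path):
    """Return the 'date' column as plain strings, without any date parsing."""
    with open(path, newline="") as f:
        return [row["date"] for row in csv.DictReader(f)]


def pandas_on_spark_dtypes(path):
    """Return the dtypes seen by pyspark.pandas.read_csv (requires PySpark)."""
    # Imported lazily so the helpers above also work without Spark installed.
    import pyspark.pandas

    return pyspark.pandas.read_csv(path, parse_dates=False).dtypes


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        csv_path = write_sample_csv(tmp)
        # The unparsed value, exactly as stored in the file.
        print(raw_date_values(csv_path)[0])
        # According to the report above, 'date' shows up as datetime64[ns] here.
        print(pandas_on_spark_dtypes(csv_path))
```

If {{parse_dates=False}} behaved the way the documentation reads, one would expect the {{date}} column to keep string values like the ones {{raw_date_values}} returns.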