Nick Lothian created SPARK-20878:
------------------------------------

             Summary: Pyspark date string parsing erroneously treats 1 as 10
                 Key: SPARK-20878
                 URL: https://issues.apache.org/jira/browse/SPARK-20878
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.0.2
            Reporter: Nick Lothian

PySpark date filter columns can take a string in the format yyyy-mm-dd and handle it correctly. This doesn't appear to be documented anywhere (?) but is extremely useful.

However, it silently converts the format yyyy-mm-d to yyyy-mm-d0 and yyyy-m-dd to yyyy-m0-dd.

For example, 2017-02-1 will be treated as 2017-02-10, and 2017-2-01 as 2017-20-01 (which is invalid, but does not throw an error).

This causes bugs that are very hard to discover.

Test code:

{code}
from datetime import datetime

from pyspark.sql.types import StructType, StructField, StringType, DateType

# sqlContext is the one provided by the pyspark shell.
schema = StructType([StructField("label", StringType(), True),
                     StructField("date", DateType(), True)])

data = [('One', datetime.strptime("2017/02/01", '%Y/%m/%d')),
        ('Two', datetime.strptime("2017/02/02", '%Y/%m/%d')),
        ('Ten', datetime.strptime("2017/02/10", '%Y/%m/%d')),
        ('Eleven', datetime.strptime("2017/02/11", '%Y/%m/%d'))]

df = sqlContext.createDataFrame(data, schema)
df.printSchema()

print("All Data")
df.show()

# Day without zero padding
print("Filter greater than 1 Feb (using 2017-02-1)")
df.filter(df.date > '2017-02-1').show()

# Fully zero-padded date
print("Filter greater than 1 Feb (using 2017-02-01)")
df.filter(df.date > '2017-02-01').show()

# Month without zero padding
print("Filter greater than 1 Feb (using 2017-2-01)")
df.filter(df.date > '2017-2-01').show()
{code}

Output:

{code}
root
 |-- label: string (nullable = true)
 |-- date: date (nullable = true)

All Data
+------+----------+
| label|      date|
+------+----------+
|   One|2017-02-01|
|   Two|2017-02-02|
|   Ten|2017-02-10|
|Eleven|2017-02-11|
+------+----------+

Filter greater than 1 Feb (using 2017-02-1)
+------+----------+
| label|      date|
+------+----------+
|   Ten|2017-02-10|
|Eleven|2017-02-11|
+------+----------+

Filter greater than 1 Feb (using 2017-02-01)
+------+----------+
| label|      date|
+------+----------+
|   Two|2017-02-02|
|   Ten|2017-02-10|
|Eleven|2017-02-11|
+------+----------+

Filter greater than 1 Feb (using 2017-2-01)
+-----+----+
|label|date|
+-----+----+
+-----+----+
{code}
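Until the filter handles unpadded strings, one possible workaround is to normalize the string on the Python side before passing it to the filter. The following is only a minimal sketch: normalize_date_literal is a hypothetical helper, not part of PySpark, and it assumes the input is meant as year-month-day. It parses the string strictly with the standard library, so an impossible date such as 2017-20-01 raises ValueError instead of being accepted silently.

```python
from datetime import datetime


def normalize_date_literal(text):
    """Parse a year-month-day string strictly and return it zero-padded.

    Hypothetical helper for illustration, not a PySpark API.
    Raises ValueError for impossible dates such as month 20.
    """
    parsed = datetime.strptime(text.strip(), "%Y-%m-%d").date()
    return parsed.isoformat()  # always in yyyy-mm-dd form


# The unpadded inputs from the report come back in padded form.
print(normalize_date_literal("2017-02-1"))
print(normalize_date_literal("2017-2-01"))

# Assumed usage with the DataFrame from the test code (needs a Spark session):
# df.filter(df.date > normalize_date_literal("2017-02-1")).show()
```

With the padded string, the filter takes the same path as the 2017-02-01 case in the test code, which is the one that returned the expected rows.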