[ https://issues.apache.org/jira/browse/SPARK-25517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun resolved SPARK-25517.
-----------------------------------
    Resolution: Duplicate

According to the comments on the PR, I'll close this as `Duplicate` for now.

> Spark DataFrame option inferSchema="true", dataFormat=MM/dd/yyyy, fails to
> detect date type from the csv file while reading
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-25517
>                 URL: https://issues.apache.org/jira/browse/SPARK-25517
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0, 2.3.1
>         Environment: Spark 2.3.0
>            Reporter: Manoranjan Kumar
>            Priority: Major
>              Labels: easyfix
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> spark.read.format("csv").option("inferSchema", true).option("dateFormat",
> "MM/dd/yyyy") fails to infer the date type when reading a CSV file that
> has a date column in the specified format (MM/dd/yyyy).
>
> For example, an employee CSV file (employee.csv) has the following two
> sample dummy records (with header):
>
> emp_id,emp_name,joining_date,emp_age,emp_in_time,emp_salary
> 100,Bradd Pitt,09/25/2018,26,09/25/2018 10:12:36,10000.00
> 101,Angel Joli,08/20/2018,28,08/20/2018 11:32:58,12000.00
>
> When I read the above CSV file as a DataFrame like below:
>
> val empDF = spark.read.format("csv")
>   .option("header", true)
>   .option("inferSchema", true)
>   .option("dateFormat", "MM/dd/yyyy")
>   .option("timestampFormat", "MM/dd/yyyy HH:mm:ss")
>   .load("employee.csv")
> empDF.printSchema()
>
> the output is:
>
> root
>  |-- emp_id: integer (nullable = true)
>  |-- emp_name: string (nullable = true)
>  |-- joining_date: string (nullable = true)
>  |-- emp_age: integer (nullable = true)
>  |-- emp_in_time: timestamp (nullable = true)
>  |-- emp_salary: double (nullable = true)
>
> Note the data types Spark inferred for joining_date and emp_in_time:
> joining_date is not detected as a date and remains a string, whereas
> emp_in_time is correctly inferred as a timestamp.
>
> I struggled with this issue for a full day. When I dug into the Spark
> source code, I found that schema inference has no implementation for the
> date type, whereas the implementation for the timestamp type is present.
>
> This is my first time reporting here; please get back to me if you need
> further information or a live example with running code.
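
For readers who hit the behaviour described in the report, one possible workaround is to let inference leave joining_date as a string and convert it explicitly afterwards. The following is a minimal sketch, not a fix from the Spark project: the object name, the local master setting, and the file path are assumptions for illustration, and the header option is assumed because the sample file has a header row. It relies on `to_date(Column, String)`, which is available in the affected 2.3.x versions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_date}

// Hypothetical driver object; the name and the local master are illustrative only.
object JoiningDateWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("joining-date-workaround")
      .master("local[*]")
      .getOrCreate()

    // Read the sample file from the report. The header option is an
    // assumption based on the file having a header row.
    val empDF = spark.read
      .format("csv")
      .option("header", true)
      .option("inferSchema", true)
      .option("timestampFormat", "MM/dd/yyyy HH:mm:ss")
      .load("employee.csv")

    // joining_date is read as a string, so parse it with the same
    // pattern that was passed as dateFormat in the report.
    val typedDF = empDF.withColumn(
      "joining_date",
      to_date(col("joining_date"), "MM/dd/yyyy")
    )

    typedDF.printSchema()
    spark.stop()
  }
}
```

An alternative under the same assumptions is to skip inference and pass an explicit schema with `DateType` for joining_date through `.schema(...)`, keeping the dateFormat option so the reader knows how to parse the column.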