[ https://issues.apache.org/jira/browse/SPARK-13309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Reynold Xin resolved SPARK-13309.
---------------------------------
    Resolution: Fixed
      Assignee: Rahul Tanwani
 Fix Version/s: 2.0.0

> Incorrect type inference for CSV data.
> --------------------------------------
>
>                 Key: SPARK-13309
>                 URL: https://issues.apache.org/jira/browse/SPARK-13309
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Rahul Tanwani
>            Assignee: Rahul Tanwani
>            Priority: Minor
>             Fix For: 2.0.0
>
>
> Type inference for CSV data does not work as expected when the data is sparse.
> For instance, consider the following dataset and the schema inferred for it:
> {code}
> A,B,C,D
> 1,,,
> ,1,,
> ,,1,
> ,,,1
> {code}
> {code}
> root
> |-- A: integer (nullable = true)
> |-- B: integer (nullable = true)
> |-- C: string (nullable = true)
> |-- D: string (nullable = true)
> {code}
> All four fields should have been inferred as integer types, but the inferred schema clearly differs.
> Another dataset:
> {code}
> A,B,C,D
> 1,,1,
> {code}
> and the inferred schema:
> {code}
> root
> |-- A: string (nullable = true)
> |-- B: string (nullable = true)
> |-- C: string (nullable = true)
> |-- D: string (nullable = true)
> {code}
> Here, fields A and C should be inferred as integer types.
> The same issue has been discussed for the spark-csv package; see
> https://github.com/databricks/spark-csv/issues/216 for reference.
> The issue was fixed with
> https://github.com/databricks/spark-csv/commit/8704b26030da88ac6e18b955a81d5c22ca3b480d.
> I will try to submit a PR with the patch soon.
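For reference, a minimal repro sketch against the built-in CSV reader in Spark 2.0 (the fix version above). The SparkSession setup and the file path /tmp/sparse.csv are illustrative assumptions, not taken from the report; with the fix in place, all four columns of the sparse dataset above should be inferred as integers.

{code}
// Minimal repro sketch for SPARK-13309 (assumes Spark 2.0+ and a local file
// /tmp/sparse.csv containing the sparse dataset from the report).
import org.apache.spark.sql.SparkSession

object Spark13309Repro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SPARK-13309 repro")
      .getOrCreate()

    // /tmp/sparse.csv:
    // A,B,C,D
    // 1,,,
    // ,1,,
    // ,,1,
    // ,,,1
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/tmp/sparse.csv")

    // With the fix, all four columns should be inferred as integer;
    // on an affected build, the mixed integer/string schema from the
    // report is expected instead.
    df.printSchema()

    spark.stop()
  }
}
{code}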