[ https://issues.apache.org/jira/browse/SPARK-26259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-26259.
----------------------------------
    Resolution: Duplicate

> RecordSeparator other than newline discovers incorrect schema
> -------------------------------------------------------------
>
>                 Key: SPARK-26259
>                 URL: https://issues.apache.org/jira/browse/SPARK-26259
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: PoojaMurarka
>            Priority: Major
>
> JIRA https://issues.apache.org/jira/browse/SPARK-21289, fixed in Spark 2.3, allows record separators other than newline, but the fix does not work when the schema is not specified, i.e. when the schema is inferred.
> Let me try to explain this using the data and scenarios below.
> Input data (input_data.csv), *+where the record separator is "\t"+*:
> {noformat}
> "dteday","hr","holiday","weekday","workingday","weathersit","temp","atemp","hum","windspeed"
> "2012-01-01","0","0","0","0","1","9","9.1","66","0"
> "2012-01-01","1","0","0","0","1","9","7.2","66","9"{noformat}
> *Case 1: Schema defined:* the Spark code below, with an explicit *schema*, reads the data correctly:
> {code:java}
> StructType customSchema = new StructType()
>     .add("dteday", DataTypes.DateType, true)
>     .add("hr", DataTypes.IntegerType, true)
>     .add("holiday", DataTypes.IntegerType, true)
>     .add("weekday", DataTypes.IntegerType, true)
>     .add("workingday", DataTypes.DateType, true)
>     .add("weathersit", DataTypes.IntegerType, true)
>     .add("temp", DataTypes.IntegerType, true)
>     .add("atemp", DataTypes.DoubleType, true)
>     .add("hum", DataTypes.IntegerType, true)
>     .add("windspeed", DataTypes.IntegerType, true);
>
> Dataset<Row> ds = executionContext.getSparkSession().read().format("csv")
>     .option("header", true)
>     .schema(customSchema)
>     .option("sep", ",")
>     .load("input_data.csv");
> {code}
> *Case 2: Schema not defined (inferSchema used):* the data is parsed incorrectly, i.e. the entire content is read as column names.
> {code:java}
> Dataset<Row> ds = executionContext.getSparkSession().read().format("csv")
>     .option("header", true)
>     .option("inferSchema", true)
>     .option("sep", ",")
>     .load("input_data.csv");
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
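The failure mode described in the report can be sketched without Spark. The following minimal, hypothetical Java example (class name and abridged data are illustrative, not taken from Spark's implementation) shows why a newline-oriented reader mishandles a file whose records are separated by "\t": the entire file looks like a single line, so with header=true the header row swallows every record, which is consistent with "entire data is read as column names".

```java
// Hypothetical, Spark-free illustration of the reported behavior.
public class RecordSeparatorDemo {
    public static void main(String[] args) {
        // Three records (header + two data rows, abridged from the report),
        // joined by the "\t" record separator instead of newlines.
        String fileContent =
            "\"dteday\",\"hr\",\"holiday\"" + "\t" +
            "\"2012-01-01\",\"0\",\"0\"" + "\t" +
            "\"2012-01-01\",\"1\",\"0\"";

        // A reader that assumes "\n" record separators sees one giant record,
        // so header parsing consumes the whole file.
        String[] byNewline = fileContent.split("\n");
        System.out.println(byNewline.length); // 1

        // Splitting on the actual record separator recovers the structure.
        String[] byTab = fileContent.split("\t");
        System.out.println(byTab.length); // 3: header + 2 data rows
    }
}
```

This also suggests why an explicit schema sidesteps the problem: with the schema supplied, nothing needs to be inferred from a header row that was never correctly delimited in the first place.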