PoojaMurarka created SPARK-26259:
------------------------------------

             Summary: RecordSeparator other than newline discovers incorrect schema
                 Key: SPARK-26259
                 URL: https://issues.apache.org/jira/browse/SPARK-26259
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.4.0
            Reporter: PoojaMurarka
             Fix For: 2.4.1
JIRA https://issues.apache.org/jira/browse/SPARK-21289, fixed in Spark 2.3, allows record separators other than newline, but this does not work when the schema is not specified, i.e. while inferring the schema. The scenarios below illustrate the problem.

Input data (input_data.csv) as shown below, *+where the recordSeparator is "\t"+*:
{noformat}
"dteday","hr","holiday","weekday","workingday","weathersit","temp","atemp","hum","windspeed"
"2012-01-01","0","0","0","0","1","9","9.1","66","0"
"2012-01-01","1","0","0","0","1","9","7.2","66","9"{noformat}

*Case 1: Schema defined:* The Spark code below with an explicitly defined *schema* reads the data correctly (note that a schema must be passed via {{schema(...)}}; {{option(...)}} only accepts string-like values):
{code:java}
val customSchema = StructType(Array(
  StructField("dteday", DateType, true),
  StructField("hr", IntegerType, true),
  StructField("holiday", IntegerType, true),
  StructField("weekday", IntegerType, true),
  StructField("workingday", DateType, true),
  StructField("weathersit", IntegerType, true),
  StructField("temp", IntegerType, true),
  StructField("atemp", DoubleType, true),
  StructField("hum", IntegerType, true),
  StructField("windspeed", IntegerType, true)))

Dataset<Row> ds = executionContext.getSparkSession().read().format("csv")
    .option("header", true)
    .schema(customSchema)
    .option("sep", ",")
    .load("input_data.csv");
{code}

*Case 2: Schema not defined (inferSchema is used):* The data is parsed incorrectly, i.e. the entire file content is read as column names:
{code:java}
Dataset<Row> ds = executionContext.getSparkSession().read().format("csv")
    .option("header", true)
    .option("inferSchema", true)
    .option("sep", ",")
    .load("input_data.csv");
{code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
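The failure mode in Case 2 can be sketched independently of Spark. A minimal Java illustration (the file content is a hypothetical two-column abbreviation of input_data.csv, not the Spark code path itself): a reader that assumes "\n" as the record separator sees the whole file as a single record, so header handling consumes everything, which matches the symptom of all data being read as column names.

```java
import java.util.Arrays;
import java.util.List;

public class RecordSeparatorDemo {
    public static void main(String[] args) {
        // Abbreviated stand-in for input_data.csv: records separated by "\t",
        // columns separated by ",".
        String fileContent =
            "\"dteday\",\"hr\"\t" +
            "\"2012-01-01\",\"0\"\t" +
            "\"2012-01-01\",\"1\"";

        // Splitting on the assumed "\n" separator yields a single record,
        // so the "header" row swallows the entire file.
        List<String> assumedNewline = Arrays.asList(fileContent.split("\n"));
        System.out.println("records assuming \\n: " + assumedNewline.size()); // 1

        // Splitting on the actual "\t" separator recovers one header row
        // plus two data rows.
        List<String> actualTab = Arrays.asList(fileContent.split("\t"));
        System.out.println("records using \\t: " + actualTab.size()); // 3
    }
}
```

This is only a sketch of the separator mismatch, not of Spark's CSV reader internals; the point is that schema inference cannot see any data rows if record splitting never happens on the actual separator.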