PoojaMurarka created SPARK-26259:
------------------------------------

             Summary: RecordSeparator other than newline discovers incorrect 
schema
                 Key: SPARK-26259
                 URL: https://issues.apache.org/jira/browse/SPARK-26259
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.4.0
            Reporter: PoojaMurarka
             Fix For: 2.4.1


SPARK-21289 (https://issues.apache.org/jira/browse/SPARK-21289), which allows 
record separators other than newline, has been fixed in Spark 2.3. However, it 
does not work when the schema is not specified, i.e. when the schema is inferred.

The data and scenarios below illustrate the problem.

Input data (input_data.csv), *+where the record separator is "\t"+*:
{noformat}
"dteday","hr","holiday","weekday","workingday","weathersit","temp","atemp","hum","windspeed"
    "2012-01-01","0","0","0","0","1","9","9.1","66","0"    
"2012-01-01","1","0","0","0","1","9","7.2","66","9"{noformat}
*Case 1: Schema defined:* The Spark code below, with a defined *schema*, reads 
the data correctly:
{code:java}
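import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Explicit schema for the ten columns in input_data.csv.
StructType customSchema = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField("dteday", DataTypes.DateType, true),
        DataTypes.createStructField("hr", DataTypes.IntegerType, true),
        DataTypes.createStructField("holiday", DataTypes.IntegerType, true),
        DataTypes.createStructField("weekday", DataTypes.IntegerType, true),
        DataTypes.createStructField("workingday", DataTypes.DateType, true),
        DataTypes.createStructField("weathersit", DataTypes.IntegerType, true),
        DataTypes.createStructField("temp", DataTypes.IntegerType, true),
        DataTypes.createStructField("atemp", DataTypes.DoubleType, true),
        DataTypes.createStructField("hum", DataTypes.IntegerType, true),
        DataTypes.createStructField("windspeed", DataTypes.IntegerType, true)
});

// The schema is passed through schema(...), not as a reader option.
Dataset<Row> ds = executionContext.getSparkSession().read().format( "csv" )
          .option( "header", true )
          .schema( customSchema )
          .option( "sep", "," )
          .load( "input_data.csv" );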
{code}
*Case 2: Schema not defined (inferSchema is used):* The data is parsed 
incorrectly, i.e. the entire data is read as column names.
{code:java}
Dataset<Row> ds = executionContext.getSparkSession().read().format( "csv" )
          .option( "header", true )
          .option( "inferSchema", true)
          .option( "sep", "," )
          .load( "input_data.csv" );
{code}
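
The symptom in Case 2 can be pictured with a small, Spark-free sketch. It is only an illustration and not Spark's actual CSV parser: the class RecordSeparatorSketch and the helper splitRecords are hypothetical names, the sample is shortened to three columns, and it assumes the file contains no newline at all, only "\t" between records. Under those assumptions, a reader that splits records on newline sees the whole file as a single record, which a header-aware reader would then treat as the column names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class RecordSeparatorSketch {

    // Hypothetical helper: splits raw file content into records on a literal
    // separator and drops empty records.
    public static List<String> splitRecords(String content, String recordSeparator) {
        List<String> records = new ArrayList<>();
        for (String record : content.split(Pattern.quote(recordSeparator))) {
            if (!record.isEmpty()) {
                records.add(record);
            }
        }
        return records;
    }

    public static void main(String[] args) {
        // Shortened version of the sample data. Records are joined by "\t" and
        // the content holds no newline (an assumption about the input file).
        String content = String.join("\t",
                "\"dteday\",\"hr\",\"holiday\"",
                "\"2012-01-01\",\"0\",\"0\"",
                "\"2012-01-01\",\"1\",\"0\"");

        // Splitting on the real separator yields the header and two data records.
        List<String> byTab = splitRecords(content, "\t");

        // Splitting on newline leaves a single record holding every value.
        List<String> byNewline = splitRecords(content, "\n");

        System.out.println("records when split on \\t: " + byTab.size());
        System.out.println("records when split on \\n: " + byNewline.size());
        System.out.println("fields in the first record when split on \\n: "
                + byNewline.get(0).split(",").length);
    }
}
```

If the first record is taken as the header, the newline-based split presents all values of the file as header fields, which matches the behaviour described for Case 2.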



