[ https://issues.apache.org/jira/browse/SPARK-26259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16708161#comment-16708161 ]
PoojaMurarka edited comment on SPARK-26259 at 12/4/18 4:14 AM:
---------------------------------------------------------------

Based on the examples, the fix for custom record delimiters appears to work only when a schema is specified explicitly; please correct me if I am wrong. What I am looking for is setting a custom record delimiter while *discovering* the schema, i.e. using only *inferSchema* set to true rather than supplying a schema. Could you let me know whether this issue covers both scenarios?

> RecordSeparator other than newline discovers incorrect schema
> -------------------------------------------------------------
>
>                 Key: SPARK-26259
>                 URL: https://issues.apache.org/jira/browse/SPARK-26259
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: PoojaMurarka
>            Priority: Major
>
> Although SPARK-21289 (https://issues.apache.org/jira/browse/SPARK-21289), fixed in Spark 2.3, allows record separators other than newline, this does not work when the schema is not specified, i.e. while inferring the schema.
> Let me illustrate this with the data and scenarios below.
> Input data (input_data.csv), *+where the record separator is "\t"+*:
> {noformat}
> "dteday","hr","holiday","weekday","workingday","weathersit","temp","atemp","hum","windspeed"
> "2012-01-01","0","0","0","0","1","9","9.1","66","0"
> "2012-01-01","1","0","0","0","1","9","7.2","66","9"
> {noformat}
> *Case 1: Schema defined.* The Spark code below, with an explicit *schema*, reads the data correctly:
> {code:java}
> StructType customSchema = DataTypes.createStructType(Arrays.asList(
>     DataTypes.createStructField("dteday", DataTypes.DateType, true),
>     DataTypes.createStructField("hr", DataTypes.IntegerType, true),
>     DataTypes.createStructField("holiday", DataTypes.IntegerType, true),
>     DataTypes.createStructField("weekday", DataTypes.IntegerType, true),
>     DataTypes.createStructField("workingday", DataTypes.IntegerType, true),
>     DataTypes.createStructField("weathersit", DataTypes.IntegerType, true),
>     DataTypes.createStructField("temp", DataTypes.IntegerType, true),
>     DataTypes.createStructField("atemp", DataTypes.DoubleType, true),
>     DataTypes.createStructField("hum", DataTypes.IntegerType, true),
>     DataTypes.createStructField("windspeed", DataTypes.IntegerType, true)));
>
> Dataset<Row> ds = executionContext.getSparkSession().read().format("csv")
>     .option("header", true)
>     .schema(customSchema)
>     .option("sep", ",")
>     .load("input_data.csv");
> {code}
> *Case 2: Schema not defined (inferSchema used).* The data is parsed incorrectly: the entire file is read as column names.
> {code:java}
> Dataset<Row> ds = executionContext.getSparkSession().read().format("csv")
>     .option("header", true)
>     .option("inferSchema", true)
>     .option("sep", ",")
>     .load("input_data.csv");
> {code}
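The failure mode described in Case 2 can be seen without Spark at all: if a reader splits records on newline while the file's records are actually tab-separated, the whole file collapses into a single "line", and schema inference then treats all of the data as one header row. A minimal stdlib-only sketch (the inlined string is a shortened stand-in for input_data.csv; no Spark APIs are involved):

```java
public class RecordSeparatorDemo {
    public static void main(String[] args) {
        // Three records (header + two data rows) joined by '\t' instead of '\n'.
        String data = "\"dteday\",\"hr\"" + "\t"
                    + "\"2012-01-01\",\"0\"" + "\t"
                    + "\"2012-01-01\",\"1\"";

        // A reader that assumes newline record separators sees one giant record,
        // which is then taken as the header: all data becomes column names.
        String[] newlineSplit = data.split("\n");
        System.out.println("records assuming \\n: " + newlineSplit.length); // 1

        // A reader that honors the custom separator sees header + 2 data rows.
        String[] tabSplit = data.split("\t");
        System.out.println("records assuming \\t: " + tabSplit.length);     // 3
    }
}
```

This is why an explicit schema masks the problem: Spark no longer needs to interpret the first record as a header to name and type the columns, so the mis-split of records is less visible at schema-discovery time.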