[ https://issues.apache.org/jira/browse/SPARK-26372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-26372:
------------------------------------

    Assignee:     (was: Apache Spark)

> CSV parsing uses previous good value for bad input field
> --------------------------------------------------------
>
>                 Key: SPARK-26372
>                 URL: https://issues.apache.org/jira/browse/SPARK-26372
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Bruce Robbins
>            Priority: Major
>
> For example:
> {noformat}
> bash-3.2$ cat test.csv
> "hello",1999-08-01
> "there","bad date"
> "again","2017-11-22"
> bash-3.2$ bin/spark-shell
> ..etc..
> scala> import org.apache.spark.sql.types._
> scala> import org.apache.spark.sql.SaveMode
> scala> var schema = StructType(StructField("col1", StringType) ::
>      | StructField("col2", DateType) ::
>      | Nil)
> schema: org.apache.spark.sql.types.StructType = StructType(StructField(col1,StringType,true), StructField(col2,DateType,true))
> scala> val df = spark.read.schema(schema).csv("test.csv")
> df: org.apache.spark.sql.DataFrame = [col1: string, col2: date]
> scala> df.show
> +-----+----------+
> | col1|      col2|
> +-----+----------+
> |hello|1999-08-01|
> |there|1999-08-01|
> |again|2017-11-22|
> +-----+----------+
> scala>
> {noformat}
> col2 from the second row contains "1999-08-01", when it should contain null.
> This is because UnivocityParser reuses the same Row object for each input record. If there is an exception converting an input field, the code simply skips over that field, leaving the existing value in the Row object.
> The simple fix is to set the column to null in the Row object whenever there is a badRecordException while converting the input field.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
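The stale-value behavior described above, and the proposed fix, can be sketched outside Spark. This is a minimal standalone illustration, not Spark's actual UnivocityParser code: the object names and the `parse` helper are hypothetical, but the pattern is the same one the report describes (one row buffer reused across records, so a failed conversion must explicitly null the column rather than skip it).

```scala
import java.time.LocalDate
import scala.util.Try

// Hypothetical sketch of the bug and fix: a parser that reuses a single
// row buffer across input records, as UnivocityParser reuses one Row.
object RowReuseDemo {
  // One shared buffer for all records; without an explicit reset, a
  // conversion failure would leave the previous record's value behind.
  private val row = new Array[Any](2)

  def parse(fields: Array[String]): Array[Any] = {
    row(0) = fields(0)
    // The fix from the report: on a failed conversion, set the column
    // to null instead of leaving the existing value in the buffer.
    row(1) = Try(LocalDate.parse(fields(1))).getOrElse(null)
    row
  }
}
```

With this reset in place, parsing `("hello", "1999-08-01")` and then `("there", "bad date")` leaves `row(1)` null for the second record instead of carrying over `1999-08-01`, matching the expected `df.show` output in the report.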