[ 
https://issues.apache.org/jira/browse/SPARK-16512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-16512.
----------------------------------
    Resolution: Duplicate

> No way to load CSV data without dropping whole rows when some of the data 
> does not match the given schema
> ------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-16512
>                 URL: https://issues.apache.org/jira/browse/SPARK-16512
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Hyukjin Kwon
>            Priority: Minor
>
> Currently, there is no way to read CSV data without dropping whole rows when 
> some of the data does not match the given schema.
> There are use cases such as the one below:
> {code}
> a,b
> 1,c
> {code}
> Here, {{a}} can be dirty data in real use cases.
> But the code below:
> {code}
> import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType}
> val path = "/tmp/test.csv"
> val schema = StructType(
>   StructField("a", IntegerType, nullable = true) ::
>   StructField("b", StringType, nullable = true) :: Nil)
> val df = spark.read
>   .format("csv")
>   .option("mode", "PERMISSIVE")
>   .schema(schema)
>   .load(path)
> df.show()
> {code}
> emits the exception below:
> {code}
> java.lang.NumberFormatException: For input string: "a"
>       at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>       at java.lang.Integer.parseInt(Integer.java:580)
>       at java.lang.Integer.parseInt(Integer.java:615)
>       at 
> scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
>       at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
>       at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:244)
> {code}
> With {{DROPMALFORMED}} and {{FAILFAST}} modes, the row will be dropped or the 
> read will fail with an exception, respectively.
> FYI, this is not the case for JSON, because the JSON data source can handle 
> this with {{PERMISSIVE}} mode, as shown below:
> {code}
> import org.apache.spark.sql.types.{StructType, StructField, IntegerType}
> val rdd = spark.sparkContext.makeRDD(Seq("{\"a\" : 1}", "{\"a\" : \"a\"}"))
> val schema = StructType(StructField("a", IntegerType, nullable = true) :: Nil)
> spark.read.option("mode", "PERMISSIVE").schema(schema).json(rdd).show()
> {code}
> {code}
> +----+
> |   a|
> +----+
> |   1|
> |null|
> +----+
> {code}
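> The desired {{PERMISSIVE}} behavior for CSV could, as a sketch, substitute null 
> for any field that fails to cast instead of dropping or failing the whole row. 
> A plain-Scala illustration of that null-on-mismatch semantics (illustrative 
> names only, not Spark internals):

```scala
import scala.util.Try

// Illustrative sketch: cast a CSV field to Int, substituting null when the
// value does not match the expected type, instead of failing the whole row.
// This mirrors the null-filling behavior the JSON source shows above.
def castToIntOrNull(value: String): Any =
  Try(value.trim.toInt).toOption.orNull

// Simulate the two CSV rows "a,b" and "1,c" from the example:
val rows = Seq("a,b", "1,c").map(_.split(","))
val parsed = rows.map { case Array(a, b) => (castToIntOrNull(a), b) }
// parsed == Seq((null, "b"), (1, "c"))
```

> Applied to the CSV example above, the dirty row {{a,b}} would then yield 
> {{(null, b)}} instead of throwing a {{NumberFormatException}}.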
> Please refer to https://github.com/databricks/spark-csv/pull/298



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
