[ https://issues.apache.org/jira/browse/SPARK-16512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-16512.
----------------------------------
    Resolution: Duplicate

> No way to load CSV data without dropping whole rows when some of data is not matched with given schema
> ------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-16512
>                 URL: https://issues.apache.org/jira/browse/SPARK-16512
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Hyukjin Kwon
>            Priority: Minor
>
> Currently, there is no way to read CSV data without dropping whole rows when some of the data does not match the given schema.
> There seem to be use cases such as the one below:
> {code}
> a,b
> 1,c
> {code}
> Here, {{a}} can be dirty data in real use cases.
> But the code below:
> {code}
> import org.apache.spark.sql.types._
>
> val path = "/tmp/test.csv"
> val schema = StructType(
>   StructField("a", IntegerType, nullable = true) ::
>   StructField("b", StringType, nullable = true) :: Nil)
> val df = spark.read
>   .format("csv")
>   .option("mode", "PERMISSIVE")
>   .schema(schema)
>   .load(path)
> df.show()
> {code}
> emits the exception below:
> {code}
> java.lang.NumberFormatException: For input string: "a"
> 	at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
> 	at java.lang.Integer.parseInt(Integer.java:580)
> 	at java.lang.Integer.parseInt(Integer.java:615)
> 	at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
> 	at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
> 	at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:244)
> {code}
> With {{DROPMALFORMED}} and {{FAILFAST}}, the row is dropped or the read fails with an exception, respectively.
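
The behaviour requested above (keep the row and put a null in the field that does not match the schema) can be sketched in plain Scala without Spark. This is only an illustration under stated assumptions: {{ParsedRow}}, {{castToInt}} and {{parseLine}} are hypothetical names made up for this sketch, not part of Spark's CSV data source, and the two-column layout simply mirrors the {{a,b}} / {{1,c}} sample.

```scala
import scala.util.Try

// Hypothetical row type for the two-column sample (a: Int, b: String).
// None stands in for the null a permissive reader would emit.
case class ParsedRow(a: Option[Int], b: Option[String])

object PermissiveCastSketch {
  // Hypothetical helper: yields None instead of throwing
  // NumberFormatException when the token is not a valid integer.
  def castToInt(token: String): Option[Int] = Try(token.trim.toInt).toOption

  // Splits one CSV line naively on commas (no quoting or escaping handled)
  // and casts each field, keeping the row even if a field fails to cast.
  def parseLine(line: String): ParsedRow = {
    val cols = line.split(",", -1).toList
    ParsedRow(cols.lift(0).flatMap(castToInt), cols.lift(1))
  }

  def main(args: Array[String]): Unit = {
    Seq("a,b", "1,c").map(parseLine).foreach(println)
    // prints:
    // ParsedRow(None,Some(b))
    // ParsedRow(Some(1),Some(c))
  }
}
```

The first line, whose {{a}} field is not an integer, survives with an empty value rather than being dropped or raising an exception.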
> FYI, this is not the case for JSON, because the JSON data source can handle this with {{PERMISSIVE}} mode as below:
> {code}
> val rdd = spark.sparkContext.makeRDD(Seq("{\"a\" : 1}", "{\"a\" : \"a\"}"))
> val schema = StructType(StructField("a", IntegerType, nullable = true) :: Nil)
> spark.read.option("mode", "PERMISSIVE").schema(schema).json(rdd).show()
> {code}
> {code}
> +----+
> |   a|
> +----+
> |   1|
> |null|
> +----+
> {code}
> Please refer to https://github.com/databricks/spark-csv/pull/298