[ 
https://issues.apache.org/jira/browse/SPARK-25545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16628949#comment-16628949
 ] 

Steven Bakhtiari commented on SPARK-25545:
------------------------------------------

Somebody on Stack Overflow pointed me to an older ticket, SPARK-10848, that 
appears to touch on the same issue.

> CSV loading with DROPMALFORMED mode doesn't correctly drop rows that do not 
> conform to non-nullable schema fields
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-25545
>                 URL: https://issues.apache.org/jira/browse/SPARK-25545
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0, 2.3.1, 2.3.2
>            Reporter: Steven Bakhtiari
>            Priority: Minor
>              Labels: CSV, csv, csvparser
>
> I'm loading a CSV file into a dataframe using Spark. I have defined a Schema 
> and specified one of the fields as non-nullable.
> When setting the mode to {{DROPMALFORMED}}, I expect any rows in the CSV with 
> missing (null) values for those columns to result in the whole row being 
> dropped. At the moment, the CSV loader correctly drops rows that do not 
> conform to the field type, but the nullable property is seemingly ignored.
> Example CSV input:
> {code:java}
> 1,2,3
> 1,,3
> ,2,3
> 1,2,abc
> {code}
> Example Spark job:
> {code:java}
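> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
>
> val spark = SparkSession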
>   .builder()
>   .appName("csv-test")
>   .master("local")
>   .getOrCreate()
> spark.read
>   .format("csv")
>   .schema(StructType(
>     StructField("col1", IntegerType, nullable = false) ::
>       StructField("col2", IntegerType, nullable = false) ::
>       StructField("col3", IntegerType, nullable = false) :: Nil))
>   .option("header", false)
>   .option("mode", "DROPMALFORMED")
>   .load("path/to/file.csv")
>   .coalesce(1)
>   .write
>   .format("csv")
>   .option("header", false)
>   .save("path/to/output")
> {code}
> The actual output will be:
> {code:java}
> 1,2,3
> 1,,3
> ,2,3{code}
> Note that the row containing non-integer values has been dropped, as 
> expected, but rows containing null values persist, despite the nullable 
> property being set to false in the schema definition.
> My expected output is:
> {code:java}
> 1,2,3{code}
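
A possible workaround for the behaviour described in the quoted report, offered 
only as a sketch: keep {{DROPMALFORMED}} for type errors and then explicitly 
drop rows that have a null in any column the schema declares non-nullable. 
The names {{schema}}, {{requiredCols}} and {{cleaned}} are illustrative and not 
part of the original report; the snippet assumes a spark-shell or notebook 
session and the same placeholder paths as the example job.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val spark = SparkSession
  .builder()
  .appName("csv-test")
  .master("local")
  .getOrCreate()

// Same schema as in the report: all three columns declared non-nullable.
val schema = StructType(
  StructField("col1", IntegerType, nullable = false) ::
    StructField("col2", IntegerType, nullable = false) ::
    StructField("col3", IntegerType, nullable = false) :: Nil)

// Column names taken from the declared schema (assumption: the declared
// schema, not the loaded DataFrame's schema, is the source of truth here).
val requiredCols: Seq[String] =
  schema.fields.filterNot(_.nullable).map(_.name).toSeq

val cleaned = spark.read
  .format("csv")
  .schema(schema)
  .option("header", false)
  .option("mode", "DROPMALFORMED") // still drops rows with type errors
  .load("path/to/file.csv")
  .na.drop(requiredCols) // additionally drops rows with a null in a required column

cleaned
  .coalesce(1)
  .write
  .format("csv")
  .option("header", false)
  .save("path/to/output")
```

This does not change how the CSV loader treats the nullable flag; it only 
applies the null check as a separate step after loading.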


