[ https://issues.apache.org/jira/browse/SPARK-25545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16628949#comment-16628949 ]
Steven Bakhtiari commented on SPARK-25545:
------------------------------------------

Somebody on SO pointed me to this older ticket that appears to touch on the same issue. SPARK-10848

> CSV loading with DROPMALFORMED mode doesn't correctly drop rows that do not confirm to non-nullable schema fields
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-25545
>                 URL: https://issues.apache.org/jira/browse/SPARK-25545
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0, 2.3.1, 2.3.2
>            Reporter: Steven Bakhtiari
>            Priority: Minor
>              Labels: CSV, csv, csvparser
>
> I'm loading a CSV file into a dataframe using Spark. I have defined a Schema
> and specified one of the fields as non-nullable.
> When setting the mode to {{DROPMALFORMED}}, I expect any rows in the CSV with
> missing (null) values for those columns to result in the whole row being
> dropped. At the moment, the CSV loader correctly drops rows that do not
> conform to the field type, but the nullable property is seemingly ignored.
> Example CSV input:
> {code:java}
> 1,2,3
> 1,,3
> ,2,3
> 1,2,abc
> {code}
> Example Spark job:
> {code:java}
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
>
> val spark = SparkSession
>   .builder()
>   .appName("csv-test")
>   .master("local")
>   .getOrCreate()
>
> // Read with an explicit schema in which every column is non-nullable,
> // then write the surviving rows back out as a single CSV file.
> spark.read
>   .format("csv")
>   .schema(StructType(
>     StructField("col1", IntegerType, nullable = false) ::
>     StructField("col2", IntegerType, nullable = false) ::
>     StructField("col3", IntegerType, nullable = false) :: Nil))
>   .option("header", false)
>   .option("mode", "DROPMALFORMED")
>   .load("path/to/file.csv")
>   .coalesce(1)
>   .write
>   .format("csv")
>   .option("header", false)
>   .save("path/to/output")
> {code}
> The actual output will be:
> {code:java}
> 1,2,3
> 1,,3
> ,2,3{code}
> Note that the row containing non-integer values has been dropped, as
> expected, but rows containing null values persist, despite the nullable
> property being set to false in the schema definition.
> My expected output is:
> {code:java}
> 1,2,3{code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
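
For anyone hitting the same behaviour, one possible workaround is to drop the null rows explicitly after the read rather than relying on the schema's nullable flag. The following is only a sketch under assumptions: the object name {{DropNullRowsWorkaround}} and the {{requiredCols}} helper value are made up for illustration, the paths are the placeholders from the report, and it assumes the reader keeps rows with missing values as described above. It is not a fix for the loader itself.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Hypothetical object name, used only for this illustration.
object DropNullRowsWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("csv-test")
      .master("local")
      .getOrCreate()

    // Same schema as in the report: every column declared non-nullable.
    val schema = StructType(
      StructField("col1", IntegerType, nullable = false) ::
      StructField("col2", IntegerType, nullable = false) ::
      StructField("col3", IntegerType, nullable = false) :: Nil)

    // Names of the columns we declared non-nullable. This is taken from our
    // own schema value rather than from the loaded DataFrame's schema.
    val requiredCols: Seq[String] = schema.filterNot(_.nullable).map(_.name)

    val df = spark.read
      .format("csv")
      .schema(schema)
      .option("header", false)
      .option("mode", "DROPMALFORMED") // still drops rows that fail type conversion
      .load("path/to/file.csv")        // placeholder path from the report

    // Explicitly drop any row that has a null in one of the required columns.
    val cleaned = df.na.drop("any", requiredCols)

    cleaned
      .coalesce(1)
      .write
      .format("csv")
      .option("header", false)
      .save("path/to/output")          // placeholder path from the report

    spark.stop()
  }
}
```

The explicit {{na.drop}} step keeps the intent (no nulls in the declared columns) visible in the job itself, at the cost of restating what the schema already says.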