Steven Bakhtiari created SPARK-25545:
----------------------------------------

             Summary: CSV loading with DROPMALFORMED mode doesn't correctly 
drop rows that do not conform to non-nullable schema fields
                 Key: SPARK-25545
                 URL: https://issues.apache.org/jira/browse/SPARK-25545
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.3.2, 2.3.1, 2.3.0
            Reporter: Steven Bakhtiari


I'm loading a CSV file into a DataFrame using Spark. I have defined a schema 
and marked its fields as non-nullable.

When setting the mode to {{DROPMALFORMED}}, I expect any row in the CSV with a 
missing (null) value in a non-nullable column to be dropped entirely. At the 
moment, the CSV loader correctly drops rows that do not conform to the field 
type, but the nullable property appears to be ignored.

Example CSV input:
{code:java}
1,2,3
1,,3
,2,3
1,2,abc
{code}
Example Spark job:
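{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val spark = SparkSession
  .builder()
  .appName("csv-test")
  .master("local")
  .getOrCreate()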

spark.read
  .format("csv")
  .schema(StructType(
    StructField("col1", IntegerType, nullable = false) ::
      StructField("col2", IntegerType, nullable = false) ::
      StructField("col3", IntegerType, nullable = false) :: Nil))
  .option("header", false)
  .option("mode", "DROPMALFORMED")
  .load("path/to/file.csv")
  .coalesce(1)
  .write
  .format("csv")
  .option("header", false)
  .save("path/to/output")
{code}
The actual output is:
{code:java}
1,2,3
1,,3
,2,3{code}
Note that the row containing a non-integer value ({{1,2,abc}}) has been dropped, 
as expected, but the rows containing null values persist, despite nullable 
being set to false for every field in the schema definition.

My expected output is:
{code:java}
1,2,3{code}
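
Until the loader honours the nullable flag, one possible interim workaround is to filter null rows explicitly after the read. The following is only a sketch, not a fix for the underlying issue: the object name {{DropNullRowsWorkaround}} and the helper {{readDroppingNulls}} are hypothetical, and it assumes that declaring the fields as nullable and post-filtering with {{na.drop()}} is acceptable for the job.
{code:scala}
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

object DropNullRowsWorkaround {

  // Assumption: fields are declared nullable here so that missing values are
  // read as nulls, which the explicit filter below can then remove.
  val schema: StructType = StructType(
    StructField("col1", IntegerType, nullable = true) ::
      StructField("col2", IntegerType, nullable = true) ::
      StructField("col3", IntegerType, nullable = true) :: Nil)

  // Hypothetical helper: read the CSV with DROPMALFORMED (drops rows whose
  // values do not parse as integers), then drop rows with a null in any column.
  def readDroppingNulls(spark: SparkSession, path: String): DataFrame =
    spark.read
      .schema(schema)
      .option("header", false)
      .option("mode", "DROPMALFORMED")
      .csv(path)
      .na.drop()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("csv-test")
      .master("local")
      .getOrCreate()

    // Write the example input from this report to a temporary file.
    val input = Files.createTempFile("csv-test", ".csv")
    Files.write(input, "1,2,3\n1,,3\n,2,3\n1,2,abc\n".getBytes("UTF-8"))

    readDroppingNulls(spark, input.toString).show()

    spark.stop()
  }
}
{code}
With the example input above, this is intended to leave only the {{1,2,3}} row. It costs an extra filter on every read and does not enforce the schema's nullable flag itself, so it is a stopgap rather than the behaviour I'd expect from {{DROPMALFORMED}}.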


