[ 
https://issues.apache.org/jira/browse/SPARK-28058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865714#comment-16865714
 ] 

Liang-Chi Hsieh commented on SPARK-28058:
-----------------------------------------

[~hyukjin.kwon] Do you mean that this is suspected to be a bug?

{code}
scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false)
+------+------+
|fruit |color |
+------+------+
|apple |red   |
|banana|yellow|
|orange|orange|
|xxx   |null  |
+------+------+
{code}

In this case the reader is asked for only two columns, but the corrupted
record contains just one. It should reasonably be dropped as malformed.
Instead, the missing column is filled with null.
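
A possible way to narrow this down, as a sketch only: Spark 2.4 has a SQL
setting, `spark.sql.csv.parser.columnPruning.enabled`, and my assumption (not
confirmed in this thread) is that it lets the CSV parser skip columns that are
not selected. Re-running the same query in spark-shell with that setting turned
off would show whether pruning is involved. The names `reader` and `rows` are
only for illustration.

```scala
// Assumes a spark-shell session (so `spark` is predefined) and the
// fruit.csv file from the issue description in the working directory.

// Assumption: this setting controls whether the CSV parser only parses
// the columns that the query actually selects.
spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", "false")

// Same read as above, selecting only two of the four columns.
val reader = spark.read.option("header", "true").option("mode", "DROPMALFORMED")
val rows = reader.csv("fruit.csv").select("fruit", "color").collect()

// Print what survived so the result can be compared with the table above.
rows.foreach(println)

// Restore the default so later queries in the session are unaffected.
spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", "true")
```

If the `xxx` row no longer appears with pruning off, the null-filled row most
likely comes from the parser seeing only the requested columns, rather than
from the DROPMALFORMED mode itself.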

> Reading csv with DROPMALFORMED sometimes doesn't drop malformed records
> -----------------------------------------------------------------------
>
>                 Key: SPARK-28058
>                 URL: https://issues.apache.org/jira/browse/SPARK-28058
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.1, 2.4.3
>            Reporter: Stuart White
>            Priority: Minor
>              Labels: CSV, csv, csvparser
>
> The Spark SQL CSV reader is not dropping malformed records as expected.
> Consider this file (fruit.csv).  Notice it contains a header record, 3 valid 
> records, and one malformed record.
> {noformat}
> fruit,color,price,quantity
> apple,red,1,3
> banana,yellow,2,4
> orange,orange,3,5
> xxx
> {noformat}
> If I read this file using the Spark SQL CSV reader as follows, everything
> looks good.  The malformed record is dropped.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").show(truncate=false)
> +------+------+-----+--------+
> |fruit |color |price|quantity|
> +------+------+-----+--------+
> |apple |red   |1    |3       |
> |banana|yellow|2    |4       |
> |orange|orange|3    |5       |
> +------+------+-----+--------+
> {noformat}
> However, if I select a subset of the columns, the malformed record is not 
> dropped.  The malformed data is placed in the first column, and the remaining 
> column(s) are filled with nulls.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false)
> +------+
> |fruit |
> +------+
> |apple |
> |banana|
> |orange|
> |xxx   |
> +------+
> scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false)
> +------+------+
> |fruit |color |
> +------+------+
> |apple |red   |
> |banana|yellow|
> |orange|orange|
> |xxx   |null  |
> +------+------+
> scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price).show(truncate=false)
> +------+------+-----+
> |fruit |color |price|
> +------+------+-----+
> |apple |red   |1    |
> |banana|yellow|2    |
> |orange|orange|3    |
> |xxx   |null  |null |
> +------+------+-----+
> {noformat}
> And finally, if I manually select all of the columns, the malformed record is 
> once again dropped.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price, 'quantity).show(truncate=false)
> +------+------+-----+--------+
> |fruit |color |price|quantity|
> +------+------+-----+--------+
> |apple |red   |1    |3       |
> |banana|yellow|2    |4       |
> |orange|orange|3    |5       |
> +------+------+-----+--------+
> {noformat}
> I would expect the malformed record(s) to be dropped regardless of which 
> columns are being selected from the file.


