[ https://issues.apache.org/jira/browse/SPARK-28058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865714#comment-16865714 ]

Liang-Chi Hsieh edited comment on SPARK-28058 at 6/17/19 3:59 PM:
------------------------------------------------------------------

[~hyukjin.kwon] Do you mean this is suspected to be a bug:

{code}
scala> spark.read.option("header", "true").option("mode", 
"DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false)
+------+------+
|fruit |color |
+------+------+
|apple |red   |
|banana|yellow|
|orange|orange|
|xxx   |null  |
+------+------+
{code}

In this case, the reader should read two columns, but the corrupted record has 
only one. Reasonably, it should be dropped as malformed. Instead, we see the 
missing column filled with null.

This behavior seems to be inherited from the Univocity parser, which we use 
with {{CsvParserSettings.selectIndexes}} to do field selection. In the above 
case, the parser returns two tokens where the second token is just null. I'm 
not sure whether this is known behavior of the Univocity parser or a bug in 
the Univocity parser.
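To check whether the padding comes from Univocity itself, independently of Spark, a minimal standalone sketch against the univocity-parsers library might look like the following. This is an assumption-laden reproduction, not a verified test case: the class name is hypothetical, and the exact padding behavior may vary by univocity-parsers version.

{code:java}
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

// Hypothetical reproduction sketch; requires the univocity-parsers jar
// (the same library Spark's CSV data source uses internally).
public class SelectIndexesCheck {
    public static void main(String[] args) {
        CsvParserSettings settings = new CsvParserSettings();
        // Select only the first two fields, mirroring select('fruit, 'color).
        settings.selectIndexes(0, 1);
        CsvParser parser = new CsvParser(settings);

        // A malformed record carrying a single field.
        String[] tokens = parser.parseLine("xxx");

        // If the parser pads the selection, this should show two tokens,
        // the second being null -- matching what Spark then treats as a
        // "complete" (non-malformed) row.
        System.out.println("token count: " + tokens.length);
        for (String t : tokens) {
            System.out.println("token: " + t);
        }
    }
}
{code}

If this prints two tokens with the second one null, the padding happens inside Univocity before Spark's malformed-record check ever sees the row.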



> Reading csv with DROPMALFORMED sometimes doesn't drop malformed records
> -----------------------------------------------------------------------
>
>                 Key: SPARK-28058
>                 URL: https://issues.apache.org/jira/browse/SPARK-28058
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.1, 2.4.3
>            Reporter: Stuart White
>            Priority: Minor
>              Labels: CSV, csv, csvparser
>
> The Spark SQL CSV reader is not dropping malformed records as expected.
> Consider this file (fruit.csv).  Notice it contains a header record, 3 valid 
> records, and one malformed record.
> {noformat}
> fruit,color,price,quantity
> apple,red,1,3
> banana,yellow,2,4
> orange,orange,3,5
> xxx
> {noformat}
> If I read this file using the spark sql csv reader as follows, everything 
> looks good.  The malformed record is dropped.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").show(truncate=false)
> +------+------+-----+--------+
> |fruit |color |price|quantity|
> +------+------+-----+--------+
> |apple |red   |1    |3       |
> |banana|yellow|2    |4       |
> |orange|orange|3    |5       |
> +------+------+-----+--------+
> {noformat}
> However, if I select a subset of the columns, the malformed record is not 
> dropped.  The malformed data is placed in the first column, and the remaining 
> column(s) are filled with nulls.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit).show(truncate=false)
> +------+
> |fruit |
> +------+
> |apple |
> |banana|
> |orange|
> |xxx   |
> +------+
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false)
> +------+------+
> |fruit |color |
> +------+------+
> |apple |red   |
> |banana|yellow|
> |orange|orange|
> |xxx   |null  |
> +------+------+
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 
> 'price).show(truncate=false)
> +------+------+-----+
> |fruit |color |price|
> +------+------+-----+
> |apple |red   |1    |
> |banana|yellow|2    |
> |orange|orange|3    |
> |xxx   |null  |null |
> +------+------+-----+
> {noformat}
> And finally, if I manually select all of the columns, the malformed record is 
> once again dropped.
> {noformat}
> scala> spark.read.option("header", "true").option("mode", 
> "DROPMALFORMED").csv("fruit.csv").select('fruit, 'color, 'price, 
> 'quantity).show(truncate=false)
> +------+------+-----+--------+
> |fruit |color |price|quantity|
> +------+------+-----+--------+
> |apple |red   |1    |3       |
> |banana|yellow|2    |4       |
> |orange|orange|3    |5       |
> +------+------+-----+--------+
> {noformat}
> I would expect the malformed record(s) to be dropped regardless of which 
> columns are being selected from the file.
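Since the symptom only appears when a subset of columns is selected, it looks tied to CSV column pruning. A possible workaround sketch, assuming the {{spark.sql.csv.parser.columnPruning.enabled}} configuration available in Spark 2.4 applies here (untested against this exact reproduction), is to disable pruning so the full row is parsed before projection:

{code}
// Sketch of a possible workaround: disable CSV column pruning so the parser
// sees all four columns and can recognize the short record as malformed.
scala> spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", false)

scala> spark.read.option("header", "true").option("mode",
"DROPMALFORMED").csv("fruit.csv").select('fruit, 'color).show(truncate=false)
{code}

The trade-off is that every column is parsed for every row, so reads of wide files get slower; this only makes sense if dropping malformed records matters more than projection pushdown.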



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
