[ 
https://issues.apache.org/jira/browse/SPARK-29058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930232#comment-16930232
 ] 

Suchintak Patnaik commented on SPARK-29058:
-------------------------------------------

[~hyukjin.kwon] If I define quantity as type int, that record is dropped, 
but the count shown is incorrect.

There can be situations where a few records do not match the data types 
defined in the schema, and the requirement is to drop such records while loading.
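
To make the expected behavior concrete, here is a minimal plain-Python sketch 
(not Spark code, and not how Spark implements DROPMALFORMED). It uses the 
fruit.csv sample and schema from the issue description; the helper name 
load_dropping_malformed and the SCHEMA list of converters are hypothetical, 
introduced only for illustration. A row is kept only if every field converts 
to its declared type, and the count is taken over the rows that were kept.

```python
import csv
import io

# Sample data copied from the issue description (fruit.csv).
FRUIT_CSV = "apple,red,1,3\nbanana,yellow,2,4.56\norange,orange,3,5\n"

# Hypothetical per-column converters mirroring the schema
# "Fruit string,color string,price int,quantity int".
SCHEMA = [("Fruit", str), ("color", str), ("price", int), ("quantity", int)]


def load_dropping_malformed(text, schema):
    """Return only the rows whose every field converts to its declared type."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if len(record) != len(schema):
            continue  # wrong number of fields: treat the row as malformed
        try:
            rows.append(
                {name: cast(value) for (name, cast), value in zip(schema, record)}
            )
        except ValueError:
            continue  # e.g. int("4.56") raises, so that row is dropped
    return rows


rows = load_dropping_malformed(FRUIT_CSV, SCHEMA)
print(len(rows))                       # count of rows that passed validation
print([row["Fruit"] for row in rows])  # names of the surviving rows
```

With this definition the count and the displayed rows always agree, which is 
the behavior I would expect from df.count() after df.show() drops the row.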

> Reading csv file with DROPMALFORMED showing incorrect record count
> ------------------------------------------------------------------
>
>                 Key: SPARK-29058
>                 URL: https://issues.apache.org/jira/browse/SPARK-29058
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 2.3.0
>            Reporter: Suchintak Patnaik
>            Priority: Minor
>
> The Spark SQL CSV reader drops malformed records as expected, but the 
> record count is reported incorrectly.
> Consider this file (fruit.csv):
> {code}
> apple,red,1,3
> banana,yellow,2,4.56
> orange,orange,3,5
> {code}
> Defining schema as follows:
> {code}
> schema = "Fruit string,color string,price int,quantity int"
> {code}
> Notice that the "quantity" field is defined as integer type, but the 2nd row 
> in the file contains a floating point value, hence it is a corrupt record.
> {code}
> >>> df = spark.read.csv(path="fruit.csv",mode="DROPMALFORMED",schema=schema)
> >>> df.show()
> +------+------+-----+--------+
> | Fruit| color|price|quantity|
> +------+------+-----+--------+
> | apple|   red|    1|       3|
> |orange|orange|    3|       5|
> +------+------+-----+--------+
> >>> df.count()
> 3
> {code}
> The malformed record is dropped as expected, but an incorrect record count 
> is displayed.
> Here df.count() should return 2.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

