[
https://issues.apache.org/jira/browse/SPARK-29058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930258#comment-16930258
]
Suchintak Patnaik commented on SPARK-29058:
-------------------------------------------
[~hyukjin.kwon]
1) As per this
([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L119-L126])
*it's disallowed if only the corrupt record column is referenced.*
However, my schema does not define a corrupt record column, because I am
using DROPMALFORMED mode, not PERMISSIVE.
2) As you mentioned earlier, count() does not need to read any columns, but
my purpose here is precisely to count the rows.
3) Although the workaround works, *df.cache().count()* is not appropriate
when the base dataset is large, because it caches the dataset in memory.
Before running a series of operations on the dataset, I want to drop the
corrupt records and keep track of the count.
4) My question is why the DataFrame count is wrong even though the malformed
rows are being discarded.
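
To make the difference behind points 2 and 4 concrete, here is a minimal
sketch in plain Python. It is not Spark's implementation; the function names
and the SCHEMA list are hypothetical stand-ins for illustration. It assumes
the explanation referenced in point 2: a count that reads no columns never
converts any field, so it cannot notice a malformed value.

```python
import csv
import io

# Contents of fruit.csv from the issue description.
FRUIT_CSV = "apple,red,1,3\nbanana,yellow,2,4.56\norange,orange,3,5\n"

# Hypothetical stand-in for the schema
# "Fruit string,color string,price int,quantity int".
SCHEMA = [str, str, int, int]


def count_without_parsing(text):
    """Count records without converting any field (no column is read)."""
    return sum(1 for _ in csv.reader(io.StringIO(text)))


def count_after_parsing(text, schema):
    """Count only records whose every field converts to its schema type."""
    kept = 0
    for row in csv.reader(io.StringIO(text)):
        if len(row) != len(schema):
            continue  # wrong number of fields: treat as malformed
        try:
            for convert, value in zip(schema, row):
                convert(value)  # int("4.56") raises ValueError
        except ValueError:
            continue  # at least one field is malformed: drop the row
        kept += 1
    return kept


print(count_without_parsing(FRUIT_CSV))        # 3
print(count_after_parsing(FRUIT_CSV, SCHEMA))  # 2
```

The first function mirrors the count I am seeing (3), the second the count I
expect once the malformed row is dropped (2).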
> Reading csv file with DROPMALFORMED showing incorrect record count
> ------------------------------------------------------------------
>
> Key: SPARK-29058
> URL: https://issues.apache.org/jira/browse/SPARK-29058
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 2.3.0
> Reporter: Suchintak Patnaik
> Priority: Minor
>
> The Spark SQL CSV reader drops malformed records as expected, but the
> reported record count is incorrect.
> Consider this file (fruit.csv)
> {code}
> apple,red,1,3
> banana,yellow,2,4.56
> orange,orange,3,5
> {code}
> Defining schema as follows:
> {code}
> schema = "Fruit string,color string,price int,quantity int"
> {code}
> Notice that the "quantity" field is defined as integer type, but the 2nd row
> in the file contains a floating point value, hence it is a corrupt record.
> {code}
> >>> df = spark.read.csv(path="fruit.csv",mode="DROPMALFORMED",schema=schema)
> >>> df.show()
> +------+------+-----+--------+
> | Fruit| color|price|quantity|
> +------+------+-----+--------+
> | apple|   red|    1|       3|
> |orange|orange|    3|       5|
> +------+------+-----+--------+
> >>> df.count()
> 3
> {code}
> The malformed record is dropped as expected, but an incorrect record count
> is displayed.
> Here df.count() should return 2.
>
>