[ https://issues.apache.org/jira/browse/SPARK-24147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rich Smith updated SPARK-24147:
-------------------------------
    Summary: .count() reports wrong size of dataframe when filtering dataframe on corrupt record field  (was: .count() reports wrong size of dataframe when filtering dataframe on )

> .count() reports wrong size of dataframe when filtering dataframe on corrupt record field
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24147
>                 URL: https://issues.apache.org/jira/browse/SPARK-24147
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core
>    Affects Versions: 2.2.1
>         Environment: Spark version 2.2.1
>                      Pyspark
>                      Python version 3.6.4
>            Reporter: Rich Smith
>            Priority: Major
>
> Spark reports the wrong size of the dataframe via .count() after filtering on the corrupt record field.
> Example script that reproduces the problem:
>
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import StringType, StructType, StructField, DoubleType
> from pyspark.sql.functions import col, lit
>
> spark = SparkSession.builder.master("local[3]").appName("pyspark-unittest").getOrCreate()
> spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
>
> SCHEMA = StructType([
>     StructField("headerDouble", DoubleType(), False),
>     StructField("ErrorField", StringType(), False)
> ])
>
> dataframe = (
>     spark.read
>     .option("header", "true")
>     .option("mode", "PERMISSIVE")
>     .option("columnNameOfCorruptRecord", "ErrorField")
>     .schema(SCHEMA).csv("./x.csv")
> )
>
> total_row_count = dataframe.count()
> print("total_row_count = " + str(total_row_count))
>
> errors = dataframe.filter(col("ErrorField").isNotNull())
> errors.show()
>
> error_count = errors.count()
> print("errors count = " + str(error_count))
> {code}
>
> Using input file x.csv:
>
> {code:java}
> headerDouble
> wrong
> {code}
>
> Output text. As shown, the dataframe contains a row, but .count() reports 0:
>
> {code:java}
> total_row_count = 1
> +------------+----------+
> |headerDouble|ErrorField|
> +------------+----------+
> |        null|     wrong|
> +------------+----------+
> errors count = 0
> {code}
>
> Also discussed briefly on StackOverflow:
> https://stackoverflow.com/questions/50121899/how-can-sparks-count-function-be-different-to-the-contents-of-the-dataframe
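
A possible workaround, sketched below under the assumption that the mismatch comes from the CSV reader re-parsing the file for each action with column pruning, so a query that references only the corrupt record column never sees the malformed row. Caching the parsed dataframe makes every later action read the same materialised rows; the file name ./x.csv and the SCHEMA are reused from the repro above, and the appName is arbitrary.

{code:java}
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructType, StructField, DoubleType
from pyspark.sql.functions import col

spark = SparkSession.builder.master("local[3]").appName("pyspark-workaround").getOrCreate()

SCHEMA = StructType([
    StructField("headerDouble", DoubleType(), False),
    StructField("ErrorField", StringType(), False)
])

dataframe = (
    spark.read
    .option("header", "true")
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "ErrorField")
    .schema(SCHEMA).csv("./x.csv")
)

# Materialise the parsed rows once, so show() and count() operate on the
# same data instead of each triggering a fresh parse with column pruning.
dataframe.cache()

errors = dataframe.filter(col("ErrorField").isNotNull())
errors.show()
print("errors count = " + str(errors.count()))  # should now agree with the rows show() displays
{code}

Later Spark releases document this limitation for raw CSV/JSON sources and recommend caching or saving the parsed results before running queries that reference only the corrupt record column, which is why .cache() is used here rather than a schema change.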