Rich Smith created SPARK-24147:
----------------------------------

             Summary: .count() reports wrong size of dataframe when filtering 
dataframe on  
                 Key: SPARK-24147
                 URL: https://issues.apache.org/jira/browse/SPARK-24147
             Project: Spark
          Issue Type: Bug
          Components: PySpark, Spark Core
    Affects Versions: 2.2.1
         Environment: Spark version 2.2.1

Pyspark 

Python version 3.6.4
            Reporter: Rich Smith


Spark reports the wrong size of dataframe using .count() after filtering on a 
corruptField field.

Example file that shows the problem:

 
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructType, StructField, DoubleType
from pyspark.sql.functions import col, lit

spark = 
SparkSession.builder.master("local[3]").appName("pyspark-unittest").getOrCreate()
spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")


SCHEMA = StructType([
    StructField("headerDouble", DoubleType(), False),
    StructField("ErrorField", StringType(), False)
])

dataframe = (
    spark.read
    .option("header", "true")
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "ErrorField")
    .schema(SCHEMA).csv("./x.csv")
)

total_row_count = dataframe.count()
print("total_row_count = " + str(total_row_count))

errors = dataframe.filter(col("ErrorField").isNotNull())
errors.show()
error_count = errors.count()
print("errors count = " + str(error_count))
{code}
 

 

Using input file x.csv:

 
{code:java}
headerDouble
wrong
{code}
 

 

Output text. As shown, contents of dataframe contains a row, but .count() 
reports 0.

 
{code:java}
total_row_count = 1
+------------+----------+
|headerDouble|ErrorField|
+------------+----------+
| null| wrong|
+------------+----------+

errors count = 0
{code}
 

 

Also discussed briefly on StackOverflow: 

https://stackoverflow.com/questions/50121899/how-can-sparks-count-function-be-different-to-the-contents-of-the-dataframe



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to