Rich Smith created SPARK-24147:
----------------------------------

             Summary: .count() reports wrong size of dataframe when filtering dataframe on
                 Key: SPARK-24147
                 URL: https://issues.apache.org/jira/browse/SPARK-24147
             Project: Spark
          Issue Type: Bug
          Components: PySpark, Spark Core
    Affects Versions: 2.2.1
         Environment: Spark version 2.2.1
Pyspark Python version 3.6.4
            Reporter: Rich Smith


Spark reports the wrong size of a dataframe via .count() after filtering on the corrupt-record column.

Example file that shows the problem:
{code:java}
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructType, StructField, DoubleType
from pyspark.sql.functions import col, lit

spark = SparkSession.builder.master("local[3]").appName("pyspark-unittest").getOrCreate()
spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

SCHEMA = StructType([
    StructField("headerDouble", DoubleType(), False),
    StructField("ErrorField", StringType(), False)
])

dataframe = (
    spark.read
    .option("header", "true")
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "ErrorField")
    .schema(SCHEMA).csv("./x.csv")
)

total_row_count = dataframe.count()
print("total_row_count = " + str(total_row_count))

errors = dataframe.filter(col("ErrorField").isNotNull())
errors.show()

error_count = errors.count()
print("errors count = " + str(error_count))
{code}
Using input file x.csv:
{code:java}
headerDouble
wrong
{code}
Output text. As shown, the dataframe contains a row, but .count() reports 0:
{code:java}
total_row_count = 1
+------------+----------+
|headerDouble|ErrorField|
+------------+----------+
|        null|     wrong|
+------------+----------+

errors count = 0
{code}
Also discussed briefly on StackOverflow: https://stackoverflow.com/questions/50121899/how-can-sparks-count-function-be-different-to-the-contents-of-the-dataframe



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
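A workaround sometimes suggested for this class of PERMISSIVE-mode discrepancy is to cache the parsed dataframe before filtering on the corrupt-record column, so the count is answered from materialized rows rather than a column-pruned re-parse of the CSV. A minimal self-contained sketch of that workaround (the tempfile-based input path and the .cache() placement are my assumptions, not part of the original report; whether caching restores the expected count may depend on the Spark version):
{code:java}
import os
import tempfile

from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructType, StructField, DoubleType
from pyspark.sql.functions import col

spark = SparkSession.builder.master("local[1]").appName("spark-24147-workaround").getOrCreate()

SCHEMA = StructType([
    StructField("headerDouble", DoubleType(), False),
    StructField("ErrorField", StringType(), False)
])

# Recreate the x.csv reproduction input in a temp directory.
path = os.path.join(tempfile.mkdtemp(), "x.csv")
with open(path, "w") as f:
    f.write("headerDouble\nwrong\n")

# .cache() is the workaround under test: it materializes the fully parsed
# rows, so later queries (including the filter on ErrorField) are answered
# from the cached data instead of a fresh scan of the CSV.
dataframe = (
    spark.read
    .option("header", "true")
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "ErrorField")
    .schema(SCHEMA).csv(path)
    .cache()
)

errors = dataframe.filter(col("ErrorField").isNotNull())
error_count = errors.count()
print("errors count = " + str(error_count))
{code}
With the cache in place, the filtered count should agree with what errors.show() displays for this one-row input.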