[ https://issues.apache.org/jira/browse/SPARK-24147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rich Smith updated SPARK-24147:
-------------------------------
    Summary: .count() reports wrong size of dataframe when filtering dataframe on corrupt record field  (was: .count() reports wrong size of dataframe when filtering dataframe on )

> .count() reports wrong size of dataframe when filtering dataframe on corrupt record field
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24147
>                 URL: https://issues.apache.org/jira/browse/SPARK-24147
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core
>    Affects Versions: 2.2.1
>         Environment: Spark version 2.2.1
>                      Pyspark
>                      Python version 3.6.4
>            Reporter: Rich Smith
>            Priority: Major
>
> Spark reports the wrong size of the dataframe via .count() after filtering on the corrupt record field.
> Example script that reproduces the problem:
>
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import StringType, StructType, StructField, DoubleType
> from pyspark.sql.functions import col, lit
>
> spark = SparkSession.builder.master("local[3]").appName("pyspark-unittest").getOrCreate()
> spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
>
> SCHEMA = StructType([
>     StructField("headerDouble", DoubleType(), False),
>     StructField("ErrorField", StringType(), False)
> ])
>
> dataframe = (
>     spark.read
>     .option("header", "true")
>     .option("mode", "PERMISSIVE")
>     .option("columnNameOfCorruptRecord", "ErrorField")
>     .schema(SCHEMA).csv("./x.csv")
> )
>
> total_row_count = dataframe.count()
> print("total_row_count = " + str(total_row_count))
>
> errors = dataframe.filter(col("ErrorField").isNotNull())
> errors.show()
>
> error_count = errors.count()
> print("errors count = " + str(error_count))
> {code}
>
> Using input file x.csv:
>
> {code:java}
> headerDouble
> wrong
> {code}
>
> Output text. As shown, the dataframe contains a row, but .count() reports 0:
>
> {code:java}
> total_row_count = 1
> +------------+----------+
> |headerDouble|ErrorField|
> +------------+----------+
> |        null|     wrong|
> +------------+----------+
> errors count = 0
> {code}
>
> Also discussed briefly on StackOverflow:
> https://stackoverflow.com/questions/50121899/how-can-sparks-count-function-be-different-to-the-contents-of-the-dataframe
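
A possible workaround, sketched below under the assumption that the mismatch comes from the CSV reader re-parsing the file for each action with column pruning, so a query that references only the corrupt record column never sees the malformed row. Caching the parsed dataframe makes every later action read the same materialised rows; the file name ./x.csv and the SCHEMA are reused from the repro above, and the appName is arbitrary.

{code:java}
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructType, StructField, DoubleType
from pyspark.sql.functions import col

spark = SparkSession.builder.master("local[3]").appName("pyspark-workaround").getOrCreate()

SCHEMA = StructType([
    StructField("headerDouble", DoubleType(), False),
    StructField("ErrorField", StringType(), False)
])

dataframe = (
    spark.read
    .option("header", "true")
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "ErrorField")
    .schema(SCHEMA).csv("./x.csv")
)

# Materialise the parsed rows once, so show() and count() operate on the
# same data instead of each triggering a fresh parse with column pruning.
dataframe.cache()

errors = dataframe.filter(col("ErrorField").isNotNull())
errors.show()
print("errors count = " + str(errors.count()))  # should now agree with the rows show() displays
{code}

Later Spark releases document this limitation for raw CSV/JSON sources and recommend caching or saving the parsed results before running queries that reference only the corrupt record column, which is why .cache() is used here rather than a schema change.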