Michael Allman created SPARK-39833:
--------------------------------------

             Summary: Filtered parquet data frame count() and show() produce 
inconsistent results when spark.sql.parquet.filterPushdown is true
                 Key: SPARK-39833
                 URL: https://issues.apache.org/jira/browse/SPARK-39833
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.2.1
            Reporter: Michael Allman


One of our data scientists discovered a problem wherein a data frame `.show()` 
call printed non-empty results, but `.count()` printed 0. I've narrowed the 
issue to a small, reproducible test case which exhibits this aberrant behavior. 
In pyspark, run the following code:
{code:python}
from pyspark.sql.types import *

# Create a single-row data frame with an upper-case column name
parquet_pushdown_bug_df = spark.createDataFrame(
    [{"COL0": int(0)}],
    schema=StructType(fields=[StructField("COL0", IntegerType(), True)]))

# Write it under a partition directory whose column name differs only in case (col0)
parquet_pushdown_bug_df.repartition(1).write.mode("overwrite").parquet(
    "parquet_pushdown_bug/col0=0/parquet_pushdown_bug.parquet")

# Re-read the partitioned data and filter on the partition column
reread_parquet_pushdown_bug_df = spark.read.parquet("parquet_pushdown_bug")
reread_parquet_pushdown_bug_df.filter("col0 = 0").show()
print(reread_parquet_pushdown_bug_df.filter("col0 = 0").count())
{code}
In my usage, this prints a data frame with 1 row and a count of 0. However, 
disabling `spark.sql.parquet.filterPushdown` produces consistent results:
{code:python}
# Disable parquet filter pushdown and re-run the same filter
spark.conf.set("spark.sql.parquet.filterPushdown", False)
reread_parquet_pushdown_bug_df.filter("col0 = 0").show()
reread_parquet_pushdown_bug_df.filter("col0 = 0").count()
{code}
This prints the same data frame; however, the count is now 1. Note that 
enabling `spark.sql.parquet.filterPushdown` (which is enabled by default) is 
not sufficient on its own to trigger this bug. The case of the column name in 
the data frame (before writing) must also differ from the case of the 
partition column in the file path, i.e. COL0 versus col0 or col0 versus COL0.
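
For context, Spark resolves column references case-insensitively by default 
(`spark.sql.caseSensitive` is false), which is why `filter("col0 = 0")` can 
match the `COL0` data column at all. The following is a minimal pure-Python 
sketch of that resolution rule; the `resolve` function is illustrative only, 
not Spark's actual API:
{code:python}
def resolve(name, columns, case_sensitive=False):
    """Return the first column in `columns` matching `name`, or None.

    Mimics Spark's default case-insensitive analyzer behavior
    (spark.sql.caseSensitive=false) in simplified form.
    """
    for col in columns:
        match = (col == name) if case_sensitive else (col.lower() == name.lower())
        if match:
            return col
    return None

print(resolve("col0", ["COL0"]))                       # COL0
print(resolve("col0", ["COL0"], case_sensitive=True))  # None
{code}
Under case-insensitive resolution the filter matches both the data column and 
the partition column, so any inconsistency between the pushed-down parquet 
filter and the partition-pruning path can surface as the show()/count() 
mismatch above.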



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
