[jira] [Commented] (SPARK-39833) Filtered parquet data frame count() and show() produce inconsistent results when spark.sql.parquet.filterPushdown is true

Ivan Sadikov (Jira) Thu, 04 Aug 2022 18:48:06 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-39833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17575510#comment-17575510
 ]


Ivan Sadikov commented on SPARK-39833:
--------------------------------------

This appears to be a bug in parquet-mr.

ParquetFileReader has a condition that determines the total number of records 
to expect when predicate pushdown is applied. Depending on whether the column 
index feature is enabled, it returns either all filtered rows from a row group 
or only the matching row ranges. The bug is that column paths are checked 
against an empty set, which is created when the projection is empty. In that 
case, the reader should return all rows in the row group instead of failing 
the condition and returning 0 records.
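
The condition described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the parquet-mr source: the function names `expected_rows_buggy` and `expected_rows_fixed` and their parameters are hypothetical, and the real logic lives in Java inside ParquetFileReader.

```python
from typing import AbstractSet


def expected_rows_buggy(row_group_rows: int, matching_rows: int,
                        filter_column: str,
                        projected_paths: AbstractSet[str]) -> int:
    """Toy model of the faulty check (hypothetical, not parquet-mr code).

    A filter column that is missing from the projected column paths fails
    the condition, so an empty projection always yields 0 records.
    """
    if filter_column not in projected_paths:
        return 0
    return matching_rows


def expected_rows_fixed(row_group_rows: int, matching_rows: int,
                        filter_column: str,
                        projected_paths: AbstractSet[str]) -> int:
    """Toy model of the behaviour suggested in the comment (hypothetical).

    With an empty projection there are no column paths to check against,
    so fall back to every row in the row group.
    """
    if not projected_paths:
        return row_group_rows
    if filter_column not in projected_paths:
        return 0
    return matching_rows


# count() needs no columns, so the projection is empty.
print(expected_rows_buggy(1, 1, "COL0", frozenset()))
print(expected_rows_fixed(1, 1, "COL0", frozenset()))
```

In this model, the empty projection stands in for the `count()` call in the report, which needs no column values, whereas `show()` projects the column and is not affected by the empty-set check.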

> Filtered parquet data frame count() and show() produce inconsistent results 
> when spark.sql.parquet.filterPushdown is true
> -------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-39833
>                 URL: https://issues.apache.org/jira/browse/SPARK-39833
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.1
>            Reporter: Michael Allman
>            Priority: Major
>              Labels: correctness
>
> One of our data scientists discovered a problem wherein a data frame 
> `.show()` call printed non-empty results, but `.count()` printed 0. I've 
> narrowed the issue to a small, reproducible test case which exhibits this 
> aberrant behavior. In pyspark, run the following code:
> {code:python}
> from pyspark.sql.types import IntegerType, StructField, StructType
> # The data frame column is "COL0" (upper case); the partition directory
> # below is "col0=0" (lower case).
> parquet_pushdown_bug_df = spark.createDataFrame(
>     [{"COL0": 0}],
>     schema=StructType(fields=[StructField("COL0", IntegerType(), True)]),
> )
> parquet_pushdown_bug_df.repartition(1).write.mode("overwrite").parquet("parquet_pushdown_bug/col0=0/parquet_pushdown_bug.parquet")
> # Read from the table root so that col0 is discovered as a partition column.
> reread_parquet_pushdown_bug_df = spark.read.parquet("parquet_pushdown_bug")
> reread_parquet_pushdown_bug_df.filter("col0 = 0").show()
> print(reread_parquet_pushdown_bug_df.filter("col0 = 0").count())
> {code}
> In my usage, this prints a data frame with 1 row and a count of 0. However, 
> disabling `spark.sql.parquet.filterPushdown` produces consistent results:
> {code:python}
> spark.conf.set("spark.sql.parquet.filterPushdown", False)
> reread_parquet_pushdown_bug_df.filter("col0 = 0").show()
> print(reread_parquet_pushdown_bug_df.filter("col0 = 0").count())
> {code}
> This prints the same data frame, but with a count of 1. Enabling 
> `spark.sql.parquet.filterPushdown` (which is enabled by default) is not 
> enough on its own to trigger this bug. The case of the column in the data 
> frame (before writing) must also differ from the case of the partition 
> column in the file path, i.e. COL0 versus col0 or col0 versus COL0.
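
The triggering condition described in the report, a data column and a partition directory whose names differ only in case, can be checked ahead of time. The following is a minimal sketch in plain Python; the helper `partition_case_mismatches` is hypothetical and not part of Spark, and it assumes partition directories follow the `name=value` layout used in the report.

```python
from typing import List, Tuple


def partition_case_mismatches(data_columns: List[str],
                              path: str) -> List[Tuple[str, str]]:
    """Return (data_column, partition_column) pairs that are equal when
    compared case-insensitively but are spelled with different case."""
    # Partition columns are the "name" part of each "name=value" segment.
    partition_columns = [
        segment.split("=", 1)[0]
        for segment in path.split("/")
        if "=" in segment
    ]
    return [
        (data_col, part_col)
        for data_col in data_columns
        for part_col in partition_columns
        if data_col.lower() == part_col.lower() and data_col != part_col
    ]


# The column and path from the report: "COL0" in the data, "col0" in the path.
print(partition_case_mismatches(
    ["COL0"],
    "parquet_pushdown_bug/col0=0/parquet_pushdown_bug.parquet",
))
# prints [('COL0', 'col0')]
```

An empty result means that no data column collides with a partition column by case alone, so this particular trigger is absent for that path.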



