ningsh7 created HIVE-29691:
------------------------------

             Summary: Document wrong-result risk for Parquet predicate pushdown 
when files contain partition columns
                 Key: HIVE-29691
                 URL: https://issues.apache.org/jira/browse/HIVE-29691
             Project: Hive
          Issue Type: Improvement
          Components: Documentation
    Affects Versions: 4.0.0-alpha-2, 3.1.3, 3.1.2, 3.1.1, 3.0.0
         Environment: Apache Hive 4.0.0-alpha-2 and older Hive versions before 
the HIVE-21599 fix.

The issue is relevant to partitioned Parquet tables where the underlying 
Parquet files physically contain columns with the same names as Hive partition 
columns, and storage-level predicate pushdown is enabled.
            Reporter: ningsh7


h3. Description

The current documentation for {{hive.optimize.ppd.storage}} explains that the 
option controls whether predicates are pushed down into storage handlers, and 
that it is ignored when {{hive.optimize.ppd}} is false.

However, it does not mention a historical Parquet-specific caveat that can 
affect users running older Hive versions or downstream Hive 3.x distributions 
where the HIVE-21599 fix may not have been backported.

In older versions, for partitioned Parquet tables, if the underlying Parquet 
files physically contain columns with the same names as Hive partition columns, 
predicates on those partition columns may be converted into Parquet 
{{{}FilterPredicate{}}}s and evaluated against the file-level schema. This can 
produce wrong results when the file-level values differ from the Hive Metastore 
partition values.

This was addressed by HIVE-21599, which excludes partition columns from the 
schema used to build Parquet {{{}FilterPredicate{}}}s.
h3. Problem scenario

A partitioned Parquet table is created in Hive:

{{CREATE EXTERNAL TABLE test_table3 (
  id string
)
PARTITIONED BY (`date` string)
STORED AS parquet;}}

The underlying Parquet files are written by an external engine and physically 
contain both {{id}} and {{date}} columns, while the files are also placed under 
a Hive partition location such as:

{{.../test_table3/date=05172021/}}

The partition is registered in Hive Metastore:

{{ALTER TABLE test_table3
ADD PARTITION (`date`='05172021')
LOCATION '.../test_table3/date=05172021';}}

If the Parquet file contains rows whose physical {{date}} column has values 
different from the Hive partition value, a query such as:

{{SELECT * FROM test_table3 WHERE `date`='05172021';}}

may return different results depending on whether storage-level predicate 
pushdown is enabled.

With storage-level predicate pushdown enabled, the predicate on the Hive 
partition column may be converted into a Parquet {{FilterPredicate}} and 
evaluated against the physical Parquet column. This may filter out rows before 
Hive applies its partition semantics.

With storage-level predicate pushdown disabled, Hive reads the partition 
location and treats all rows in that location as belonging to the registered 
partition value.
h3. Why this should be documented

This is a silent wrong-result risk for users on affected versions. Although the 
root cause is an unsafe data layout, the behavior is exposed through 
{{{}hive.optimize.ppd.storage{}}}, which makes it difficult for users to 
diagnose.

Documenting this caveat would help users on older Hive versions or downstream 
distributions understand the issue and apply one of the safe mitigations:
 * avoid writing Hive partition columns into Parquet files;

 * upgrade to a Hive version containing the HIVE-21599 fix;

 * disable {{hive.optimize.ppd.storage}} as a workaround on affected versions.

h3. Related work
 * HIVE-21599: Wrong results for partitioned Parquet table when files contain 
partition column

 * Commit: {{eb57ac9a0aef456f25b559a4ac225ac9ebf40508}}

 * Pull request: apache/hive#3742

 * HIVE-25244



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to