ningsh7 created HIVE-29691:
------------------------------
Summary: Document wrong-result risk for Parquet predicate pushdown
when files contain partition columns
Key: HIVE-29691
URL: https://issues.apache.org/jira/browse/HIVE-29691
Project: Hive
Issue Type: Improvement
Components: Documentation
Affects Versions: 4.0.0-alpha-2, 3.1.3, 3.1.2, 3.1.1, 3.0.0
Environment: Apache Hive 4.0.0-alpha-2 and older Hive versions before
the HIVE-21599 fix.
The issue is relevant to partitioned Parquet tables where the underlying
Parquet files physically contain columns with the same names as Hive partition
columns, and storage-level predicate pushdown is enabled.
Reporter: ningsh7
h3. Description
The current documentation for {{hive.optimize.ppd.storage}} explains that the
option controls whether predicates are pushed down into storage handlers, and
that it is ignored when {{hive.optimize.ppd}} is false.
However, it does not mention a historical Parquet-specific caveat that can
affect users running older Hive versions or downstream Hive 3.x distributions
where the HIVE-21599 fix may not have been backported.
In older versions, for partitioned Parquet tables, if the underlying Parquet
files physically contain columns with the same names as Hive partition columns,
predicates on those partition columns may be converted into Parquet
{{{}FilterPredicate{}}}s and evaluated against the file-level schema. This can
produce wrong results when the file-level values differ from the Hive Metastore
partition values.
This was addressed by HIVE-21599, which excludes partition columns from the
schema used to build Parquet {{{}FilterPredicate{}}}s.
h3. Problem scenario
A partitioned Parquet table is created in Hive:
{{CREATE EXTERNAL TABLE test_table3 (
id string
)
PARTITIONED BY (`date` string)
STORED AS parquet;}}
The underlying Parquet files are written by an external engine and physically
contain both {{id}} and {{date}} columns, while the files are also placed under
a Hive partition location such as:
{{.../test_table3/date=05172021/}}
The partition is registered in Hive Metastore:
{{ALTER TABLE test_table3
ADD PARTITION (`date`='05172021')
LOCATION '.../test_table3/date=05172021';}}
If the Parquet file contains rows whose physical {{date}} column has values
different from the Hive partition value, a query such as:
{{SELECT * FROM test_table3 WHERE `date`='05172021';}}
may return different results depending on whether storage-level predicate
pushdown is enabled.
With storage-level predicate pushdown enabled, the predicate on the Hive
partition column may be converted into a Parquet {{FilterPredicate}} and
evaluated against the physical Parquet column. This may filter out rows before
Hive applies its partition semantics.
With storage-level predicate pushdown disabled, Hive reads the partition
location and treats all rows in that location as belonging to the registered
partition value.
h3. Why this should be documented
This is a silent wrong-result risk for users on affected versions. Although the
root cause is an unsafe data layout, the behavior is exposed through
{{{}hive.optimize.ppd.storage{}}}, which makes it difficult for users to
diagnose.
Documenting this caveat would help users on older Hive versions or downstream
distributions understand the issue and apply one of the safe mitigations:
* avoid writing Hive partition columns into Parquet files;
* upgrade to a Hive version containing the HIVE-21599 fix;
* disable {{hive.optimize.ppd.storage}} as a workaround on affected versions.
h3. Related work
* HIVE-21599: Wrong results for partitioned Parquet table when files contain
partition column
* Commit: {{eb57ac9a0aef456f25b559a4ac225ac9ebf40508}}
* Pull request: apache/hive#3742
* HIVE-25244
--
This message was sent by Atlassian Jira
(v8.20.10#820010)