[ 
https://issues.apache.org/jira/browse/HIVE-29691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ningsh7 updated HIVE-29691:
---------------------------
    Description: 
h3. Description

The current documentation for {{hive.optimize.ppd.storage}} explains that the 
option controls whether predicates are pushed down into storage handlers, and 
that it is ignored when {{hive.optimize.ppd}} is false.

However, it does not mention a historical Parquet-specific caveat that can 
affect users running older Hive versions or downstream Hive 3.x distributions 
where the HIVE-21599 fix may not have been backported.

In older versions, for partitioned Parquet tables, if the underlying Parquet 
files physically contain columns with the same names as Hive partition columns, 
predicates on those partition columns may be converted into Parquet 
{{{}FilterPredicate{}}}s and evaluated against the file-level schema. This can 
produce wrong results when the file-level values differ from the Hive Metastore 
partition values.

This was addressed by HIVE-21599, which excludes partition columns from the 
schema used to build Parquet {{{}FilterPredicate{}}}s.
h3. Problem scenario

A partitioned Parquet table is created in Hive:
{code:java}
CREATE EXTERNAL TABLE test_table3 (
id string
)
PARTITIONED BY (`date` string)
STORED AS parquet;{code}
The underlying Parquet files are written by an external engine and physically 
contain both {{id}} and {{date}} columns, while the files are also placed under 
a Hive partition location such as:
{code:java}
.../test_table3/date=05172021/{code}
The partition is registered in Hive Metastore:
{code:java}
ALTER TABLE test_table3
ADD PARTITION (`date`='05172021')
LOCATION '.../test_table3/date=05172021';{code}
If the Parquet file contains rows whose physical {{date}} column has values 
different from the Hive partition value, a query such as:
{code:java}
SELECT * FROM test_table3 WHERE `date`='05172021';{code}
may return different results depending on whether storage-level predicate 
pushdown is enabled.

With storage-level predicate pushdown enabled, the predicate on the Hive 
partition column may be converted into a Parquet {{FilterPredicate}} and 
evaluated against the physical Parquet column. This may filter out rows before 
Hive applies its partition semantics.

With storage-level predicate pushdown disabled, Hive reads the partition 
location and treats all rows in that location as belonging to the registered 
partition value.
h3. Why this should be documented

This is a silent wrong-result risk for users on affected versions. Although the 
root cause is an unsafe data layout, the behavior is exposed through 
{{{}hive.optimize.ppd.storage{}}}, which makes it difficult for users to 
diagnose.

Documenting this caveat would help users on older Hive versions or downstream 
distributions understand the issue and apply one of the safe mitigations:
 * avoid writing Hive partition columns into Parquet files;

 * upgrade to a Hive version containing the HIVE-21599 fix;

 * disable {{hive.optimize.ppd.storage}} as a workaround on affected versions.

h3. Related work
 * HIVE-21599: Wrong results for partitioned Parquet table when files contain 
partition column

 * Commit: {{eb57ac9a0aef456f25b559a4ac225ac9ebf40508}}

 * Pull request: apache/hive#3742

 * HIVE-25244

  was:
h3. Description

The current documentation for {{hive.optimize.ppd.storage}} explains that the 
option controls whether predicates are pushed down into storage handlers, and 
that it is ignored when {{hive.optimize.ppd}} is false.

However, it does not mention a historical Parquet-specific caveat that can 
affect users running older Hive versions or downstream Hive 3.x distributions 
where the HIVE-21599 fix may not have been backported.

In older versions, for partitioned Parquet tables, if the underlying Parquet 
files physically contain columns with the same names as Hive partition columns, 
predicates on those partition columns may be converted into Parquet 
{{{}FilterPredicate{}}}s and evaluated against the file-level schema. This can 
produce wrong results when the file-level values differ from the Hive Metastore 
partition values.

This was addressed by HIVE-21599, which excludes partition columns from the 
schema used to build Parquet {{{}FilterPredicate{}}}s.
h3. Problem scenario

A partitioned Parquet table is created in Hive:

{{CREATE EXTERNAL TABLE test_table3 (
  id string
)
PARTITIONED BY (`date` string)
STORED AS parquet;}}

The underlying Parquet files are written by an external engine and physically 
contain both {{id}} and {{date}} columns, while the files are also placed under 
a Hive partition location such as:

{{.../test_table3/date=05172021/}}

The partition is registered in Hive Metastore:

{{ALTER TABLE test_table3
ADD PARTITION (`date`='05172021')
LOCATION '.../test_table3/date=05172021';}}

If the Parquet file contains rows whose physical {{date}} column has values 
different from the Hive partition value, a query such as:

{{SELECT * FROM test_table3 WHERE `date`='05172021';}}

may return different results depending on whether storage-level predicate 
pushdown is enabled.

With storage-level predicate pushdown enabled, the predicate on the Hive 
partition column may be converted into a Parquet {{FilterPredicate}} and 
evaluated against the physical Parquet column. This may filter out rows before 
Hive applies its partition semantics.

With storage-level predicate pushdown disabled, Hive reads the partition 
location and treats all rows in that location as belonging to the registered 
partition value.
h3. Why this should be documented

This is a silent wrong-result risk for users on affected versions. Although the 
root cause is an unsafe data layout, the behavior is exposed through 
{{{}hive.optimize.ppd.storage{}}}, which makes it difficult for users to 
diagnose.

Documenting this caveat would help users on older Hive versions or downstream 
distributions understand the issue and apply one of the safe mitigations:
 * avoid writing Hive partition columns into Parquet files;

 * upgrade to a Hive version containing the HIVE-21599 fix;

 * disable {{hive.optimize.ppd.storage}} as a workaround on affected versions.

h3. Related work
 * HIVE-21599: Wrong results for partitioned Parquet table when files contain 
partition column

 * Commit: {{eb57ac9a0aef456f25b559a4ac225ac9ebf40508}}

 * Pull request: apache/hive#3742

 * HIVE-25244


> Document wrong-result risk for Parquet predicate pushdown when files contain 
> partition columns
> ----------------------------------------------------------------------------------------------
>
>                 Key: HIVE-29691
>                 URL: https://issues.apache.org/jira/browse/HIVE-29691
>             Project: Hive
>          Issue Type: Improvement
>          Components: Documentation
>    Affects Versions: 3.0.0, 3.1.1, 3.1.2, 3.1.3, 4.0.0-alpha-2
>         Environment: Apache Hive 4.0.0-alpha-2 and older Hive versions before 
> the HIVE-21599 fix.
> The issue is relevant to partitioned Parquet tables where the underlying 
> Parquet files physically contain columns with the same names as Hive 
> partition columns, and storage-level predicate pushdown is enabled.
>            Reporter: ningsh7
>            Priority: Minor
>
> h3. Description
> The current documentation for {{hive.optimize.ppd.storage}} explains that the 
> option controls whether predicates are pushed down into storage handlers, and 
> that it is ignored when {{hive.optimize.ppd}} is false.
> However, it does not mention a historical Parquet-specific caveat that can 
> affect users running older Hive versions or downstream Hive 3.x distributions 
> where the HIVE-21599 fix may not have been backported.
> In older versions, for partitioned Parquet tables, if the underlying Parquet 
> files physically contain columns with the same names as Hive partition 
> columns, predicates on those partition columns may be converted into Parquet 
> {{{}FilterPredicate{}}}s and evaluated against the file-level schema. This 
> can produce wrong results when the file-level values differ from the Hive 
> Metastore partition values.
> This was addressed by HIVE-21599, which excludes partition columns from the 
> schema used to build Parquet {{{}FilterPredicate{}}}s.
> h3. Problem scenario
> A partitioned Parquet table is created in Hive:
> {code:java}
> CREATE EXTERNAL TABLE test_table3 (
> id string
> )
> PARTITIONED BY (`date` string)
> STORED AS parquet;{code}
> The underlying Parquet files are written by an external engine and physically 
> contain both {{id}} and {{date}} columns, while the files are also placed 
> under a Hive partition location such as:
> {code:java}
> .../test_table3/date=05172021/{code}
> The partition is registered in Hive Metastore:
> {code:java}
> ALTER TABLE test_table3
> ADD PARTITION (`date`='05172021')
> LOCATION '.../test_table3/date=05172021';{code}
> If the Parquet file contains rows whose physical {{date}} column has values 
> different from the Hive partition value, a query such as:
> {code:java}
> SELECT * FROM test_table3 WHERE `date`='05172021';{code}
> may return different results depending on whether storage-level predicate 
> pushdown is enabled.
> With storage-level predicate pushdown enabled, the predicate on the Hive 
> partition column may be converted into a Parquet {{FilterPredicate}} and 
> evaluated against the physical Parquet column. This may filter out rows 
> before Hive applies its partition semantics.
> With storage-level predicate pushdown disabled, Hive reads the partition 
> location and treats all rows in that location as belonging to the registered 
> partition value.
> h3. Why this should be documented
> This is a silent wrong-result risk for users on affected versions. Although 
> the root cause is an unsafe data layout, the behavior is exposed through 
> {{{}hive.optimize.ppd.storage{}}}, which makes it difficult for users to 
> diagnose.
> Documenting this caveat would help users on older Hive versions or downstream 
> distributions understand the issue and apply one of the safe mitigations:
>  * avoid writing Hive partition columns into Parquet files;
>  * upgrade to a Hive version containing the HIVE-21599 fix;
>  * disable {{hive.optimize.ppd.storage}} as a workaround on affected versions.
> h3. Related work
>  * HIVE-21599: Wrong results for partitioned Parquet table when files contain 
> partition column
>  * Commit: {{eb57ac9a0aef456f25b559a4ac225ac9ebf40508}}
>  * Pull request: apache/hive#3742
>  * HIVE-25244



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to