Re: Review Request 70474: HIVE-21407: Parquet predicate pushdown is not working correctly for char column types

Peter Vary via Review Board Thu, 09 May 2019 01:07:04 -0700

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70474/#review215157
-----------------------------------------------------------



Ship it!




Ship It!

- Peter Vary


On máj. 9, 2019, 7:51 de, Marta Kuczora wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70474/
> -----------------------------------------------------------
> 
> (Updated máj. 9, 2019, 7:51 de)
> 
> 
> Review request for hive and Peter Vary.
> 
> 
> Bugs: HIVE-21407
>     https://issues.apache.org/jira/browse/HIVE-21407
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> The idea behind the patch is that for CHAR columns extend the predicate which 
> is pushed to Parquet with an “or” clause which contains the same expression 
> with a padded and a stripped value.
> Example:
> column c is a CHAR(10) type and the search expression is c='apple'
> The predicate which is pushed to Parquet looked like c='apple ' before the 
> patch and it would look like (c='apple ' or c='apple') after the patch.
> Since the value 'apple' is stored in Parquet without padding, the predicate 
> before the patch didn’t return any rows. With the patch it will return the 
> correct row. 
> Since on predicate level, there is no distinction between CHAR or VARCHAR, 
> the predicates for VARCHARs will be changed as well, so the result set 
> returned from Parquet will be wider than before.
> Example:
> A table contains a c VARCHAR(10) column and there is a row where c='apple' 
> and there is an other row where c='apple '. If the search expression is 
> c='apple ', both rows will be returned from Parquet after the patch. But 
> since Hive is doing an additional filtering on the rows returned from 
> Parquet, it won’t be a problem, the result set returned by Hive will contain 
> only the row with the value 'apple '.
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/LeafFilterFactory.java 
> be4c0d5 
>   
> ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestParquetRecordReaderWrapper.java
>  0210a0a 
>   
> ql/src/test/org/apache/hadoop/hive/ql/io/parquet/read/TestParquetFilterPredicate.java
>  d464046 
>   ql/src/test/queries/clientpositive/parquet_ppd_char.q 4230d8c 
>   ql/src/test/queries/clientpositive/parquet_ppd_char2.q PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_ppd_char2.q.out PRE-CREATION 
> 
> 
> Diff: https://reviews.apache.org/r/70474/diff/2/
> 
> 
> Testing
> -------
> 
> Added new q test for testing the PPD for char and varchar types. Also 
> extended the unit tests for the 
> ParquetFilterPredicateConverter.toFilterPredicate method.
> 
> The TestParquetRecordReaderWrapper and the TestParquetFilterPredicate are 
> both testing the same thing, the behavior of the 
> ParquetFilterPredicateConverter.toFilterPredicate method. It doesn't make 
> sense to have tests for the same use case in different test classes, so moved 
> the test cases from the TestParquetRecordReaderWrapper to 
> TestParquetFilterPredicate.
> 
> 
> Thanks,
> 
> Marta Kuczora
> 
>

Re: Review Request 70474: HIVE-21407: Parquet predicate pushdown is not working correctly for char column types

Reply via email to