[ 
https://issues.apache.org/jira/browse/HIVE-21709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841569#comment-16841569
 ] 

David Lavati commented on HIVE-21709:
-------------------------------------

These steps along with an apache commiter's +1 approval on this ticket - which 
is kind of the result of the review - are needed to get it merged.

> Count with expression does not work in Parquet
> ----------------------------------------------
>
>                 Key: HIVE-21709
>                 URL: https://issues.apache.org/jira/browse/HIVE-21709
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 2.3.2
>            Reporter: Mainak Ghosh
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> For parquet file with nested schema, count with expression as column name 
> does not work when you are filtering on another column in the same struct. 
> Here are the steps to reproduce:
> {code:java}
> CREATE TABLE `test_table`( `rtb_win` struct<`impression_id`:string, 
> `pub_id`:string>) ROW FORMAT SERDE 
> 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS 
> INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
> INSERT INTO TABLE test_table SELECT named_struct('impression_id', 'cat', 
> 'pub_id', '2');
> select count(rtb_win.impression_id) from test_table where rtb_win.pub_id ='2';
> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the 
> future versions. Consider using a different execution engine (i.e. spark, 
> tez) or using Hive 1.X releases.
> +------+ 
> | _c0  |
> +------+ 
> | 0    | 
> +------+
> select count(*) from test_parquet_count_mghosh where rtb_win.pub_id ='2';
> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the 
> future versions. Consider using a different execution engine (i.e. spark, 
> tez) or using Hive 1.X releases. 
> +------+ 
> | _c0  | 
> +------+ 
> | 1    | 
> +------+{code}
> As you can see the first query returns the wrong result while the second one 
> returns the correct result.
> The issue is an column order mismatch between the actual parquet file 
> (impression_id first and pub_id second) and the Hive prunedCols datastructure 
> (reverse). As a result in the filter we compare with the wrong value and the 
> count returns 0. I have been able to identify the cause of this mismatch.
> I would love to get the code reviewed and merged. Some of the code changes 
> are changes to commits from Ferdinand Xu and Chao Sun.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to