Rajesh Balamohan created HIVE-27159:
---------------------------------------
             Summary: Filters are not pushed down for decimal format in Parquet
                 Key: HIVE-27159
                 URL: https://issues.apache.org/jira/browse/HIVE-27159
             Project: Hive
          Issue Type: Improvement
            Reporter: Rajesh Balamohan


Decimal filters are not created and pushed down to Parquet readers. This causes latency and unnecessary row processing during query execution: the conversion to a Parquet FilterPredicate throws an exception at runtime, so the reader ends up scanning far more rows than needed. E.g. Q13 below. (A rough, hypothetical sketch of the missing conversion is appended at the end of this description.)

Task execution summaries for Q13 on Parquet vs ORC (note INPUT_RECORDS for Map 1):

{noformat}
Parquet: (Map 1)

INFO  : Task Execution Summary
INFO  : ----------------------------------------------------------------------------------------------
INFO  :   VERTICES   DURATION(ms)   CPU_TIME(ms)   GC_TIME(ms)   INPUT_RECORDS   OUTPUT_RECORDS
INFO  : ----------------------------------------------------------------------------------------------
INFO  :      Map 1       31254.00              0             0     549,181,950              133
INFO  :      Map 3           0.00              0             0          73,049              365
INFO  :      Map 4        2027.00              0             0       6,000,000        1,689,919
INFO  :      Map 5           0.00              0             0           7,200            1,440
INFO  :      Map 6         517.00              0             0       1,920,800          493,920
INFO  :      Map 7           0.00              0             0           1,002            1,002
INFO  :  Reducer 2       18716.00              0             0             133                0
INFO  : ----------------------------------------------------------------------------------------------

ORC:

INFO  : Task Execution Summary
INFO  : ----------------------------------------------------------------------------------------------
INFO  :   VERTICES   DURATION(ms)   CPU_TIME(ms)   GC_TIME(ms)   INPUT_RECORDS   OUTPUT_RECORDS
INFO  : ----------------------------------------------------------------------------------------------
INFO  :      Map 1        6556.00              0             0     267,146,063              152
INFO  :      Map 3           0.00              0             0          10,000              365
INFO  :      Map 4        2014.00              0             0       6,000,000        1,689,919
INFO  :      Map 5           0.00              0             0           7,200            1,440
INFO  :      Map 6         504.00              0             0       1,920,800          493,920
INFO  :  Reducer 2        3159.00              0             0             152                0
INFO  : ----------------------------------------------------------------------------------------------
{noformat}

Relevant portion of the plan (Map 1), showing the decimal range predicates on ss_sales_price and ss_net_profit:

{noformat}
        Map 1
            Map Operator Tree:
                TableScan
                  alias: store_sales
                  filterExpr: (ss_hdemo_sk is not null and ss_addr_sk is not null and ss_cdemo_sk is not null and ss_store_sk is not null and ((ss_sales_price >= 100) or (ss_sales_price <= 150) or (ss_sales_price >= 50) or (ss_sales_price <= 100) or (ss_sales_price >= 150) or (ss_sales_price <= 200)) and ((ss_net_profit >= 100) or (ss_net_profit <= 200) or (ss_net_profit >= 150) or (ss_net_profit <= 300) or (ss_net_profit >= 50) or (ss_net_profit <= 250))) (type: boolean)
                  probeDecodeDetails: cacheKey:HASH_MAP_MAPJOIN_112_container, bigKeyColName:ss_hdemo_sk, smallTablePos:1, keyRatio:5.042575832290721E-6
                  Statistics: Num rows: 2750380056 Data size: 1321831086472 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: (ss_hdemo_sk is not null and ss_addr_sk is not null and ss_cdemo_sk is not null and ss_store_sk is not null and ((ss_sales_price >= 100) or (ss_sales_price <= 150) or (ss_sales_price >= 50) or (ss_sales_price <= 100) or (ss_sales_price >= 150) or (ss_sales_price <= 200)) and ((ss_net_profit >= 100) or (ss_net_profit <= 200) or (ss_net_profit >= 150) or (ss_net_profit <= 300) or (ss_net_profit >= 50) or (ss_net_profit <= 250))) (type: boolean)
                    Statistics: Num rows: 2500252205 Data size: 1201619783884 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: ss_cdemo_sk (type: bigint), ss_hdemo_sk (type: bigint), ss_addr_sk (type: bigint), ss_store_sk (type: bigint), ss_quantity (type: int), ss_ext_sales_price (type: decimal(7,2)), ss_ext_wholesale_cost (type: decimal(7,2)), ss_sold_date_sk (type: bigint), ss_net_profit BETWEEN 100 AND 200 (type: boolean), ss_net_profit BETWEEN 150 AND 300 (type: boolean), ss_net_profit BETWEEN 50 AND 250 (type: boolean), ss_sales_price BETWEEN 100 AND 150 (type: boolean), ss_sales_price BETWEEN 50 AND 100 (type: boolean), ss_sales_price BETWEEN 150 AND 200 (type: boolean)
                      outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13
                      Statistics: Num rows: 2500252205 Data size: 714761816164 Basic stats: COMPLETE Column stats: COMPLETE
                      Map Join Operator
                        condition map:
                             Inner Join 0 to 1
                        keys:
                          0 _col7 (type: bigint)
                          1 _col0 (type: bigint)
                        outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col8, _col9, _col10, _col11, _col12, _col13
                        input vertices:
                          1 Map 3
                        Statistics: Num rows: 502508168 Data size: 127400492016 Basic stats: COMPLETE Column stats: COMPLETE
                        Map Join Operator
                          condition map:
                               Inner Join 0 to 1
                          keys:
                            0 _col2 (type: bigint)
                            1 _col0 (type: bigint)
                          outputColumnNames: _col0, _col1, _col3, _col4, _col5, _col6, _col8, _col9, _col10, _col11, _col12, _col13, _col16, _col17, _col18
                          input vertices:
                            1 Map 4
                          Statistics: Num rows: 86972608 Data size: 10207471112 Basic stats: COMPLETE Column stats: COMPLETE
                          Filter Operator
                            predicate: ((_col16 and _col8) or (_col17 and _col9) or (_col18 and _col10)) (type: boolean)
                            Statistics: Num rows: 65229456 Data size: 7655603392 Basic stats: COMPLETE Column stats: COMPLETE
                            Map Join Operator
                              condition map:
                                   Inner Join 0 to 1
                              keys:
                                0 _col1 (type: bigint)
                                1 _col0 (type: bigint)
                              outputColumnNames: _col0, _col3, _col4, _col5, _col6, _col11, _col12, _col13, _col20, _col21
                              input vertices:
                                1 Map 5
                              Statistics: Num rows: 13045892 Data size: 260918084 Basic stats: COMPLETE Column stats: COMPLETE
                              Map Join Operator
                                condition map:
                                     Inner Join 0 to 1
                                keys:
                                  0 _col0 (type: bigint)
                                  1 _col0 (type: bigint)
                                outputColumnNames: _col3, _col4, _col5, _col6, _col11, _col12, _col13, _col20, _col21, _col23, _col24, _col25, _col26, _col27, _col28
                                input vertices:
                                  1 Map 6
                                Statistics: Num rows: 3354659 Data size: 147605232 Basic stats: COMPLETE Column stats: COMPLETE
                                Filter Operator
                                  predicate: ((_col23 and _col24 and _col11 and _col20) or (_col25 and _col26 and _col12 and _col21) or (_col27 and _col28 and _col13 and _col21)) (type: boolean)
                                  Statistics: Num rows: 628998 Data size: 27676148 Basic stats: COMPLETE Column stats: COMPLETE
                                  Map Join Operator
                                    condition map:
                                         Inner Join 0 to 1
                                    keys:
                                      0 _col3 (type: bigint)
                                      1 _col0 (type: bigint)
                                    outputColumnNames: _col4, _col5, _col6
                                    input vertices:
                                      1 Map 7
                                    Statistics: Num rows: 628998 Data size: 228 Basic stats: COMPLETE Column stats: COMPLETE
                                    Group By Operator
                                      aggregations: sum(_col4), count(_col4), sum(_col5), count(_col5), sum(_col6), count(_col6)
                                      minReductionHashAggr: 0.99
                                      mode: hash
                                      outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
                                      Statistics: Num rows: 1 Data size: 256 Basic stats: COMPLETE Column stats: COMPLETE
                                      Reduce Output Operator
                                        null sort order:
                                        sort order:
                                        Statistics: Num rows: 1 Data size: 256 Basic stats: COMPLETE Column stats: COMPLETE
                                        value expressions: _col0 (type: bigint), _col1 (type: bigint), _col2 (type: decimal(17,2)), _col3 (type: bigint), _col4 (type: decimal(17,2)), _col5 (type: bigint)
{noformat}

Stack:

{noformat}
fail to build predicate filter leaf with errors org.apache.hadoop.hive.ql.metadata.HiveException: Conversion to Parquet FilterPredicate not supported for DECIMAL
org.apache.hadoop.hive.ql.metadata.HiveException: Conversion to Parquet FilterPredicate not supported for DECIMAL
	at org.apache.hadoop.hive.ql.io.parquet.LeafFilterFactory.getLeafFilterBuilderByType(LeafFilterFactory.java:210)
	at org.apache.hadoop.hive.ql.io.parquet.read.ParquetFilterPredicateConverter.buildFilterPredicateFromPredicateLeaf(ParquetFilterPredicateConverter.java:130)
	at org.apache.hadoop.hive.ql.io.parquet.read.ParquetFilterPredicateConverter.translate(ParquetFilterPredicateConverter.java:111)
	at org.apache.hadoop.hive.ql.io.parquet.read.ParquetFilterPredicateConverter.translate(ParquetFilterPredicateConverter.java:97)
	at org.apache.hadoop.hive.ql.io.parquet.read.ParquetFilterPredicateConverter.translate(ParquetFilterPredicateConverter.java:71)
	at org.apache.hadoop.hive.ql.io.parquet.read.ParquetFilterPredicateConverter.translate(ParquetFilterPredicateConverter.java:88)
	at org.apache.hadoop.hive.ql.io.parquet.read.ParquetFilterPredicateConverter.toFilterPredicate(ParquetFilterPredicateConverter.java:57)
	at org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.setFilter(ParquetRecordReaderBase.java:202)
	at org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.getSplit(ParquetRecordReaderBase.java:139)
	at org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.setupMetadataAndParquetSplit(ParquetRecordReaderBase.java:88)
	at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedParquetRecordReader.<init>(VectorizedParquetRecordReader.java:178)
	at org.apache.hadoop.hive.ql.io.parquet.VectorizedParquetInputFormat.getRecordReader(VectorizedParquetInputFormat.java:52)
	at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:87)
	at org.apache.hadoop.hive.ql.io.RecordReaderWrapper.create(RecordReaderWrapper.java:72)
	at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:460)
	at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:203)
	at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(TezGroupedSplitsInputFormat.java:152)
	at org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:116)
	at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:68)
	at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:437)
	at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:297)
	at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:280)
	at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
	at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:84)
	at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:70)
	at java.base/java.security.AccessController.doPrivileged(Native Method)
	at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
	at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:70)
	at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:40)
	at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
	at org.apache.hadoop.hive.llap.daemon.impl.StatsRecordingThreadPool$WrappedCallable.call(StatsRecordingThreadPool.java:118)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
{noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
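
For illustration only (this is not a patch and not existing Hive code): a minimal sketch of how a decimal leaf such as {{ss_sales_price >= 100}} could be expressed as a Parquet FilterPredicate, assuming the column was written as a FIXED_LEN_BYTE_ARRAY holding the unscaled value as big-endian two's-complement bytes. The class {{DecimalPushdownSketch}} and helper {{encodeDecimal}} are hypothetical names; only {{FilterApi}} and {{Binary}} are real Parquet classes. Byte-wise comparison of this encoding is not a correct decimal ordering for negative values, and decimals may also be stored as INT32/INT64/BINARY, so a real conversion in LeafFilterFactory / ParquetFilterPredicateConverter needs more care than shown here.

{noformat}
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.Arrays;

import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.io.api.Binary;

/**
 * Hypothetical sketch (not Hive code): build a Parquet FilterPredicate for a
 * decimal(7,2) column stored as a 4-byte FIXED_LEN_BYTE_ARRAY of unscaled
 * big-endian two's-complement bytes.
 */
public class DecimalPushdownSketch {

  /**
   * Encode a decimal literal the same way the writer encoded the column.
   * Assumes the value fits into typeLength bytes at the given scale.
   */
  static Binary encodeDecimal(BigDecimal value, int scale, int typeLength) {
    byte[] unscaled = value.setScale(scale, RoundingMode.UNNECESSARY)
        .unscaledValue().toByteArray();
    byte[] buf = new byte[typeLength];
    // Sign-extend into the leading pad bytes so the fixed length is filled.
    byte pad = (byte) (unscaled[0] < 0 ? 0xFF : 0x00);
    Arrays.fill(buf, 0, typeLength - unscaled.length, pad);
    System.arraycopy(unscaled, 0, buf, typeLength - unscaled.length, unscaled.length);
    return Binary.fromConstantByteArray(buf);
  }

  public static void main(String[] args) {
    // ss_sales_price >= 100 on a decimal(7,2) column (4-byte fixed-length binary).
    FilterPredicate gtEq100 = FilterApi.gtEq(
        FilterApi.binaryColumn("ss_sales_price"),
        encodeDecimal(new BigDecimal("100"), 2, 4));

    // Leaves combine like any other type, e.g. ss_sales_price BETWEEN 100 AND 150.
    FilterPredicate between = FilterApi.and(
        gtEq100,
        FilterApi.ltEq(FilterApi.binaryColumn("ss_sales_price"),
            encodeDecimal(new BigDecimal("150"), 2, 4)));

    // Prints the assembled predicate; applying it correctly also requires the
    // reader to use a comparator that matches the decimal encoding.
    System.out.println(between);
  }
}
{noformat}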