Rajesh Balamohan created HIVE-27159:
---------------------------------------
Summary: Filters are not pushed down for decimal format in Parquet
Key: HIVE-27159
URL: https://issues.apache.org/jira/browse/HIVE-27159
Project: Hive
Issue Type: Improvement
Reporter: Rajesh Balamohan
Decimal filters are not created and pushed down to Parquet readers. The reader fails to build the Parquet FilterPredicate at runtime (stack trace below), so the scan proceeds without the filter and processes far more rows than needed, adding latency to query execution.
E.g. Q13:
{noformat}
Parquet: (Map 1)
INFO : Task Execution Summary
INFO : ----------------------------------------------------------------------------------------------
INFO :   VERTICES      DURATION(ms)   CPU_TIME(ms)   GC_TIME(ms)   INPUT_RECORDS   OUTPUT_RECORDS
INFO : ----------------------------------------------------------------------------------------------
INFO :      Map 1          31254.00              0             0     549,181,950              133
INFO :      Map 3              0.00              0             0          73,049              365
INFO :      Map 4           2027.00              0             0       6,000,000        1,689,919
INFO :      Map 5              0.00              0             0           7,200            1,440
INFO :      Map 6            517.00              0             0       1,920,800          493,920
INFO :      Map 7              0.00              0             0           1,002            1,002
INFO :  Reducer 2          18716.00              0             0             133                0
INFO : ----------------------------------------------------------------------------------------------

ORC:
INFO : Task Execution Summary
INFO : ----------------------------------------------------------------------------------------------
INFO :   VERTICES      DURATION(ms)   CPU_TIME(ms)   GC_TIME(ms)   INPUT_RECORDS   OUTPUT_RECORDS
INFO : ----------------------------------------------------------------------------------------------
INFO :      Map 1           6556.00              0             0     267,146,063              152
INFO :      Map 3              0.00              0             0          10,000              365
INFO :      Map 4           2014.00              0             0       6,000,000        1,689,919
INFO :      Map 5              0.00              0             0           7,200            1,440
INFO :      Map 6            504.00              0             0       1,920,800          493,920
INFO :  Reducer 2           3159.00              0             0             152                0
INFO : ----------------------------------------------------------------------------------------------
{noformat}
{noformat}
Map 1
    Map Operator Tree:
        TableScan
          alias: store_sales
          filterExpr: (ss_hdemo_sk is not null and ss_addr_sk is not null and ss_cdemo_sk is not null and ss_store_sk is not null and ((ss_sales_price >= 100) or (ss_sales_price <= 150) or (ss_sales_price >= 50) or (ss_sales_price <= 100) or (ss_sales_price >= 150) or (ss_sales_price <= 200)) and ((ss_net_profit >= 100) or (ss_net_profit <= 200) or (ss_net_profit >= 150) or (ss_net_profit <= 300) or (ss_net_profit >= 50) or (ss_net_profit <= 250))) (type: boolean)
          probeDecodeDetails: cacheKey:HASH_MAP_MAPJOIN_112_container, bigKeyColName:ss_hdemo_sk, smallTablePos:1, keyRatio:5.042575832290721E-6
          Statistics: Num rows: 2750380056 Data size: 1321831086472 Basic stats: COMPLETE Column stats: COMPLETE
          Filter Operator
            predicate: (ss_hdemo_sk is not null and ss_addr_sk is not null and ss_cdemo_sk is not null and ss_store_sk is not null and ((ss_sales_price >= 100) or (ss_sales_price <= 150) or (ss_sales_price >= 50) or (ss_sales_price <= 100) or (ss_sales_price >= 150) or (ss_sales_price <= 200)) and ((ss_net_profit >= 100) or (ss_net_profit <= 200) or (ss_net_profit >= 150) or (ss_net_profit <= 300) or (ss_net_profit >= 50) or (ss_net_profit <= 250))) (type: boolean)
            Statistics: Num rows: 2500252205 Data size: 1201619783884 Basic stats: COMPLETE Column stats: COMPLETE
            Select Operator
              expressions: ss_cdemo_sk (type: bigint), ss_hdemo_sk (type: bigint), ss_addr_sk (type: bigint), ss_store_sk (type: bigint), ss_quantity (type: int), ss_ext_sales_price (type: decimal(7,2)), ss_ext_wholesale_cost (type: decimal(7,2)), ss_sold_date_sk (type: bigint), ss_net_profit BETWEEN 100 AND 200 (type: boolean), ss_net_profit BETWEEN 150 AND 300 (type: boolean), ss_net_profit BETWEEN 50 AND 250 (type: boolean), ss_sales_price BETWEEN 100 AND 150 (type: boolean), ss_sales_price BETWEEN 50 AND 100 (type: boolean), ss_sales_price BETWEEN 150 AND 200 (type: boolean)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13
              Statistics: Num rows: 2500252205 Data size: 714761816164 Basic stats: COMPLETE Column stats: COMPLETE
              Map Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 _col7 (type: bigint)
                  1 _col0 (type: bigint)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col8, _col9, _col10, _col11, _col12, _col13
                input vertices:
                  1 Map 3
                Statistics: Num rows: 502508168 Data size: 127400492016 Basic stats: COMPLETE Column stats: COMPLETE
                Map Join Operator
                  condition map:
                       Inner Join 0 to 1
                  keys:
                    0 _col2 (type: bigint)
                    1 _col0 (type: bigint)
                  outputColumnNames: _col0, _col1, _col3, _col4, _col5, _col6, _col8, _col9, _col10, _col11, _col12, _col13, _col16, _col17, _col18
                  input vertices:
                    1 Map 4
                  Statistics: Num rows: 86972608 Data size: 10207471112 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: ((_col16 and _col8) or (_col17 and _col9) or (_col18 and _col10)) (type: boolean)
                    Statistics: Num rows: 65229456 Data size: 7655603392 Basic stats: COMPLETE Column stats: COMPLETE
                    Map Join Operator
                      condition map:
                           Inner Join 0 to 1
                      keys:
                        0 _col1 (type: bigint)
                        1 _col0 (type: bigint)
                      outputColumnNames: _col0, _col3, _col4, _col5, _col6, _col11, _col12, _col13, _col20, _col21
                      input vertices:
                        1 Map 5
                      Statistics: Num rows: 13045892 Data size: 260918084 Basic stats: COMPLETE Column stats: COMPLETE
                      Map Join Operator
                        condition map:
                             Inner Join 0 to 1
                        keys:
                          0 _col0 (type: bigint)
                          1 _col0 (type: bigint)
                        outputColumnNames: _col3, _col4, _col5, _col6, _col11, _col12, _col13, _col20, _col21, _col23, _col24, _col25, _col26, _col27, _col28
                        input vertices:
                          1 Map 6
                        Statistics: Num rows: 3354659 Data size: 147605232 Basic stats: COMPLETE Column stats: COMPLETE
                        Filter Operator
                          predicate: ((_col23 and _col24 and _col11 and _col20) or (_col25 and _col26 and _col12 and _col21) or (_col27 and _col28 and _col13 and _col21)) (type: boolean)
                          Statistics: Num rows: 628998 Data size: 27676148 Basic stats: COMPLETE Column stats: COMPLETE
                          Map Join Operator
                            condition map:
                                 Inner Join 0 to 1
                            keys:
                              0 _col3 (type: bigint)
                              1 _col0 (type: bigint)
                            outputColumnNames: _col4, _col5, _col6
                            input vertices:
                              1 Map 7
                            Statistics: Num rows: 628998 Data size: 228 Basic stats: COMPLETE Column stats: COMPLETE
                            Group By Operator
                              aggregations: sum(_col4), count(_col4), sum(_col5), count(_col5), sum(_col6), count(_col6)
                              minReductionHashAggr: 0.99
                              mode: hash
                              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
                              Statistics: Num rows: 1 Data size: 256 Basic stats: COMPLETE Column stats: COMPLETE
                              Reduce Output Operator
                                null sort order:
                                sort order:
                                Statistics: Num rows: 1 Data size: 256 Basic stats: COMPLETE Column stats: COMPLETE
                                value expressions: _col0 (type: bigint), _col1 (type: bigint), _col2 (type: decimal(17,2)), _col3 (type: bigint), _col4 (type: decimal(17,2)), _col5 (type: bigint)
{noformat}
Stack:
{noformat}
fail to build predicate filter leaf with errors org.apache.hadoop.hive.ql.metadata.HiveException: Conversion to Parquet FilterPredicate not supported for DECIMAL
org.apache.hadoop.hive.ql.metadata.HiveException: Conversion to Parquet FilterPredicate not supported for DECIMAL
    at org.apache.hadoop.hive.ql.io.parquet.LeafFilterFactory.getLeafFilterBuilderByType(LeafFilterFactory.java:210)
    at org.apache.hadoop.hive.ql.io.parquet.read.ParquetFilterPredicateConverter.buildFilterPredicateFromPredicateLeaf(ParquetFilterPredicateConverter.java:130)
    at org.apache.hadoop.hive.ql.io.parquet.read.ParquetFilterPredicateConverter.translate(ParquetFilterPredicateConverter.java:111)
    at org.apache.hadoop.hive.ql.io.parquet.read.ParquetFilterPredicateConverter.translate(ParquetFilterPredicateConverter.java:97)
    at org.apache.hadoop.hive.ql.io.parquet.read.ParquetFilterPredicateConverter.translate(ParquetFilterPredicateConverter.java:71)
    at org.apache.hadoop.hive.ql.io.parquet.read.ParquetFilterPredicateConverter.translate(ParquetFilterPredicateConverter.java:88)
    at org.apache.hadoop.hive.ql.io.parquet.read.ParquetFilterPredicateConverter.toFilterPredicate(ParquetFilterPredicateConverter.java:57)
    at org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.setFilter(ParquetRecordReaderBase.java:202)
    at org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.getSplit(ParquetRecordReaderBase.java:139)
    at org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.setupMetadataAndParquetSplit(ParquetRecordReaderBase.java:88)
    at org.apache.hadoop.hive.ql.io.parquet.vector.VectorizedParquetRecordReader.<init>(VectorizedParquetRecordReader.java:178)
    at org.apache.hadoop.hive.ql.io.parquet.VectorizedParquetInputFormat.getRecordReader(VectorizedParquetInputFormat.java:52)
    at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:87)
    at org.apache.hadoop.hive.ql.io.RecordReaderWrapper.create(RecordReaderWrapper.java:72)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:460)
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:203)
    at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(TezGroupedSplitsInputFormat.java:152)
    at org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:116)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:68)
    at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:437)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:297)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:280)
    at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
    at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:84)
    at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:70)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
    at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:70)
    at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:40)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at org.apache.hadoop.hive.llap.daemon.impl.StatsRecordingThreadPool$WrappedCallable.call(StatsRecordingThreadPool.java:118)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
{noformat}
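The failing call is LeafFilterFactory.getLeafFilterBuilderByType, which has no builder for the DECIMAL type, so ParquetFilterPredicateConverter drops the whole leaf. For illustration only, a minimal sketch of what one pushed-down decimal bound could look like with parquet-mr's FilterApi, assuming the file stores decimal(7,2) as an unscaled INT64 value (the class and helper names below are hypothetical and are not existing Hive code):
{code:java}
import java.math.BigDecimal;

import org.apache.parquet.filter2.predicate.FilterApi;
import org.apache.parquet.filter2.predicate.FilterPredicate;

// Hypothetical sketch: build ss_sales_price BETWEEN 100 AND 150 against the
// unscaled INT64 representation of the decimal(7,2) column from the plan above.
public class DecimalPushdownSketch {

  // Rescales a literal to the column's scale, e.g. 100 -> 10000 for scale 2.
  static long toUnscaledLong(BigDecimal literal, int scale) {
    return literal.setScale(scale).unscaledValue().longValueExact();
  }

  public static void main(String[] args) {
    int scale = 2; // decimal(7,2) per the plan above

    FilterPredicate pred = FilterApi.and(
        FilterApi.gtEq(FilterApi.longColumn("ss_sales_price"),
            toUnscaledLong(new BigDecimal("100"), scale)),
        FilterApi.ltEq(FilterApi.longColumn("ss_sales_price"),
            toUnscaledLong(new BigDecimal("150"), scale)));

    System.out.println(pred);
  }
}
{code}
Files may also store decimals as INT32, BINARY or FIXED_LEN_BYTE_ARRAY, where a plain byte-wise comparison does not match signed decimal ordering, so a real implementation would have to branch on the column's physical type; that is presumably part of why the conversion has been left unsupported so far.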