Chaozhong Yang created HIVE-15872:
-------------------------------------
Summary: The PERCENTILE UDAF does not work with empty set
Key: HIVE-15872
URL: https://issues.apache.org/jira/browse/HIVE-15872
Project: Hive
Issue Type: Bug
Components: UDF
Reporter: Chaozhong Yang
Assignee: Chaozhong Yang
Fix For: 2.1.2
1. Original SQL:
select
percentile_approx(
column0,
array(0.50, 0.70, 0.90, 0.95, 0.99)
)
from
my_table
where
date = '20170207'
and column1 = 'value1'
and column2 = 'value2'
and column3 = 'value3'
and column4 = 'value4'
and column5 = 'value5'
2. Exception StackTrace:
Error: java.lang.RuntimeException:
org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while
processing row (tag=0) {"key":{},"value":{"_col0":[0.0,10000.0]}} at
org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:256) at
org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:453) at
org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:401) at
org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163) at
java.security.AccessController.doPrivileged(Native Method) at
javax.security.auth.Subject.doAs(Subject.java:422) at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158) Caused by:
org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while
processing row (tag=0) {"key":{},"value":{"_col0":[0.0,10000.0]}} at
org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:244) ...
7 more Caused by: org.apache.hadoop.hive.ql.metadata.HiveException:
java.lang.IndexOutOfBoundsException: Index: 2, Size: 2 at
org.apache.hadoop.hive.ql.exec.GroupByOperator.process(GroupByOperator.java:766)
at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:235)
... 7 more Caused by: java.lang.IndexOutOfBoundsException: Index: 2, Size: 2 at
java.util.ArrayList.rangeCheck(ArrayList.java:653) at
java.util.ArrayList.get(ArrayList.java:429) at
org.apache.hadoop.hive.ql.udf.generic.NumericHistogram.merge(NumericHistogram.java:134)
at
org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox$GenericUDAFPercentileApproxEvaluator.merge(GenericUDAFPercentileApprox.java:318)
at
org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator.aggregate(GenericUDAFEvaluator.java:188)
at
org.apache.hadoop.hive.ql.exec.GroupByOperator.updateAggregations(GroupByOperator.java:612)
at
org.apache.hadoop.hive.ql.exec.GroupByOperator.processAggr(GroupByOperator.java:851)
at
org.apache.hadoop.hive.ql.exec.GroupByOperator.processKey(GroupByOperator.java:695)
at
org.apache.hadoop.hive.ql.exec.GroupByOperator.process(GroupByOperator.java:761)
... 8 more
3. review data:
select
column0
from
my_table
where
date = '20170207'
and column1 = 'value1'
and column2 = 'value2'
and column3 = 'value3'
and column4 = 'value4'
and column5 = 'value5'
After run this sql, we found the result is NULL.
4. what's the meaning of [0.0, 10000.0] in stacktrace?
In GenericUDAFPercentileApproxEvaluator, the method `merge` should process an
ArrayList which name is partialHistogram. Normally, the basic structure of
partialHistogram is [npercentiles, percentile0, percentile1..., nbins, bin0.x,
bin0.y, bin1.x, bin1.y,...]. However, if we are process NULL(empty set) column
values, the partialHistoram will only contains [npercentiles(0), nbins(10000)].
That's the reason why the stacktrace shows a strange row data:
{"key":{},"value":{"_col0":[0.0,10000.0]}}
Before we call histogram#merge (on-line hisgoram algorithm from paper:
http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf ), the
partialHistogram should remove elements which store percentiles like
`partialHistogram.subList(0, nquantiles+1).clear();`. In the case of empty set,
GenericUDAFPercentileApproxEvaluator will not remove percentiles. Consequently,
NumericHistogram will merge a list which contains only 2 elements([0, 10000.0])
and throws IndexOutOfBoundsException.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)