[ https://issues.apache.org/jira/browse/HIVE-15872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15860908#comment-15860908 ]
Chaozhong Yang edited comment on HIVE-15872 at 2/10/17 8:30 AM:
----------------------------------------------------------------

According to the SQL standard, percentile_approx should return NULL for an empty set rather than throw an IndexOutOfBoundsException in the reducer:

```
@Override
public Object terminate(AggregationBuffer agg) throws HiveException {
  PercentileAggBuf myagg = (PercentileAggBuf) agg;
  if (myagg.histogram.getUsedBins() < 1) {
    // SQL standard - return NULL for an empty set
    return null;
  } else {
    assert(myagg.quantiles != null);
    return new DoubleWritable(myagg.histogram.quantile(myagg.quantiles[0]));
  }
}
```

> The PERCENTILE UDAF does not work with empty set
> ------------------------------------------------
>
> Key: HIVE-15872
> URL: https://issues.apache.org/jira/browse/HIVE-15872
> Project: Hive
> Issue Type: Bug
> Components: UDF
> Reporter: Chaozhong Yang
> Assignee: Chaozhong Yang
> Fix For: 2.1.2
>
> Attachments: HIVE-15872.patch
>
> 1. Original SQL:
>
> select
>   percentile_approx(
>     column0,
>     array(0.50, 0.70, 0.90, 0.95, 0.99)
>   )
> from
>   my_table
> where
>   date = '20170207'
>   and column1 = 'value1'
>   and column2 = 'value2'
>   and column3 = 'value3'
>   and column4 = 'value4'
>   and column5 = 'value5'
>
> 2. Exception stack trace:
>
> Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{},"value":{"_col0":[0.0,10000.0]}}
>   at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:256)
>   at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:453)
>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:401)
>   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
>   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{},"value":{"_col0":[0.0,10000.0]}}
>   at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:244)
>   ... 7 more
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.IndexOutOfBoundsException: Index: 2, Size: 2
>   at org.apache.hadoop.hive.ql.exec.GroupByOperator.process(GroupByOperator.java:766)
>   at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:235)
>   ... 7 more
> Caused by: java.lang.IndexOutOfBoundsException: Index: 2, Size: 2
>   at java.util.ArrayList.rangeCheck(ArrayList.java:653)
>   at java.util.ArrayList.get(ArrayList.java:429)
>   at org.apache.hadoop.hive.ql.udf.generic.NumericHistogram.merge(NumericHistogram.java:134)
>   at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox$GenericUDAFPercentileApproxEvaluator.merge(GenericUDAFPercentileApprox.java:318)
>   at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator.aggregate(GenericUDAFEvaluator.java:188)
>   at org.apache.hadoop.hive.ql.exec.GroupByOperator.updateAggregations(GroupByOperator.java:612)
>   at org.apache.hadoop.hive.ql.exec.GroupByOperator.processAggr(GroupByOperator.java:851)
>   at org.apache.hadoop.hive.ql.exec.GroupByOperator.processKey(GroupByOperator.java:695)
>   at org.apache.hadoop.hive.ql.exec.GroupByOperator.process(GroupByOperator.java:761)
>   ... 8 more
>
> 3. Review data:
>
> select
>   column0
> from
>   my_table
> where
>   date = '20170207'
>   and column1 = 'value1'
>   and column2 = 'value2'
>   and column3 = 'value3'
>   and column4 = 'value4'
>   and column5 = 'value5'
>
> After running this SQL, we found the result is NULL.
>
> 4. What is the meaning of [0.0, 10000.0] in the stack trace?
>
> In GenericUDAFPercentileApproxEvaluator, the `merge` method processes an ArrayList named partialHistogram. Normally, the basic structure of partialHistogram is [npercentiles, percentile0, percentile1, ..., nbins, bin0.x, bin0.y, bin1.x, bin1.y, ...]. However, if we process NULL (empty set) column values, the partialHistogram contains only [npercentiles(0), nbins(10000)].
> That's the reason why the stack trace shows a strange row: {"key":{},"value":{"_col0":[0.0,10000.0]}}. Before calling histogram#merge (the online histogram algorithm from the paper http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf ), the evaluator should remove the elements that store percentiles, e.g. `partialHistogram.subList(0, nquantiles+1).clear();`. In the case of an empty set, GenericUDAFPercentileApproxEvaluator does not remove the percentiles. Consequently, NumericHistogram merges a list that contains only 2 elements ([0.0, 10000.0]) and throws IndexOutOfBoundsException.

-- This message was sent by Atlassian JIRA (v6.3.15#6346)
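For illustration, the failure mode described in item 4 can be reproduced outside Hive with a minimal sketch. The class and method names below are hypothetical, not the actual Hive code: it only mimics the shape of the bug, in which a serialized partial for an empty set holds just the two header values, so reading (x, y) bin pairs runs past the end of the list with the same Index: 2, Size: 2 error, while a guard for the empty case returns NULL instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PercentileApproxSketch {

    // Serialized partial emitted for an empty set: only the two header
    // fields survive -> [npercentiles = 0.0, nbins = 10000.0]
    static List<Double> emptyPartial() {
        return new ArrayList<>(Arrays.asList(0.0, 10000.0));
    }

    // Mimics the failing merge path: element 0 is a header, the rest are
    // (x, y) bin pairs. With a 2-element list, reading the pair at
    // indices (1, 2) calls get(2) on a list of size 2 and throws
    // IndexOutOfBoundsException: Index: 2, Size: 2.
    static int mergeBins(List<Double> partial) {
        int merged = 0;
        for (int i = 1; i < partial.size(); i += 2) {
            double x = partial.get(i);
            double y = partial.get(i + 1); // out of bounds for the empty-set partial
            merged++;
        }
        return merged;
    }

    // Guarded variant in the spirit of the patch: an empty histogram
    // yields NULL instead of being merged.
    static Integer mergeBinsGuarded(List<Double> partial) {
        if (partial.size() <= 2) {
            return null; // empty set: no bins to merge, result is NULL
        }
        return mergeBins(partial);
    }

    public static void main(String[] args) {
        System.out.println(mergeBinsGuarded(emptyPartial())); // prints: null
    }
}
```

The guard mirrors the idea of the `terminate` fix above (check for an empty histogram before touching its contents), applied at the merge step where the original exception is raised.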