Re: Issues with LAST_VALUE aggregation UDF

Josh Elser Tue, 11 Sep 2018 08:15:34 -0700

The versions you provided in the description make it sound like you'reactually using HDP's distribution of Apache Phoenix, not an officialApache Phoenix release.

Please test against an Apache Phoenix release or contact Hortonworks forsupport. It would not be unheard of that this issue has already been fixed.


On 9/11/18 10:01 AM, Abhishek Gupta wrote:

Hi,
We are trying to use the LAST_VALUE aggregation UDF in our Phoenix SQLqueries that serve REST APIs over analytics data. However we are seeinga number of issues/limitations in this function.First, there seems to be a bug where it fails for sparse columns thathave NULLs. Have created a JIRA ticket for that containing the detailsof the exception and steps to reproduce
https://issues.apache.org/jira/browse/PHOENIX-4898
Second, the function works in a way that it ignores all the group bycolumns except the column in the order by clause in finding the lastvalue. For exampleFollowing is a dataset (pk1,pk2,pk3,pk4 and pk5 are Primary key columnsand m1-m3 are metric columns, I intend to do a SUM on m2 and m3 and alast value on m1
    pk1 .   | pk2                       | pk3 .    | pk4            |
    pk5 .   | m1         | m2 | m3
    
------------------------------------------------------------------------------------------------------
| test | 201808010000 | app1 | plat1 | lang1 | 1 | 10 | 100 || test | 201808010000 | app1 | plat1 | lang2 | 2 | 10 | 100 || test | 201808010000 | app1 | plat2 | lang1 | 3 | 10 | 100 || test | 201808010000 | app2 | plat1 | lang1 | 4 | 10 | 100 |
    | test    | 201808010030    | app1    | plat1          | lang1  |
    10           | 10  | 100  |
    | test    | 201808010030    | app1    | plat1          | lang2  |
    20           | 10  | 100  |
    | test    | 201808010030    | app1    | plat2          | lang1  |
    30           | 10  | 100  |
    | test    | 201808010030    | app2    | plat1          | lang1  |
    40           | 10  | 100  |
    | test    | 201808010100    | app1    | plat1          | lang1  |
    100          | 10  | 100  |
    | test    | 201808010100    | app1    | plat1          | lang2  |
    200          | 10  | 100  |
    | test    | 201808010100    | app1    | plat2          | lang1  |
    300          | 10  | 100  |
    | test    | 201808010100    | app2    | plat1          | lang1  |
    400          | 10  | 100  |
    | test    | 201808010130    | app1    | plat1          | lang1  |
    1000         | 10  | 100  |
    | test    | 201808010130    | app1    | plat1          | lang2  |
    2000         | 10  | 100  |
    | test    | 201808010130    | app1    | plat2          | lang1  |
    3000         | 10  | 100  |
    | test    | 201808010130    | app2    | plat1          | lang1  |
    4000         | 10  | 100  |
    | test    | 201808010200    | app1    | plat1          | lang1  |
    10000        | 10  | 100  |
    | test    | 201808010200    | app1    | plat1          | lang2  |
    20000        | 10  | 100  |
    | test    | 201808010200    | app1    | plat2          | lang1  |
    30000        | 10  | 100  |
    | test    | 201808010200    | app2    | plat1          | lang1  |
    40000        | 10  | 100  |


If I run the following query (using all primary key columns in group by)

    select TO_CHAR(TO_DATE(pk2,'yyyyMMddHHmm'),'yyyyMMddHH') as
    t,pk3,pk4,pk5, last_value(m1) within group (order by pk2 asc) as oi,
    sum(m2), sum(m3) from test_table group by pk1,t,pk3,pk4,pk5;
I get the correct result for last value as for each pk2 value which isused in order by there is only 1 row
    
+-------------+---------+----------------+--------+--------+----------+----------+
    |      T      | pk3  | pk4  |  pk5  |   OI   | SUM(M2)  | SUM(M3)  |
    
+-------------+---------+----------------+--------+--------+----------+----------+
| 2018080100 | app1 | plat1 | lang1 | 10 | 20 | 200 || 2018080100 | app1 | plat1 | lang2 | 20 | 20 | 200 || 2018080100 | app1 | plat2 | lang1 | 30 | 20 | 200 || 2018080100 | app2 | plat1 | lang1 | 40 | 20 | 200 || 2018080101 | app1 | plat1 | lang1 | 1000 | 20 | 200 || 2018080101 | app1 | plat1 | lang2 | 2000 | 20 | 200 || 2018080101 | app1 | plat2 | lang1 | 3000 | 20 | 200 || 2018080101 | app2 | plat1 | lang1 | 4000 | 20 | 200 || 2018080102 | app1 | plat1 | lang1 | 10000 | 20 | 100 || 2018080102 | app1 | plat1 | lang2 | 20000 | 10 | 100 || 2018080102 | app1 | plat2 | lang1 | 30000 | 10 | 100 || 2018080102 | app2 | plat1 | lang1 | 40000 | 10 | 100 |
    
+-------------+---------+----------------+--------+--------+----------+----------+
However if I do I group by on less than all the primary columns theLAST_VALUE function ignores the rest of the group by columns in sortingand returns incorrect last_value
    select TO_CHAR(TO_DATE(pk2,'yyyyMMddHHmm'),'yyyyMMddHH') as t, pk3,
    last_value(m1) within group (order by pk2 asc) as oi, sum(m2),
    sum(m3) from test_table group by pk1,t,pk3;


    +-------------+---------+--------+----------+----------+
    |      T      | pk3  |   OI   | SUM(M2)  | SUM(M3)  |
    +-------------+---------+--------+----------+----------+
    | 2018080100  | app1    | 10     | 60       | 600      |
    | 2018080100  | app2    | 40     | 20       | 200      |
    | 2018080101  | app1    | 1000   | 60       | 600      |
    | 2018080101  | app2    | 4000   | 20       | 200      |
    | 2018080102  | app1    | 10000  | 40       | 300      |
    | 2018080102  | app2    | 40000  | 10       | 100      |
+-------------+---------+--------+----------+----------+So instead of taking the last value of the group formed by 2018080100and app1 i.e 30 it is picking the first i.e 10.
  Thanks,
Abhishek

Re: Issues with LAST_VALUE aggregation UDF

Reply via email to