Hi,

We are trying to use the LAST_VALUE aggregation UDF in our Phoenix SQL
queries that serve REST APIs over analytics data. However, we are seeing a
number of issues/limitations with this function.
First, there seems to be a bug where it fails for sparse columns that
contain NULLs. I have created a JIRA ticket for that, containing the details
of the exception and steps to reproduce:
https://issues.apache.org/jira/browse/PHOENIX-4898

Second, when determining the last value, the function ignores all the
GROUP BY columns except the column in the ORDER BY clause.
For example, the following is a dataset (pk1, pk2, pk3, pk4 and pk5 are
primary key columns and m1-m3 are metric columns); I intend to do a SUM on
m2 and m3 and a LAST_VALUE on m1. A rough sketch of the table definition
follows the data below.

+-------+---------------+-------+--------+--------+--------+-----+------+
| pk1   | pk2           | pk3   | pk4    | pk5    | m1     | m2  | m3   |
+-------+---------------+-------+--------+--------+--------+-----+------+
| test  | 201808010000  | app1  | plat1  | lang1  | 1      | 10  | 100  |
| test  | 201808010000  | app1  | plat1  | lang2  | 2      | 10  | 100  |
| test  | 201808010000  | app1  | plat2  | lang1  | 3      | 10  | 100  |
| test  | 201808010000  | app2  | plat1  | lang1  | 4      | 10  | 100  |
| test  | 201808010030  | app1  | plat1  | lang1  | 10     | 10  | 100  |
| test  | 201808010030  | app1  | plat1  | lang2  | 20     | 10  | 100  |
| test  | 201808010030  | app1  | plat2  | lang1  | 30     | 10  | 100  |
| test  | 201808010030  | app2  | plat1  | lang1  | 40     | 10  | 100  |
| test  | 201808010100  | app1  | plat1  | lang1  | 100    | 10  | 100  |
| test  | 201808010100  | app1  | plat1  | lang2  | 200    | 10  | 100  |
| test  | 201808010100  | app1  | plat2  | lang1  | 300    | 10  | 100  |
| test  | 201808010100  | app2  | plat1  | lang1  | 400    | 10  | 100  |
| test  | 201808010130  | app1  | plat1  | lang1  | 1000   | 10  | 100  |
| test  | 201808010130  | app1  | plat1  | lang2  | 2000   | 10  | 100  |
| test  | 201808010130  | app1  | plat2  | lang1  | 3000   | 10  | 100  |
| test  | 201808010130  | app2  | plat1  | lang1  | 4000   | 10  | 100  |
| test  | 201808010200  | app1  | plat1  | lang1  | 10000  | 10  | 100  |
| test  | 201808010200  | app1  | plat1  | lang2  | 20000  | 10  | 100  |
| test  | 201808010200  | app1  | plat2  | lang1  | 30000  | 10  | 100  |
| test  | 201808010200  | app2  | plat1  | lang1  | 40000  | 10  | 100  |
+-------+---------------+-------+--------+--------+--------+-----+------+
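
For reference, here is a rough sketch of the table definition. The column
types are my assumptions based on the data above (pk2 as VARCHAR because
the queries parse it with TO_DATE, and the metrics as INTEGER), not
necessarily our exact DDL:

CREATE TABLE test_table (
    pk1 VARCHAR NOT NULL,
    pk2 VARCHAR NOT NULL,
    pk3 VARCHAR NOT NULL,
    pk4 VARCHAR NOT NULL,
    pk5 VARCHAR NOT NULL,
    m1 INTEGER,
    m2 INTEGER,
    m3 INTEGER,
    -- composite row key made up of the five pk columns
    CONSTRAINT pk PRIMARY KEY (pk1, pk2, pk3, pk4, pk5)
);

The dataset can then be loaded with one UPSERT per row, e.g.

UPSERT INTO test_table VALUES ('test', '201808010000', 'app1', 'plat1', 'lang1', 1, 10, 100);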


If I run the following query (using all the primary key columns in the GROUP BY):

select TO_CHAR(TO_DATE(pk2,'yyyyMMddHHmm'),'yyyyMMddHH') as t, pk3, pk4, pk5,
       last_value(m1) within group (order by pk2 asc) as oi,
       sum(m2), sum(m3)
from test_table
group by pk1, t, pk3, pk4, pk5;


I get the correct last value, because within each group there is only one
row for each pk2 value (the ORDER BY column):

+-------------+-------+--------+--------+--------+----------+----------+
|      T      | pk3   | pk4    | pk5    |   OI   | SUM(M2)  | SUM(M3)  |
+-------------+-------+--------+--------+--------+----------+----------+
| 2018080100  | app1  | plat1  | lang1  | 10     | 20       | 200      |
| 2018080100  | app1  | plat1  | lang2  | 20     | 20       | 200      |
| 2018080100  | app1  | plat2  | lang1  | 30     | 20       | 200      |
| 2018080100  | app2  | plat1  | lang1  | 40     | 20       | 200      |
| 2018080101  | app1  | plat1  | lang1  | 1000   | 20       | 200      |
| 2018080101  | app1  | plat1  | lang2  | 2000   | 20       | 200      |
| 2018080101  | app1  | plat2  | lang1  | 3000   | 20       | 200      |
| 2018080101  | app2  | plat1  | lang1  | 4000   | 20       | 200      |
| 2018080102  | app1  | plat1  | lang1  | 10000  | 20       | 100      |
| 2018080102  | app1  | plat1  | lang2  | 20000  | 10       | 100      |
| 2018080102  | app1  | plat2  | lang1  | 30000  | 10       | 100      |
| 2018080102  | app2  | plat1  | lang1  | 40000  | 10       | 100      |
+-------------+-------+--------+--------+--------+----------+----------+


However, if I group by fewer than all of the primary key columns, the
LAST_VALUE function ignores the remaining GROUP BY columns when sorting and
returns an incorrect last value:

select TO_CHAR(TO_DATE(pk2,'yyyyMMddHHmm'),'yyyyMMddHH') as t, pk3,
       last_value(m1) within group (order by pk2 asc) as oi,
       sum(m2), sum(m3)
from test_table
group by pk1, t, pk3;


+-------------+-------+--------+----------+----------+
|      T      | pk3   |   OI   | SUM(M2)  | SUM(M3)  |
+-------------+-------+--------+----------+----------+
| 2018080100  | app1  | 10     | 60       | 600      |
| 2018080100  | app2  | 40     | 20       | 200      |
| 2018080101  | app1  | 1000   | 60       | 600      |
| 2018080101  | app2  | 4000   | 20       | 200      |
| 2018080102  | app1  | 10000  | 40       | 300      |
| 2018080102  | app2  | 40000  | 10       | 100      |
+-------------+-------+--------+----------+----------+


So instead of taking the last value of the group formed by 2018080100 and
app1, i.e. 30, it is picking the first, i.e. 10.
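
For what it is worth, the behaviour I expect is the m1 of the row that sorts
last by the full primary key within each group. The sketch below tries to
express that by ordering on a concatenation of pk2 with the remaining key
columns; I have not verified that LAST_VALUE accepts such an ORDER BY
expression or that it avoids the problem above, so treat it as an untested
idea rather than a workaround:

select TO_CHAR(TO_DATE(pk2,'yyyyMMddHHmm'),'yyyyMMddHH') as t, pk3,
       -- break ties on pk2 using the trailing key columns pk4 and pk5
       last_value(m1) within group (order by pk2 || pk4 || pk5 asc) as oi,
       sum(m2), sum(m3)
from test_table
group by pk1, t, pk3;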

Thanks,
Abhishek
