Hi, We are trying to use the LAST_VALUE aggregation UDF in our Phoenix SQL queries that serve REST APIs over analytics data. However we are seeing a number of issues/limitations in this function. First, there seems to be a bug where it fails for sparse columns that have NULLs. Have created a JIRA ticket for that containing the details of the exception and steps to reproduce https://issues.apache.org/jira/browse/PHOENIX-4898
Second, the function works in a way that it ignores all the group by columns except the column in the order by clause in finding the last value. For example Following is a dataset (pk1,pk2,pk3,pk4 and pk5 are Primary key columns and m1-m3 are metric columns, I intend to do a SUM on m2 and m3 and a last value on m1 pk1 . | pk2 | pk3 . | pk4 | pk5 . | > m1 | m2 | m3 > > ------------------------------------------------------------------------------------------------------ > | test | 201808010000 | app1 | plat1 | lang1 | 1 > | 10 | 100 | > | test | 201808010000 | app1 | plat1 | lang2 | 2 > | 10 | 100 | > | test | 201808010000 | app1 | plat2 | lang1 | 3 > | 10 | 100 | > | test | 201808010000 | app2 | plat1 | lang1 | 4 > | 10 | 100 | > | test | 201808010030 | app1 | plat1 | lang1 | 10 > | 10 | 100 | > | test | 201808010030 | app1 | plat1 | lang2 | 20 > | 10 | 100 | > | test | 201808010030 | app1 | plat2 | lang1 | 30 > | 10 | 100 | > | test | 201808010030 | app2 | plat1 | lang1 | 40 > | 10 | 100 | > | test | 201808010100 | app1 | plat1 | lang1 | 100 > | 10 | 100 | > | test | 201808010100 | app1 | plat1 | lang2 | 200 > | 10 | 100 | > | test | 201808010100 | app1 | plat2 | lang1 | 300 > | 10 | 100 | > | test | 201808010100 | app2 | plat1 | lang1 | 400 > | 10 | 100 | > | test | 201808010130 | app1 | plat1 | lang1 | 1000 > | 10 | 100 | > | test | 201808010130 | app1 | plat1 | lang2 | 2000 > | 10 | 100 | > | test | 201808010130 | app1 | plat2 | lang1 | 3000 > | 10 | 100 | > | test | 201808010130 | app2 | plat1 | lang1 | 4000 > | 10 | 100 | > | test | 201808010200 | app1 | plat1 | lang1 | 10000 > | 10 | 100 | > | test | 201808010200 | app1 | plat1 | lang2 | 20000 > | 10 | 100 | > | test | 201808010200 | app1 | plat2 | lang1 | 30000 > | 10 | 100 | > | test | 201808010200 | app2 | plat1 | lang1 | 40000 > | 10 | 100 | If I run the following query (using all primary key columns in group by) select TO_CHAR(TO_DATE(pk2,'yyyyMMddHHmm'),'yyyyMMddHH') as t,pk3,pk4,pk5, > last_value(m1) within group (order by pk2 asc) as oi, sum(m2), sum(m3) from > test_table group by pk1,t,pk3,pk4,pk5; I get the correct result for last value as for each pk2 value which is used in order by there is only 1 row > > +-------------+---------+----------------+--------+--------+----------+----------+ > | T | pk3 | pk4 | pk5 | OI | SUM(M2) | SUM(M3) | > > +-------------+---------+----------------+--------+--------+----------+----------+ > | 2018080100 | app1 | plat1 | lang1 | 10 | 20 | > 200 | > | 2018080100 | app1 | plat1 | lang2 | 20 | 20 | > 200 | > | 2018080100 | app1 | plat2 | lang1 | 30 | 20 | > 200 | > | 2018080100 | app2 | plat1 | lang1 | 40 | 20 | > 200 | > | 2018080101 | app1 | plat1 | lang1 | 1000 | 20 | > 200 | > | 2018080101 | app1 | plat1 | lang2 | 2000 | 20 | > 200 | > | 2018080101 | app1 | plat2 | lang1 | 3000 | 20 | > 200 | > | 2018080101 | app2 | plat1 | lang1 | 4000 | 20 | > 200 | > | 2018080102 | app1 | plat1 | lang1 | 10000 | 20 | > 100 | > | 2018080102 | app1 | plat1 | lang2 | 20000 | 10 | > 100 | > | 2018080102 | app1 | plat2 | lang1 | 30000 | 10 | > 100 | > | 2018080102 | app2 | plat1 | lang1 | 40000 | 10 | > 100 | > > +-------------+---------+----------------+--------+--------+----------+----------+ However if I do I group by on less than all the primary columns the LAST_VALUE function ignores the rest of the group by columns in sorting and returns incorrect last_value select TO_CHAR(TO_DATE(pk2,'yyyyMMddHHmm'),'yyyyMMddHH') as t, pk3, > last_value(m1) within group (order by pk2 asc) as oi, sum(m2), sum(m3) from > test_table group by pk1,t,pk3; +-------------+---------+--------+----------+----------+ > | T | pk3 | OI | SUM(M2) | SUM(M3) | > +-------------+---------+--------+----------+----------+ > | 2018080100 | app1 | 10 | 60 | 600 | > | 2018080100 | app2 | 40 | 20 | 200 | > | 2018080101 | app1 | 1000 | 60 | 600 | > | 2018080101 | app2 | 4000 | 20 | 200 | > | 2018080102 | app1 | 10000 | 40 | 300 | > | 2018080102 | app2 | 40000 | 10 | 100 | > +-------------+---------+--------+----------+----------+ So instead of taking the last value of the group formed by 2018080100 and app1 i.e 30 it is picking the first i.e 10. Thanks, Abhishek