Hi,

I know there is an improvement in Blink SQL that can handle the top-k
problem for SQL like the one shown below by maintaining an in-memory "heap"
of the top k records. That is not a problem when a user's score only ever
increases.

> SELECT user_id, score
> FROM (
>   SELECT *,
>     ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num
>   FROM user_scores)
> WHERE row_num <= 3
>
>
My question is how to handle the top-k problem when a user's score can
decrease as well. Suppose there is a SQL query like this:

> SELECT user_id, score
> FROM (
>   SELECT *,
>     ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num
>   FROM (
>       SELECT user_id, LAST_VAL(score) AS score
>       FROM user_scores
>       GROUP BY user_id))
> WHERE row_num <= 3
>
`user_scores` is a dynamic table converted from a DataStream, and `LAST_VAL`
is a UDAF that returns the latest value. So a user's score can increase but
also decrease from time to time.

So, if we only maintain a heap of the top k elements, there is a problem.
For example, suppose three users A, B, C with scores 4, 3, 2 are already
stored in a top-3 heap. If the next record is user D with score 1, it is
dropped because its score is lower than A's, B's, and C's. But if the record
after that updates user A's score to 0, the true top-3 becomes B, C, and D,
yet there is no way to get user D back when using a heap in this case. The
heap works fine in batch mode because users' scores won't change over time.
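To make the failure mode concrete, here is a minimal Python sketch (names
and structure are illustrative only, not Flink code): a fixed-size min-heap
of (score, user) pairs that keeps only the current top-3.

```python
import heapq

TOP_K = 3

# Min-heap of (score, user); heap[0] is the lowest score among the top k.
heap = []

def offer(user, score):
    """Offer a record to the heap, evicting the lowest score if full."""
    heapq.heappush(heap, (score, user))
    if len(heap) > TOP_K:
        heapq.heappop(heap)  # drop the record with the lowest score

# A, B, C arrive with scores 4, 3, 2.
for user, score in [("A", 4), ("B", 3), ("C", 2)]:
    offer(user, score)

# D arrives with score 1 and is evicted immediately.
offer("D", 1)

# A's score is updated to 0. With only the heap, the best we can do is
# replace A's entry; D was already dropped and cannot be restored.
heap = [(s, u) for s, u in heap if u != "A"]
heapq.heapify(heap)
offer("A", 0)

top3 = sorted(heap, reverse=True)
# The true top-3 is B, C, D, but the heap reports B, C, A.
print(top3)  # [(3, 'B'), (2, 'C'), (0, 'A')]
```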

In this case, I think it has to fall back to storing all users and their
scores, and recomputing the top k every time a new record arrives. If the
heap optimization won't work in streaming mode, is there any other
optimization that can be applied here? It is not necessary to focus on SQL
only; any improvement at the DataStream level is also welcome. Thank you.
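For reference, the fallback I have in mind looks roughly like this sketch
(plain Python for illustration; in Flink this state would live in keyed
state inside a ProcessFunction, and the names here are my own, not a real
API):

```python
TOP_K = 3

# All users' latest scores, not just the current top k.
scores = {}

def on_record(user, score):
    """Apply the latest score for a user and return the current top-k."""
    scores[user] = score
    # Recomputing via a full sort is O(n log n) per record; an ordered
    # structure keyed by score could make this cheaper.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:TOP_K]

for user, score in [("A", 4), ("B", 3), ("C", 2), ("D", 1)]:
    on_record(user, score)

# When A drops to 0, D correctly re-enters the top-3.
print(on_record("A", 0))  # [('B', 3), ('C', 2), ('D', 1)]
```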

Best Regards,
Tony Wei
