这个按照目前的设计,应该不能算是bug,应该是by desigh的。 主要问题还是因为有两层agg,第一层的agg的retract会导致第二层的agg重新计算和下发结果。
dixingxing85 <dixingxin...@163.com>于2020年4月18日 周六上午11:38写道: > 多谢benchao, > 我这个作业的结果预期结果是每天只有一个结果,这个结果应该是越来越大的,比如: > 20200417,86 > 20200417,90 > 20200417,130 > 20200417,131 > > 而不应该是忽大忽小的,数字由大变小,这样的结果需求方肯定不能接受的: > 20200417,90 > 20200417,86 > 20200417,130 > 20200417,86 > 20200417,131 > > 我的疑问是内层的group by产生的retract流,会影响sink吗,我是在sink端打的日志。 > 如果flink支持这种两层group by的话,那这种结果变小的情况应该算是bug吧? > > Sent from my iPhone > > On Apr 18, 2020, at 10:08, Benchao Li <libenc...@gmail.com> wrote: > > > > Hi, > > 这个是支持的哈。 > 你看到的现象是因为group by会产生retract结果,也就是会先发送-[old],再发送+[new]. > 如果是两层的话,就成了: > 第一层-[old], 第二层-[cur], +[old] > 第一层+[new], 第二层[-old], +[new] > > dixingxin...@163.com <dixingxin...@163.com> 于2020年4月18日周六 上午2:11写道: > >> >> Hi all: >> >> 我们有个streaming sql得到的结果不正确,现象是sink得到的数据一会大一会小,*我们想确认下,这是否是个bug, >> 或者flink还不支持这种sql*。 >> 具体场景是:先group by A, B两个维度计算UV,然后再group by A 把维度B的UV sum起来,对应的SQL如下:(A -> >> dt, B -> pvareaid) >> >> SELECT dt, SUM(a.uv) AS uv >> FROM ( >> SELECT dt, pvareaid, COUNT(DISTINCT cuid) AS uv >> FROM streaming_log_event >> WHERE action IN ('action1') >> AND pvareaid NOT IN ('pv1', 'pv2') >> AND pvareaid IS NOT NULL >> GROUP BY dt, pvareaid >> ) a >> GROUP BY dt; >> >> sink接收到的数据对应日志为: >> >> 2020-04-17 22:28:38,727 INFO groupBy xx -> to: Tuple2 -> Sink: Unnamed >> (1/1) (GeneralRedisSinkFunction.invoke:169) - receive >> data(false,0,86,20200417) >> 2020-04-17 22:28:38,727 INFO groupBy xx -> to: Tuple2 -> Sink: Unnamed >> (1/1) (GeneralRedisSinkFunction.invoke:169) - receive >> data(true,0,130,20200417) >> 2020-04-17 22:28:39,327 INFO groupBy xx -> to: Tuple2 -> Sink: Unnamed >> (1/1) (GeneralRedisSinkFunction.invoke:169) - receive >> data(false,0,130,20200417) >> 2020-04-17 22:28:39,327 INFO groupBy xx -> to: Tuple2 -> Sink: Unnamed >> (1/1) (GeneralRedisSinkFunction.invoke:169) - receive >> data(true,0,86,20200417) >> 2020-04-17 22:28:39,327 INFO groupBy xx -> to: Tuple2 -> Sink: Unnamed >> (1/1) (GeneralRedisSinkFunction.invoke:169) - receive >> data(false,0,86,20200417) >> 2020-04-17 22:28:39,328 INFO groupBy xx -> to: Tuple2 -> Sink: Unnamed >> (1/1) (GeneralRedisSinkFunction.invoke:169) - receive >> data(true,0,131,20200417) >> >> >> 我们使用的是1.7.2, 测试作业的并行度为1。 >> 这是对应的 issue: https://issues.apache.org/jira/browse/FLINK-17228 >> >> >> ------------------------------ >> dixingxin...@163.com >> >> > > -- > > Benchao Li > School of Electronics Engineering and Computer Science, Peking University > Tel:+86-15650713730 > Email: libenc...@gmail.com; libenc...@pku.edu.cn > > -- Benchao Li School of Electronics Engineering and Computer Science, Peking University Tel:+86-15650713730 Email: libenc...@gmail.com; libenc...@pku.edu.cn