[ 
https://issues.apache.org/jira/browse/FLINK-25470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17520267#comment-17520267
 ] 

Hangxiang Yu commented on FLINK-25470:
--------------------------------------

I think we may not need to expose these changelog metrics in the Flink UI as a 
first step, but we do need to expose them through the REST API so that we can 
see the complete metrics in visualization tools, e.g. Grafana. It is useful to 
check whether the changelog works well by looking at the metrics of its 
different parts across different jobs. How to expose them in the Flink UI 
deserves further discussion.



After FLINK-25557, IIUC, we have two metrics:
 # checkpointed size. For Changelog, it refers to the incremental size of the 
non-materialization part.
 # full size. For Changelog, it refers to the full size of all parts, both 
materialization and non-materialization.


In my opinion, we may need to expose:
 # incremental size of the materialization part (positive if updated by 
materialization, zero otherwise).
 # full size of the materialization part.
 # full size of the non-materialization part (it could also be inferred from 
the full size and the full size of the materialization part).
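
As a rough sketch of how the three proposed sizes relate (the class and field 
names below are hypothetical and are not Flink's actual metric identifiers), 
the non-materialization full size from item 3 could be derived like this:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangelogCheckpointSizes:
    """Hypothetical container; field names are illustrative only."""

    # Full size of all parts, materialization + non-materialization (bytes).
    full_size: int
    # Full size of the materialization part (bytes).
    materialization_full_size: int
    # Positive if this checkpoint was updated by materialization, else zero.
    materialization_incremental_size: int

    @property
    def non_materialization_full_size(self) -> int:
        # Item 3: inferred as full size minus materialization full size.
        return self.full_size - self.materialization_full_size
```

This only illustrates the arithmetic; exposing the third value directly would 
save consumers from computing it themselves.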


From these metrics, we could roughly infer:
 # the restore time, from the full size of the materialization part and the 
non-materialization part.
 # when a checkpoint includes a new materialization, from the incremental/full 
size of the materialization part.
 # the cleanup efficiency of the non-materialization part, by comparing its 
full size (the real size) with the actual size in the DFS.
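
A minimal sketch of the second and third inferences, assuming the sizes are 
available as plain byte counts. The function names are hypothetical, and 
expressing cleanup efficiency as a ratio is my own assumption about how the 
comparison could be done, not an existing Flink metric:

```python
def includes_new_materialization(materialization_incremental_size: int) -> bool:
    # The incremental size of the materialization part is positive only when
    # the checkpoint was updated by a materialization, and zero otherwise.
    return materialization_incremental_size > 0


def cleanup_efficiency(non_materialization_full_size: int, dfs_size: int) -> float:
    """Share of the changelog bytes stored in the DFS that are still referenced.

    Assumption: 1.0 means nothing stale is left behind; lower values mean
    more leftover data that has not been cleaned up yet.
    """
    if dfs_size < 0 or non_materialization_full_size < 0:
        raise ValueError("sizes must be non-negative")
    if dfs_size == 0:
        # Nothing is stored, so there is nothing left to clean up.
        return 1.0
    return non_materialization_full_size / dfs_size
```

For example, a non-materialization full size of 50 bytes against 100 bytes 
actually present in the DFS gives a ratio of 0.5, i.e. half of the stored 
changelog data is no longer referenced.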


I also think "How much Data Size increases/exploding" has already been answered 
by the current "full size".

I think the other metrics [~ym] mentioned are covered by the above.



BTW, I also wonder whether we need to expose "async duration of materialization 
part". 

The current "async duration" refers to the async duration of the incremental 
checkpoint of the non-materialization part.

If we expose "async duration of materialization part", we could see whether the 
materialization part affects the job.

[~ym] [~roman] WDYT?

> Add/Expose/Differentiate metrics of checkpoint size between changelog size vs 
> materialization size
> --------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-25470
>                 URL: https://issues.apache.org/jira/browse/FLINK-25470
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Metrics, Runtime / State Backends
>            Reporter: Yuan Mei
>            Priority: Major
>             Fix For: 1.16.0
>
>         Attachments: Screen Shot 2021-12-29 at 1.09.48 PM.png
>
>
> FLINK-25557  only resolves part of the problems. 
> Eventually, we should answer questions:
>  * How much Data Size increases/exploding
>  * When a checkpoint includes a new Materialization
>  * Materialization size
>  * changelog sizes from the last complete checkpoint (that can roughly infer 
> restore time)
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
