Hi,

I’m not sure if there is some simple way of doing that (maybe some other 
contributors will know more).

There are two potential ideas worth exploring:
- use periodically triggered save points for monitoring? If I remember 
correctly save points are never incremental
- use save point input/output format to analyse the content of the save point? 
[1]

I hope that someone else from the community will be able to help more here.

Piotrek

[1] https://flink.apache.org/feature/2019/09/13/state-processor-api.html 
<https://flink.apache.org/feature/2019/09/13/state-processor-api.html>

> On 22 Nov 2019, at 22:48, Aaron Langford <aaron.langfor...@gmail.com> wrote:
> 
> Hey Flink Community,
> 
> I'm working on a Flink application where we are implementing operators that 
> extend the RichFlatMap and RichCoFlatMap interfaces. As a result, we are 
> working directly with Flink's state API (ValueState, ListState, MapState). 
> Something that appears to be extremely valuable is having a way to monitor 
> the state size for each operator. My team has already run into a few cases 
> where our state has exploded and jobs fail because YARN kills containers who 
> are exceeding their memory limits.
> 
> It is my understanding that the way to best monitor this kind of thing by 
> watching checkpoint size per operator instance. This gets a little confusing 
> when doing incremental check-pointing because the numbers reported seem to be 
> a delta in state size, not the actual state size at that point in time. For 
> my teams application, the total state size is not the sum of those deltas. 
> What is the best way to get the total size of a checkpoint per operator for 
> each checkpoint?
> 
> Additionally, monitoring de-serializing and serializing state in a Flink 
> application is something that I haven't seen a great story for yet. It seems 
> that some of the really badly written Flink operators tend to do most poorly 
> when they demand lots of serde for each record. So giving visibility into how 
> well an application is handling these types of operations seems to be a 
> valuable guard rail for flink developers. Does anyone have existing solutions 
> for this, or are there pointers to some work that can be done to improve this 
> story?
> 
> Aaron

Reply via email to