mxm commented on PR #726:
URL: https://github.com/apache/flink-kubernetes-operator/pull/726#issuecomment-1855592264

   >However I am personally very much in favour of enabling it at least for the 
GC time for the following reason:
   
   >When you spend a considerable time in GC (30% is actually a lot) the 
processing is probably extremely slow compared to a state where you have normal 
gc time. In my experience when you are above 30% GC you are usually way above 
it and the degradation is massive, in most of these cases processing comes to 
almost a complete halt relatively speaking.
   
   >This means that true processing rate and other time based measurements are 
completely off (very low compared to the actual value without memory pressure). 
So the scale up in this case would be very much overshooting, risking a large 
resource / cost spike that is basically 100% wrong. I think this is a big 
production risk that can affect trust in the autoscaler.
   
   I fully understand the rationale for blocking scaling decisions, but I think I 
draw different conclusions from these scenarios.
   
   Consider the following scenario: We scaled down to parallelism 1 at night 
because there was no traffic. In the morning, either immediately, or as traffic 
ramps up and we scale up gradually, we get trapped by the GC feature. I would 
prefer overscaling as opposed to getting stuck. The scaling would be capped in 
resources either by the existing max parallelism setting or by an upcoming 
fairness feature which will allow allocating only a fraction of the max cluster 
resources. 
   
   I believe this PR is part of a solution to handle GC pressure / heap memory 
issues, but I'm not convinced blocking scaling is the behavior I would like to 
see. Ultimately, memoization of the processing capacity, processing rate, and 
GC pressure at each parallelism would help to build a simple model of what kind 
of scaling decisions to prevent. For example, if we end up in a high GC 
scenario under a certain input rate, we would block scaling to that parallelism 
(or a close parallelism). I feel quite strongly about never blocking scaling 
because this is the worst kind of outcome for the user.
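
   To make the memoization idea concrete, here is a minimal sketch of how 
such a model could look. Everything in it is hypothetical: the class 
`GcAwareScalingHistory`, its methods, the GC threshold, and the parallelism 
tolerance are illustrative assumptions and not part of the operator or of this 
PR. It records the input rate and GC time fraction observed at each 
parallelism, treats a target parallelism as undesirable if a close parallelism 
already hit high GC at an equal or lower input rate, and falls back to the max 
parallelism instead of ever blocking scaling entirely.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch only; none of these names exist in the operator. */
final class GcAwareScalingHistory {

    /** One memoized observation of a job at a given parallelism. */
    private static final class Observation {
        final int parallelism;
        final double inputRate;       // records per second (assumed unit)
        final double gcTimeFraction;  // 0.0 to 1.0, share of time spent in GC

        Observation(int parallelism, double inputRate, double gcTimeFraction) {
            this.parallelism = parallelism;
            this.inputRate = inputRate;
            this.gcTimeFraction = gcTimeFraction;
        }
    }

    private final List<Observation> history = new ArrayList<>();
    private final double gcThreshold;        // assumed, e.g. 0.3 for 30% GC time
    private final int parallelismTolerance;  // assumed meaning of "a close parallelism"

    GcAwareScalingHistory(double gcThreshold, int parallelismTolerance) {
        this.gcThreshold = gcThreshold;
        this.parallelismTolerance = parallelismTolerance;
    }

    /** Memoize what was observed while running at the given parallelism. */
    void record(int parallelism, double inputRate, double gcTimeFraction) {
        history.add(new Observation(parallelism, inputRate, gcTimeFraction));
    }

    /**
     * A target is considered undesirable if a close parallelism already showed
     * high GC pressure at an input rate no higher than the expected one.
     */
    boolean isBlocked(int targetParallelism, double expectedInputRate) {
        for (Observation o : history) {
            boolean close =
                    Math.abs(o.parallelism - targetParallelism) <= parallelismTolerance;
            boolean highGc = o.gcTimeFraction >= gcThreshold;
            boolean rateAtLeastAsHigh = expectedInputRate >= o.inputRate;
            if (close && highGc && rateAtLeastAsHigh) {
                return true;
            }
        }
        return false;
    }

    /**
     * Never blocks scaling outright: walks upward from the desired parallelism
     * to the first one that is not known to be bad, capped at maxParallelism.
     */
    int chooseParallelism(int desired, double expectedInputRate, int maxParallelism) {
        for (int p = Math.min(desired, maxParallelism); p <= maxParallelism; p++) {
            if (!isBlocked(p, expectedInputRate)) {
                return p;
            }
        }
        return maxParallelism;
    }

    public static void main(String[] args) {
        GcAwareScalingHistory h = new GcAwareScalingHistory(0.3, 1);
        h.record(2, 1000.0, 0.45); // high GC seen at parallelism 2
        System.out.println(h.chooseParallelism(2, 1500.0, 10));
    }
}
```

   The design choice worth noting is the fallback in `chooseParallelism`: the 
history only steers the decision away from parallelisms known to be bad, and 
the worst case is overscaling up to the cap rather than getting stuck.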
   
   I'm ok with merging this feature but I would make sure it is disabled for us 
until we address the above concerns. Generally, I wonder whether it makes sense 
to go back to the drawing board for these kinds of impactful features. It is 
good to have discussions upfront because the implementation can be a lot of 
work. Again, nothing wrong with this change, thank you for making it possible, 
but I feel like it needs additional work to deliver the full potential value to 
users.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.