mxm commented on PR #726: URL: https://github.com/apache/flink-kubernetes-operator/pull/726#issuecomment-1855592264
> However I am personally very much in favour of enabling it at least for the GC time for the following reason:
>
> When you spend a considerable time in GC (30% is actually a lot) the processing is probably extremely slow compared to a state where you have normal gc time. In my experience when you are above 30% GC you are usually way above it and the degradation is massive, in most of these cases processing comes to almost a complete halt relatively speaking.
>
> This means that true processing rate and other time based measurements are completely off (very low compared to the actual value without memory pressure). So the scale up in this case would be very much overshooting, risking a large resource / cost spike that is basically 100% wrong. I think this is a big production risk that can affect trust in the autoscaler.

I fully understand the rationale for blocking scaling decisions, but I draw different conclusions from these scenarios. Consider the following: we scaled down to parallelism 1 at night because there was no traffic. In the morning, either immediately or as traffic ramps up and we scale up gradually, we get trapped by the GC feature. I would prefer overscaling to getting stuck. The scaling would be capped in resources either by the existing max parallelism setting or by an upcoming fairness feature which will allow allocating only a fraction of the max cluster resources.

I believe this PR is part of a solution to handle GC pressure / heap memory issues, but I'm not convinced that blocking scaling is the behavior I would like to see. Ultimately, memoization of the processing capacity, processing rate, and GC pressure at each parallelism would help to build a simple model of which scaling decisions to prevent. For example, if we end up in a high GC scenario under a certain input rate, we would block scaling to that parallelism (or a close parallelism).
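To make the memoization idea a bit more concrete, here is a minimal sketch in Python. It is not operator code: `Observation`, `ScalingMemo`, the 0.3 threshold default, and the "closeness" tolerance are all hypothetical names and values chosen for illustration. The `choose_parallelism` helper is one possible reading of "prevent that parallelism without getting stuck": it moves to the next higher parallelism, capped by max parallelism.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Observation:
    """One remembered data point for a given parallelism (hypothetical model)."""
    parallelism: int
    input_rate: float       # records/s arriving while at this parallelism
    processing_rate: float  # records/s actually processed
    gc_fraction: float      # share of time spent in GC, 0.0 to 1.0


@dataclass
class ScalingMemo:
    gc_threshold: float = 0.3       # assumed default, mirrors the 30% quoted above
    parallelism_tolerance: int = 1  # how near a parallelism must be to count as "close"
    history: list = field(default_factory=list)

    def record(self, obs: Observation) -> None:
        self.history.append(obs)

    def is_risky_target(self, target_parallelism: int, input_rate: float) -> bool:
        """True if a close parallelism already showed high GC at this input rate or lower."""
        for obs in self.history:
            close = abs(obs.parallelism - target_parallelism) <= self.parallelism_tolerance
            high_gc = obs.gc_fraction > self.gc_threshold
            if close and high_gc and input_rate >= obs.input_rate:
                return True
        return False

    def choose_parallelism(self, desired: int, input_rate: float, max_parallelism: int) -> int:
        """Skip risky targets by moving up, never by refusing to scale."""
        candidate = desired
        while candidate < max_parallelism and self.is_risky_target(candidate, input_rate):
            candidate += 1
        return candidate
```

With a single remembered high-GC observation at parallelism 2, a later request for parallelism 2 at the same input rate would be redirected to the first parallelism outside the tolerance window, rather than being blocked outright.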
I feel quite strongly about never blocking scaling because this is the worst kind of outcome for the user. I'm ok with merging this feature, but I would make sure it is disabled for us until we address the above concerns.

Generally, I wonder whether it makes sense to go back to the drawing board for these kinds of impactful features. It is good to have discussions upfront because the implementation can be a lot of work. Again, nothing wrong with this change, thank you for making it possible, but I feel like it needs additional work to deliver the full potential value to users.