On Mon, 6 Apr 2026 18:05:22 -0700 SeongJae Park <[email protected]> wrote:
Hi SJ,

> TL; DR: Let users set different DAMOS quota charge ratios for DAMOS
> action failed regions, for deterministic and consistent DAMOS action
> progress.
>
> Common Reports: Unexpectedly Slow DAMOS
> =======================================
>
> One common issue report that we get from DAMON users is that DAMOS
> action applying progress speed is sometimes much slower than expected.
> And one common root cause is that the DAMOS quota is exceeded by the
> action applying failed memory regions.
>
> For example, a group of users tried to run DAMOS-based proactive memory
> reclamation (DAMON_RECLAIM) with 100 MiB per second DAMOS quota. They
> ran it on a system having no active workload which means all memory of
> the system is cold. The expectation was that the system will show 100
> MiB per second reclamation until (nearly) all memory is reclaimed. But
> what they found is that the speed is quite inconsistent and sometimes it
> becomes very slower than the expectation, sometimes even no reclamation
> at all for about tens of seconds. The upper limit of the speed (100 MiB
> per second) was being kept as expected, though.
>
> By monitoring the qt_exceeds (number of DAMOS quota exceed events) DAMOS
> stat, we found DAMOS quota is always exceeded when the speed is slow. By
> monitoring sz_tried and sz_applied (the total amount of DAMOS action
> tried memory and succeeded memory) DAMOS stats together, we found the
> reclamation attempts nearly always failed when the speed is slow.
>
> DAMOS quota charges DAMOS action tried regions regardless of the
> successfulness of the try. Hence in the example reported case, there
> was unreclaimable memory spread around the system memory. Sometimes
> nearly 100 MiB of memory that DAMOS tried to reclaim in the given quota
> interval was reclaimable, and therefore showed nearly 100 MiB per second
> speed.
> Sometimes nearly 99 MiB of memory that DAMOS was trying to
> reclaim in the given quota interval was unreclaimable, and therefore
> showing only about 1 MiB per second reclaim speed.
>
> We explained it is an expected behavior of the feature rather than a
> bug, as DAMOS quota is there for only the upper-limit of the speed. The
> users agreed and later reported a huge win from the adoption of
> DAMON_RECLAIM on their products.

Thanks for this series. This is a problem I have come across, and I am
looking forward to seeing this land.

> It is Not a Bug but a Feature; But...
> =====================================
>
> So nothing is broken. DAMOS quota is working as intended, as the upper
> limit of the speed. It also provides its behavior observability via
> DAMOS stat. In the real world production environment that runs long
> term active workloads and matters stability, the speed sometimes being
> slow is not a real problem.
>
> But, the non-deterministic behavior is sometimes annoying, especially in
> lab environments. Even in a realistic production environment, when
> there is a huge amount of DAMOS action unapplicable memory, the speed
> could be problematically slow. Let's suppose a virtual machines
> provider that setup 99% of the host memory as hugetlb pages that cannot
> be reclaimed, to give it to virtual machines. Also, when aim-oriented
> DAMOS auto-tuning is applied, this could also make the internal feedback
> loop confused.
>
> The intention of the current behavior was that trying DAMOS action to
> regions would anyway impose some overhead, and therefore somehow be
> charged. But in the real world, the overhead for failed action is much
> lighter than successful action. Charging those at the same ratio may be
> unfair, or at least suboptimum in some environments.
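
As an illustration of the accounting described in the quoted report, here
is a toy model in Python. It is not kernel code: the function name, the
(size, reclaimable) region tuples, and the assumption that the reclaimable
bytes of a region are tried first are all hypothetical simplifications. It
only shows how charging every tried byte lets unreclaimable memory use up
the quota.

```python
MIB = 1024 * 1024


def applied_in_interval(quota_bytes, regions):
    """Toy model of one quota interval (hypothetical, not the kernel logic).

    regions is a list of (size_bytes, reclaimable_bytes) tuples in the
    order the action is tried. Every tried byte is charged against the
    quota, whether or not the action succeeded on it.
    Returns (charged_bytes, applied_bytes).
    """
    charged = 0
    applied = 0
    for size, reclaimable in regions:
        if charged >= quota_bytes:
            break  # quota exceeded: nothing more is tried in this interval
        tried = min(size, quota_bytes - charged)
        charged += tried                    # charged regardless of success
        applied += min(reclaimable, tried)  # only this part is reclaimed
    return charged, applied


# 99 MiB of unreclaimable memory is tried first, so only 1 MiB of the
# 100 MiB quota ends up being spent on memory that can be reclaimed.
charged, applied = applied_in_interval(
    100 * MIB,
    [(99 * MIB, 0), (1 * MIB, 1 * MIB), (100 * MIB, 100 * MIB)],
)
print(charged // MIB, applied // MIB)
```

In this model the quota is fully charged in both the fast and the slow
case; only the applied amount differs, which matches the qt_exceeds,
sz_tried and sz_applied observations quoted above.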
>
> DAMOS Action Failed Region Quota Charge Ratio
> =============================================
>
> Let users set the charge ratio for the action-failed memory, for more
> optimal and deterministic use of DAMOS. It allows users to specify the
> numerator and the denominator of the ratio for flexible setup. For
> example, let's suppose the numerator and the denominator are set to 1
> and 4,096, respectively. The ratio is 1 / 4,096. A DAMOS scheme action
> is applied to 5 GiB memory. For 1 GiB of the memory, the action is
> succeeded. For the rest (4 GiB), the action is failed. Then, only 1
> GiB and 1 MiB quota is charged.
>
> The optimal charge ratio will depend on the use case and
> system/workload. I'd recommend starting from setting the nominator as 1
> and the denominator as PAGE_SIZE and tune based on the results, because
> many DAMOS actions are applied at page level.

This makes sense, but the quota is also considered when setting the minimum
allowable score in damos_adjust_quota(), which, to my understanding, assumes
that the action will be applied to all of a region's data. If an action
fails for a significant amount of the memory, a lower score than what was
calculated in damos_adjust_quota() could be valid. If that's the case, the
scheme would be applied to fewer regions than strictly necessary.

As you mention above, this is not a correctness issue because the quota only
guarantees an upper limit on the amount of data the scheme is applied to.
Additionally, it may very well be true that what I described above would not
be very noticeable in practice. I just thought this was worth pointing out
as something to think about.

Thanks,
Bijan

<snip>

Sent using hkml (https://github.com/sjp38/hackermail)
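
For reference, the charge arithmetic in the quoted example (5 GiB tried,
1 GiB succeeded, failed-region ratio 1 / 4,096) can be checked with a
short Python sketch. The function and parameter names are hypothetical and
are not taken from the patches, and integer division is an assumption
about rounding; only the numbers come from the quoted text.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def charged_bytes(sz_tried, sz_applied, fail_numerator, fail_denominator):
    """Quota charge when failed bytes are charged at a reduced ratio.

    Hypothetical helper: succeeded bytes are charged in full, failed
    bytes are scaled by fail_numerator / fail_denominator.
    """
    sz_failed = sz_tried - sz_applied
    return sz_applied + sz_failed * fail_numerator // fail_denominator


# 1 GiB succeeded + 4 GiB failed * (1 / 4096) = 1 GiB + 1 MiB
charge = charged_bytes(5 * GIB, 1 * GIB, 1, 4096)
print(charge == GIB + MIB)
```

With a ratio of 1 / 1 the same helper reproduces the behavior described
as current in the quoted text, where the full 5 GiB would be charged.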

