On Thu, Oct 23, 2025 at 1:44 AM Pedro Falcato <[email protected]> wrote: > > On Thu, Oct 23, 2025 at 09:00:10AM +0100, Lorenzo Stoakes wrote: > > On Wed, Oct 22, 2025 at 10:22:08PM +0200, David Hildenbrand wrote: > > > On 22.10.25 21:52, Christoph Lameter (Ampere) wrote: > > > > On Wed, 22 Oct 2025, Nico Pache wrote: > > > > > > > > > Currently, madvise_collapse only supports collapsing to PMD-sized > > > > > THPs + > > > > > and does not attempt mTHP collapses. + > > > > > > > > madvise collapse is frequently used as far as I can tell from the THP > > > > loads being tested. Could we support madvise collapse for mTHP? > > > > > > The big question is still how user space can communicate the desired > > > order, > > > and how we can not break existing users. > > > > Do we want to let userspace communicate order? It seems like an extremely > specific thing to do. A more simple&sane semantic could be something like: > "MADV_COLLAPSE collapses a given [addr, addr+len] range into the highest > order THP it can/thinks it should.". The implementation details of PMD or > contpte or <...> are lost by the time we get to userspace. > > The man page itself is pretty vaguely written to allow us to do whatever > we want. It sounds to me that allowing userspace to create arbitrary order > mTHPs would be another pandora's box we shouldn't get into. > > > Yes, and let's go one step at a time, this series still needs careful > > scrutiny > > and we need to ensure the _fundamentals_ are in place for khugepaged before > > we > > get into MADV_COLLAPSE :) > > > > > > > > So I guess there will definitely be some support to trigger collapse to > > > mTHP > > > in the future, the big question is through which interface. So it will > > > happen after this series. > > > > Yes. > > > > > > > > Maybe through process_madvise() where we have an additional parameter, I > > > think that was what people discussed in the past. > > > > I wouldn't absolutely love us doing that, given it is a general parameter so > > would seem applicable to any madvise() option and could lead to confusion, > > also > > process_madvise() was originally for cross-process madvise vector > > operations. > > For what it's worth, it would probably not be too hard to devise a generic > separation there between "generic flags" and "behavior-specific flags". > And then stuff the desired THP order into MADV_COLLAPSE-specific flags.
Yeah, this is how I envisioned the flags to be leveraged; reserve some number of bits for generic, and overload the others for advice-specific. I suspect once the seal is broken on this, more advice-specific flags will promptly follow. > > > > I expanded this to make it applicable to the current process (and introduced > > PIDFD_SELF to make that more sane), and SJ has optimised it across vector > > operations (thanks SJ! :), but in general - it seems very weird to have > > madvise() provide an operation that process_madvise() providse another > > version > > of that has an extra parameter. > > > > As usual we've painted ourselves into a corner with an API... :) > > But yes, I agree it would feel weird. > > > > > Perhaps we'll to accept the process_madvise() compromise and add > > MADV_COLLAPSE_MHTP that only works with it or something. > > > > Of course adding a new syscall isn't impossible... madvise2() not very > > appealing > > however... > > It is my impression that process_madvise() is already madvise2(), but > poorly named. +1 > > > > TL;DR I guess we'll deal with that when we come to it :) > > Amen :) > > -- > Pedro
