Hi Andrew!

On 2023-02-16T23:06:44+0100, I wrote:
> On 2023-02-16T16:17:32+0000, "Stubbs, Andrew via Gcc-patches"
> <gcc-patches@gcc.gnu.org> wrote:
>> The mmap implementation was not optimized for a lot of small allocations,
>> and I can't see that issue changing here
>
> That's correct, 'mmap' remains.  Under the hood, 'cuMemHostRegister' must
> surely also be doing some 'mlock'-like thing, so I figured it's best to
> feed page-boundary memory regions to it, which 'mmap' gets us.
>
>> so I don't know if this can be used for mlockall replacement.
>>
>> I had assumed that using the Cuda allocator would fix that limitation.
>
> From what I've read (but no first-hand experiments), there's non-trivial
> overhead with 'cuMemHostRegister' (just like with 'mlock'), so routing
> all small allocations individually through it probably isn't a good idea
> either.  Therefore, I suppose, we'll indeed want to use some local
> allocator if we wish this "optimized for a lot of small allocations".
Eh, I suppose your point indirectly was that instead of 'mmap' plus
'cuMemHostRegister' we ought to use 'cuMemAllocHost'/'cuMemHostAlloc',
as we assume those already do implement such a local allocator.  Let me
quickly change that indeed -- we don't currently have a need to use
'cuMemHostRegister' instead of 'cuMemAllocHost'/'cuMemHostAlloc'.

> And, getting rid of 'mlockall' is yet another topic.

Here, the need to use 'cuMemHostRegister' may then again come up, as
begun to discuss as my "different idea" re "-foffload-memory=pinned",
<https://inbox.sourceware.org/gcc-patches/87sff9zl3u....@euler.schwinge.homeip.net>.
(Let's continue that discussion there.)


Regards
 Thomas


-----------------
Siemens Electronic Design Automation GmbH; Address: Arnulfstraße 201, 80634 München; limited liability company (GmbH); Managing Directors: Thomas Heurung, Frank Thürauf; Registered office: München; Commercial register: München, HRB 106955