Hi Roland,

On Tue, 19 May 2009 15:01:13 -0700
Roland Dreier <[email protected]> wrote:

> > QP buffers are allocated with mlx4_alloc_buf(), which rounds the buffers
> > size to the page size and then allocates page aligned memory using
> > posix_memalign().
> >
> > However, this allocation is quite wasteful on architectures using 64K pages
> > (ia64 for example) because we then hit glibc's MMAP_THRESHOLD malloc
> > parameter and chunks are allocated using mmap. thus we end up allocating:
> >
> > (requested size rounded to the page size) + (page size) + (malloc overhead)
> >
> > rounded internally to the page size.
> >
> > So for example, if we request a buffer of page_size bytes, we end up
> > consuming 3 pages. In short, for each QP buffer we allocate, there is an
> > overhead of 2 pages. This is quite visible on large clusters especially where
> > the number of QP can reach several thousands.
> >
> > This patch creates a new function mlx4_alloc_page() for use by
> > mlx4_alloc_qp_buf() that does an mmap() instead of a posix_memalign() when
> > the page size is 64K.
>
> makes sense I guess.  It would be nice if glibc() were smart enough to
> know that mmap(MAP_ANONYMOUS) is going to give something page-aligned
> anyway,

If you mean in the posix_memalign() path, then yes it'd be really nice.

> but it seems that malloc overhead (required to make the memory
> from posix_memalign() work with free()) is going to cost at least one
> extra page, which as you point out is pretty bad with 64KB pages.  (Of
> course 64KB pages are a disaster for any workload that deals with small
> objects of any kind, but that's another story)

Yep, agreed.

> However I wonder why we want to make this optimization only for 64KB
> pages.  It seems the code would be simpler if we just had our own
> page-aligned allocator using mmap(MAP_ANONYMOUS) and just used it
> unconditionally everywhere.  Or is it not actually better even on
> sane-sized (ie 4KB) page systems?  It seems we still have the malloc
> overhead which is going to cost us another page?

Well, not really: if we stay below MMAP_THRESHOLD, as we do with 4K pages,
the only overhead is malloc's chaining structure. The extra space used to
align the buffer is released before posix_memalign() returns, although that
does increase fragmentation of malloc's chunks.

Also, with 4K pages, mmap() always results in a syscall whereas
posix_memalign() does not necessarily. But as we're not on a fast path, I'm
not sure which would be best.

I don't mind converting all QP buffer allocations to mmap(), but I'd like
to hear what people think.

Thanks Roland,

Sebastien.
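
For reference, the "own page-aligned allocator using mmap(MAP_ANONYMOUS)"
idea discussed above could look roughly like the sketch below. This is only
an illustration under stated assumptions: the names round_up_to_page(),
page_alloc() and page_free() are made up here and are not the functions
from the patch, and error handling is kept minimal.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: round size up to a multiple of page_size. */
static size_t round_up_to_page(size_t size, size_t page_size)
{
    assert(page_size != 0);
    return (size + page_size - 1) / page_size * page_size;
}

/*
 * Hypothetical allocator: returns page-aligned, zero-filled memory obtained
 * directly from mmap(), so no malloc bookkeeping is attached to the buffer.
 * The mapped length is stored in *mapped_len because munmap() needs it.
 * Returns NULL on failure.
 */
static void *page_alloc(size_t size, size_t *mapped_len)
{
    size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
    size_t len;
    void *p;

    /* Reject empty requests and sizes that would overflow when rounded. */
    if (size == 0 || size > SIZE_MAX - page_size)
        return NULL;

    len = round_up_to_page(size, page_size);
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    *mapped_len = len;
    return p;
}

/* Hypothetical counterpart: the caller passes back the mapped length. */
static void page_free(void *p, size_t mapped_len)
{
    if (p != NULL)
        munmap(p, mapped_len);
}
```

The trade-off is the one described above: every page_alloc() call is a
syscall, and the caller has to remember the length for page_free(), but a
request of one page consumes exactly one page whatever the page size is.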
