On Sun, Apr 13, 2014, at 22:39, Nathaniel Smith wrote:
> Hi all,
>
> The new tracemalloc infrastructure in python 3.4 is super-interesting
> to numerical folks, because we really like memory profiling. Numerical
> programs allocate a lot of memory, and sometimes it's not clear which
> operations allocate memory (some numpy operations return views of the
> original array without allocating anything; others return copies). So
> people actually use memory tracking tools[1], even though
> traditionally these have been pretty hacky (i.e., just checking RSS
> before and after each line is executed), and numpy has even grown its
> own little tracemalloc-like infrastructure [2], but it only works for
> numpy data.
>
> BUT, we also really like calloc(). One of the basic array creation
> routines in numpy is numpy.zeros(), which returns an array full of --
> you guessed it -- zeros. For pretty much all the data types numpy
> supports, the value zero is represented by the bytestring consisting
> of all zeros. So numpy.zeros() usually uses calloc() to allocate its
> memory.
>
> calloc() is more awesome than malloc()+memset() for two reasons.
> First, calloc() for larger allocations is usually implemented using
> clever VM tricks, so that it doesn't actually allocate any memory up
> front, it just creates a COW mapping of the system zero page and then
> does the actual allocation one page at a time as different entries are
> written to. This means that in the somewhat common case where you
> allocate a large array full of zeros, and then only set a few
> scattered entries to non-zero values, you can end up using much much
> less memory than otherwise. It's entirely possible for this to make
> the difference between being able to run an analysis versus not.
> memset() forces the whole amount of RAM to be committed immediately.
>
> Secondly, even if you *are* going to touch all the memory, then
> calloc() is still faster than malloc()+memset().
> The reason is that
> for large allocations, malloc() usually does a calloc() no matter what
> -- when you get a new page from the kernel, the kernel has to make
> sure you can't see random bits of other processes' memory, so it
> unconditionally zeros out the page before you get to see it. calloc()
> knows this, so it doesn't bother zeroing it again. malloc()+memset(),
> by contrast, zeros the page twice, producing twice as much memory
> traffic, which is huge.
>
> SO, we'd like to route our allocations through PyMem_* in order to let
> tracemalloc "see" them, but because there is no PyMem_*Calloc, doing
> this would force us to give up on the calloc() optimizations.
>
> The obvious solution is to add a PyMem_*Calloc to the API. Would this
> be possible? Unfortunately it would require adding a new field to the
> PyMemAllocator struct, which would be an ABI/API break; PyMemAllocator
> is exposed directly in the C API and passed by value:
> https://docs.python.org/3.5/c-api/memory.html#c.PyMemAllocator
> (Too bad we didn't notice this a few months ago before 3.4 was
> released :-(.) I guess we could just rename the struct in 3.5, to
> force people to update their code. (I guess there aren't too many
> people who would have to update their code.)
Well, the allocator API is not part of the stable ABI, so we can change
it if we want.

> Thoughts?

I think the request is completely reasonable.
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com