Re: [Python-Dev] [numpy wishlist] PyMem_*Calloc
On Wed, Apr 16, 2014 at 12:51 PM, Julian Taylor wrote:
> Hi,
> In NumPy what we want is the tracing, not the exchangeable allocators.
> I don't think it is a good idea for the core of a whole stack of
> C-extension based modules to replace the default allocator or allowing
> other modules to replace the allocator NumPy uses.

I don't think modules are ever supposed to replace the underlying allocator itself -- and it'd be very difficult to do this safely, since by the time any modules are imported there are already active allocations floating around. I think the allocator replacement functionality is designed to be used by applications embedding Python, which can set up a special allocator before the interpreter starts.

I'm not sure exactly why one would need to swap out malloc and friends for something else, so I can't really judge, but it does at least seem plausible that if someone is taking the trouble to swap out the allocator like this, then numpy should respect that and use the new allocator.

-n

--
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org

___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] [numpy wishlist] PyMem_*Calloc
On Wed, Apr 16, 2014 at 7:35 PM, Victor Stinner wrote:
> Hi,
>
> 2014-04-16 7:51 GMT-04:00 Julian Taylor :
>> In NumPy what we want is the tracing, not the exchangeable allocators.
>
> Did you read the PEP 445? Using the new malloc API, in fact you can
> have both: install new allocators and set up hooks on allocators.
> http://legacy.python.org/dev/peps/pep-0445/

The context here is that there's been some followup discussion on the numpy list about whether there are cases where we need even more exotic memory allocators than calloc(), and what to do about it if so. (Thread: http://mail.scipy.org/pipermail/numpy-discussion/2014-April/069935.html )

One case that has come up is when efficient use of SIMD instructions requires better-than-default alignment (e.g., malloc() usually gives something like 8-byte alignment, but if you're using an instruction that operates on 32 bytes at once, you might need your array to have 32-byte alignment). Most (all?) OSes provide an extended version of malloc that allows one to request more alignment (posix_memalign on POSIX, _aligned_malloc on Windows), and C11 standardizes this as aligned_alloc. An important feature of these functions is that they allocate from the same heap that malloc does, i.e., when done with the aligned memory you just call free() -- there's no such thing as aligned_free(). This means that if your program uses these functions, then swapping out malloc/free without also swapping out aligned_alloc will produce undesirable results.

Numpy does not currently use aligned allocation, and it's not clear how important it is: on older x86 it matters, but not so much on current CPUs; when the next round of x86 SIMD instructions is released next year it might matter again; apparently on popular IBM supercomputers it matters (but less on newer versions) [1,2]; and who knows what will happen with ARM. It's a bit of a mess. But if we're messing about with APIs, it seems worth thinking about.
[1] http://mail.scipy.org/pipermail/numpy-discussion/2014-April/069965.html
[2] http://mail.scipy.org/pipermail/numpy-discussion/2014-April/069967.html

A second possible use case is:

>> my_hugetlb_alloc(size)
>>   p = mmap('hugepagefs', ..., MAP_HUGETLB);
>>   PyMem_Register_Alloc(p, size, __func__, __line__);
>>   return p
>>
>> my_hugetlb_free(p);
>>   PyMem_Register_Free(p, __func__, __line__);
>>   munmap(p, ...);
>
> This is exactly how tracemalloc works. The advantage of the PEP 445 is
> that you have a null overhead when tracemalloc is disabled. There is
> no need to check if a trace function is present or not.

I think the key thing about this example is that you would *never* want to use MAP_HUGETLB as a generic replacement for malloc(). Huge pages can have all kinds of weird quirky limitations, and are certainly unsuited for small allocations. BUT they can provide huge speed wins if used for certain specific allocations in certain programs. (In case anyone needs a reminder what "huge pages" even are: http://lwn.net/Articles/374424/)

If I wrote a Python library to make it easy to use huge pages with numpy, then I might well want the allocations I was making to be visible to tracemalloc, even though they would not be going through malloc/free. (For that matter -- should calls to os.mmap be calling some tracemalloc hook in general? There are lots of cases where mmap is really doing memory allocation -- it's very useful for shared memory and stuff too.)

---

My current impression is something like:

- From the bug report discussion it sounds like calloc() is useful even in core Python, so it makes sense to go ahead with that regardless.

- Now that aligned_alloc has been standardized, it might make sense to add it to the PyMemAllocator struct too.

- And it might also make sense to have an API by which a Python library can say to tracemalloc: "hey FYI I just allocated something using my favorite weird exotic method", like in the huge pages example.
This is a fully generic mechanism, so it could act as a kind of "safety valve" for future weirdnesses. All numpy *needs* to support its current and immediately foreseeable usage is calloc(). But I'm a bit nervous about getting trapped -- if the PyMem_* machinery implements calloc(), and we switch to using it and advertise tracemalloc support to our users, and then later it turns out that we need aligned_alloc or similar, then we'll be stuck unless and until we can get at least one of these other changes into CPython upstream, and that will suck for all of us.

-n

--
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org
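(As an aside on the aligned-allocation question above: when no aligned allocator is available, the classic workaround is to over-allocate by `alignment - 1` bytes and round the pointer up to the next boundary. This is a sketch of that arithmetic only, using ctypes just to get a real address; the buffer size and 32-byte alignment are arbitrary illustration values, not anything numpy actually does today.)

```python
import ctypes

ALIGN = 32  # e.g. for SIMD instructions operating on 32 bytes at once

# Over-allocate by ALIGN - 1 bytes so that some address inside the
# buffer is guaranteed to fall on a 32-byte boundary.
raw = ctypes.create_string_buffer(1024 + ALIGN - 1)
addr = ctypes.addressof(raw)

# Round up to the next multiple of ALIGN (ALIGN must be a power of two).
aligned_addr = (addr + ALIGN - 1) & ~(ALIGN - 1)

assert aligned_addr % ALIGN == 0
assert addr <= aligned_addr < addr + ALIGN
```

The catch is exactly the bookkeeping posix_memalign/aligned_alloc avoid: you must remember the original pointer in order to free, since the aligned pointer is not itself a valid argument to free().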
Re: [Python-Dev] [numpy wishlist] PyMem_*Calloc
Hi,

2014-04-16 7:51 GMT-04:00 Julian Taylor :
> In NumPy what we want is the tracing, not the exchangeable allocators.

Did you read the PEP 445? Using the new malloc API, in fact you can have both: install new allocators and set up hooks on allocators.
http://legacy.python.org/dev/peps/pep-0445/

The PEP 445 has been implemented in Python 3.4; we don't plan to rewrite it. So it's probably better to try to understand how it was designed and why we chose this design. See the talk I just gave at Pycon Montreal for more information on how tracemalloc works.

Slides: https://raw.githubusercontent.com/haypo/conf/master/2014-Pycon-Montreal/tracemalloc.pdf
Video: http://pyvideo.org/video/2698/track-memory-leaks-in-python

> my_hugetlb_alloc(size)
>   p = mmap('hugepagefs', ..., MAP_HUGETLB);
>   PyMem_Register_Alloc(p, size, __func__, __line__);
>   return p
>
> my_hugetlb_free(p);
>   PyMem_Register_Free(p, __func__, __line__);
>   munmap(p, ...);

This is exactly how tracemalloc works. The advantage of the PEP 445 is that you have a null overhead when tracemalloc is disabled. There is no need to check if a trace function is present or not. You can chain multiple hooks.

See also the calloc issue which was written for NumPy:
http://bugs.python.org/issue21233

Victor
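(The hugetlb sketch quoted above allocates with mmap(), and that is exactly the kind of memory tracemalloc cannot see today without a registration hook: mappings bypass the PyMem_* allocators entirely. A small standard-library demonstration of the gap, with an ordinary anonymous mapping standing in for the hugepage case:)

```python
import mmap
import tracemalloc

tracemalloc.start()

# An anonymous 16 MiB mapping: this goes straight to mmap(2) /
# VirtualAlloc, not through PyMem_Malloc, so the mapped region is
# invisible to tracemalloc. Only the small Python wrapper object
# (and whatever else ran meanwhile) gets traced.
m = mmap.mmap(-1, 16 * 1024 * 1024)

current, _peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
m.close()

# Traced memory is far below the 16 MiB actually allocated.
assert current < 16 * 1024 * 1024
```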
[Python-Dev] [numpy wishlist] PyMem_*Calloc
Hi,

In NumPy what we want is the tracing, not the exchangeable allocators. I don't think it is a good idea for the core of a whole stack of C-extension based modules to replace the default allocator, or to allow other modules to replace the allocator NumPy uses.

I think it would be more useful if Python provided functions to register memory allocations and frees, and the tracemalloc module registered handlers for these register functions. If no allocation tracer is registered, the functions just return immediately. That way tracemalloc can be used with arbitrary allocators, as long as they register their allocations with Python. For example, a hugepage allocator: you would not want to use that as the default allocator for all Python objects, but you may still want to trace its usage:

my_hugetlb_alloc(size)
  p = mmap('hugepagefs', ..., MAP_HUGETLB);
  PyMem_Register_Alloc(p, size, __func__, __line__);
  return p

my_hugetlb_free(p);
  PyMem_Register_Free(p, __func__, __line__);
  munmap(p, ...);

Normally the register calls are no-ops, but if tracemalloc did register tracers, the memory is tracked. E.g., the trace module would do this on start():

tracercontext.register_alloc = trace_alloc
tracercontext.register_free = trace_free
tracercontext.data = mycontext
PyMem_SetTracer(&tracercontext)

Regards,
Julian Taylor
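(Julian's proposed API is C, and the PyMem_Register_* / PyMem_SetTracer names are his proposal, not an existing CPython API. The control flow he describes -- registration is a no-op until a tracer is installed -- can be sketched as a toy Python analogue, with integers standing in for pointers:)

```python
# Toy analogue of the proposed registration API. All names here
# mirror Julian's hypothetical C functions; none of this exists in
# CPython. The point is only the control flow: register_* do nothing
# until a tracer is installed.

_tracer = None  # installed by a tracemalloc-like module

def set_tracer(tracer):
    """Analogue of the proposed PyMem_SetTracer()."""
    global _tracer
    _tracer = tracer

def register_alloc(ptr, size):
    """Analogue of PyMem_Register_Alloc(): no-op unless tracing."""
    if _tracer is not None:
        _tracer[ptr] = size

def register_free(ptr):
    """Analogue of PyMem_Register_Free(): no-op unless tracing."""
    if _tracer is not None:
        _tracer.pop(ptr, None)

def my_hugepage_alloc(size, _next_addr=[0x1000]):
    # Stand-in for mmap(..., MAP_HUGETLB); fake monotonically
    # increasing addresses instead of real pointers.
    ptr = _next_addr[0]
    _next_addr[0] += size
    register_alloc(ptr, size)
    return ptr

def my_hugepage_free(ptr):
    register_free(ptr)

p = my_hugepage_alloc(4096)        # tracer off: register is a no-op

traces = {}
set_tracer(traces)                 # like tracemalloc.start()
q = my_hugepage_alloc(2 * 1024 * 1024)
assert traces == {q: 2 * 1024 * 1024}   # the allocation was tracked
my_hugepage_free(q)
assert traces == {}
```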
Re: [Python-Dev] [numpy wishlist] PyMem_*Calloc
On Tue, Apr 15, 2014 at 9:31 AM, Charles-François Natali wrote:
> Indeed, that's very reasonable.
>
> Please open an issue on the tracker!

Done! http://bugs.python.org/issue21233

I'll ping numpy-discussion and see if I can convince someone to do the work ;-).

-n

--
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org
Re: [Python-Dev] [numpy wishlist] PyMem_*Calloc
Hi,

2014-04-14 1:39 GMT-04:00 Nathaniel Smith :
> The new tracemalloc infrastructure in python 3.4 is super-interesting
> to numerical folks, because we really like memory profiling.

Cool, thanks :-)

> calloc() is more awesome than malloc()+memset() (...)

I had a discussion with someone about tracemalloc and numpy at Pycon, was it you? After this discussion, I realized that calloc() exists because the operating system can have a very efficient implementation of calloc() (as you described).

> SO, we'd like to route our allocations through PyMem_* in order to let
> tracemalloc "see" them, but because there is no PyMem_*Calloc, doing
> this would force us to give up on the calloc() optimizations.

It would also be useful in Python itself, because in many places Python uses memset() to fill memory with zeros.

> The obvious solution is to add a PyMem_*Calloc to the API. Would this
> be possible? Unfortunately it would require adding a new field to the
> PyMemAllocator struct, which would be an ABI/API break; PyMemAllocator
> is exposed directly in the C API and passed by value:
> https://docs.python.org/3.5/c-api/memory.html#c.PyMemAllocator

I don't want to change the structure in Python 3.4, but I'm interested in implementing the change in Python 3.5. Please open an issue and add me to the nosy list.

For Python 3.4, you can maybe add a compilation flag to use the Python allocators but reimplement calloc(), which will be slower as you explained.

Victor
Re: [Python-Dev] [numpy wishlist] PyMem_*Calloc
Indeed, that's very reasonable.

Please open an issue on the tracker!
Re: [Python-Dev] [numpy wishlist] PyMem_*Calloc
On 04/14/2014 08:36 AM, Benjamin Peterson wrote:
> On Sun, Apr 13, 2014, at 22:39, Nathaniel Smith wrote:
>> SO, we'd like to route our allocations through PyMem_* in order to let
>> tracemalloc "see" them, but because there is no PyMem_*Calloc, doing
>> this would force us to give up on the calloc() optimizations.
>
> Well, the allocator API is not part of the stable ABI, so we can change
> it if we want.
>
>> Thoughts?
>
> I think the request is completely reasonable.

+1

--
~Ethan~
Re: [Python-Dev] [numpy wishlist] PyMem_*Calloc
On Sun, Apr 13, 2014, at 22:39, Nathaniel Smith wrote:
> Hi all,
>
> The new tracemalloc infrastructure in python 3.4 is super-interesting
> to numerical folks, because we really like memory profiling. Numerical
> programs allocate a lot of memory, and sometimes it's not clear which
> operations allocate memory (some numpy operations return views of the
> original array without allocating anything; others return copies). So
> people actually use memory tracking tools [1], even though
> traditionally these have been pretty hacky (i.e., just checking RSS
> before and after each line is executed), and numpy has even grown its
> own little tracemalloc-like infrastructure [2], but it only works for
> numpy data.
>
> BUT, we also really like calloc(). One of the basic array creation
> routines in numpy is numpy.zeros(), which returns an array full of --
> you guessed it -- zeros. For pretty much all the data types numpy
> supports, the value zero is represented by the bytestring consisting
> of all zeros. So numpy.zeros() usually uses calloc() to allocate its
> memory.
>
> calloc() is more awesome than malloc()+memset() for two reasons.
> First, calloc() for larger allocations is usually implemented using
> clever VM tricks, so that it doesn't actually allocate any memory up
> front; it just creates a COW mapping of the system zero page and then
> does the actual allocation one page at a time as different entries are
> written to. This means that in the somewhat common case where you
> allocate a large array full of zeros, and then only set a few
> scattered entries to non-zero values, you can end up using much, much
> less memory than otherwise. It's entirely possible for this to make
> the difference between being able to run an analysis versus not.
> memset() forces the whole amount of RAM to be committed immediately.
>
> Secondly, even if you *are* going to touch all the memory, then
> calloc() is still faster than malloc()+memset(). The reason is that
> for large allocations, malloc() usually does a calloc() no matter what
> -- when you get a new page from the kernel, the kernel has to make
> sure you can't see random bits of other processes' memory, so it
> unconditionally zeros out the page before you get to see it. calloc()
> knows this, so it doesn't bother zeroing it again. malloc()+memset(),
> by contrast, zeros the page twice, producing twice as much memory
> traffic, which is huge.
>
> SO, we'd like to route our allocations through PyMem_* in order to let
> tracemalloc "see" them, but because there is no PyMem_*Calloc, doing
> this would force us to give up on the calloc() optimizations.
>
> The obvious solution is to add a PyMem_*Calloc to the API. Would this
> be possible? Unfortunately it would require adding a new field to the
> PyMemAllocator struct, which would be an ABI/API break; PyMemAllocator
> is exposed directly in the C API and passed by value:
> https://docs.python.org/3.5/c-api/memory.html#c.PyMemAllocator
> (Too bad we didn't notice this a few months ago before 3.4 was
> released :-(.) I guess we could just rename the struct in 3.5, to
> force people to update their code. (I guess there aren't too many
> people who would have to update their code.)

Well, the allocator API is not part of the stable ABI, so we can change it if we want.

> Thoughts?

I think the request is completely reasonable.
[Python-Dev] [numpy wishlist] PyMem_*Calloc
Hi all,

The new tracemalloc infrastructure in python 3.4 is super-interesting to numerical folks, because we really like memory profiling. Numerical programs allocate a lot of memory, and sometimes it's not clear which operations allocate memory (some numpy operations return views of the original array without allocating anything; others return copies). So people actually use memory tracking tools [1], even though traditionally these have been pretty hacky (i.e., just checking RSS before and after each line is executed), and numpy has even grown its own little tracemalloc-like infrastructure [2], but it only works for numpy data.

BUT, we also really like calloc(). One of the basic array creation routines in numpy is numpy.zeros(), which returns an array full of -- you guessed it -- zeros. For pretty much all the data types numpy supports, the value zero is represented by the bytestring consisting of all zeros. So numpy.zeros() usually uses calloc() to allocate its memory.

calloc() is more awesome than malloc()+memset() for two reasons. First, calloc() for larger allocations is usually implemented using clever VM tricks, so that it doesn't actually allocate any memory up front; it just creates a COW mapping of the system zero page and then does the actual allocation one page at a time as different entries are written to. This means that in the somewhat common case where you allocate a large array full of zeros, and then only set a few scattered entries to non-zero values, you can end up using much, much less memory than otherwise. It's entirely possible for this to make the difference between being able to run an analysis versus not. memset() forces the whole amount of RAM to be committed immediately.

Secondly, even if you *are* going to touch all the memory, then calloc() is still faster than malloc()+memset().
The reason is that for large allocations, malloc() usually does a calloc() no matter what -- when you get a new page from the kernel, the kernel has to make sure you can't see random bits of other processes' memory, so it unconditionally zeros out the page before you get to see it. calloc() knows this, so it doesn't bother zeroing it again. malloc()+memset(), by contrast, zeros the page twice, producing twice as much memory traffic, which is huge.

SO, we'd like to route our allocations through PyMem_* in order to let tracemalloc "see" them, but because there is no PyMem_*Calloc, doing this would force us to give up on the calloc() optimizations.

The obvious solution is to add a PyMem_*Calloc to the API. Would this be possible? Unfortunately it would require adding a new field to the PyMemAllocator struct, which would be an ABI/API break; PyMemAllocator is exposed directly in the C API and passed by value:
https://docs.python.org/3.5/c-api/memory.html#c.PyMemAllocator
(Too bad we didn't notice this a few months ago before 3.4 was released :-(.) I guess we could just rename the struct in 3.5, to force people to update their code. (I guess there aren't too many people who would have to update their code.)

Thoughts?

-n

[1] http://scikit-learn.org/stable/developers/performance.html#memory-usage-profiling
[2] https://github.com/numpy/numpy/pull/309

--
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org
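(For contrast with allocations tracemalloc cannot see: memory that does go through CPython's allocators is fully visible to it, which is exactly what routing numpy's allocations through PyMem_* would buy. A minimal standard-library sketch -- the 10 MiB size is an arbitrary illustration value:)

```python
import tracemalloc

tracemalloc.start()

# A bytearray's buffer is allocated through CPython's PyMem/pymalloc
# layer, so tracemalloc accounts for every byte of it.
buf = bytearray(10 * 1024 * 1024)  # 10 MiB of zeros

current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# The whole buffer shows up in the traced total.
assert current >= 10 * 1024 * 1024
```

An allocation made with a private calloc() inside an extension module, by contrast, never reaches these hooks -- which is the whole motivation for PyMem_*Calloc.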