[Python-Dev] [numpy wishlist] PyMem_*Calloc

2014-04-14 Thread Nathaniel Smith
Hi all,

The new tracemalloc infrastructure in python 3.4 is super-interesting
to numerical folks, because we really like memory profiling. Numerical
programs allocate a lot of memory, and sometimes it's not clear which
operations allocate memory (some numpy operations return views of the
original array without allocating anything; others return copies). So
people actually use memory tracking tools[1], even though
traditionally these have been pretty hacky (i.e., just checking RSS
before and after each line is executed), and numpy has even grown its
own little tracemalloc-like infrastructure [2], but it only works for
numpy data.

BUT, we also really like calloc(). One of the basic array creation
routines in numpy is numpy.zeros(), which returns an array full of --
you guessed it -- zeros. For pretty much all the data types numpy
supports, the value zero is represented by the bytestring consisting
of all zeros. So numpy.zeros() usually uses calloc() to allocate its
memory.

calloc() is more awesome than malloc()+memset() for two reasons.
First, for larger allocations calloc() is usually implemented using
clever VM tricks, so that it doesn't actually allocate any memory up
front: it just creates a COW mapping of the system zero page and then
does the actual allocation one page at a time, as different entries
are written to. This means that in the somewhat common case where you
allocate a large array full of zeros and then only set a few scattered
entries to non-zero values, you can end up using much, much less
memory than otherwise. It's entirely possible for this to make the
difference between being able to run an analysis and not. memset(), by
contrast, forces the whole allocation to be committed immediately.

Secondly, even if you *are* going to touch all the memory, calloc() is
still faster than malloc()+memset(). The reason is that for large
allocations, malloc() effectively does a calloc() no matter what: when
you get a new page from the kernel, the kernel has to make sure you
can't see random bits of other processes' memory, so it
unconditionally zeros out the page before you get to see it. calloc()
knows this, so it doesn't bother zeroing it again. malloc()+memset(),
by contrast, zeros the page twice, producing twice as much memory
traffic, which is huge.
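
For concreteness, here's a minimal C sketch of the two patterns being
compared (nothing numpy-specific about it):

#include <stdlib.h>
#include <string.h>

/* Zeroed buffer via malloc()+memset(): every page gets written
   immediately, so the whole allocation is committed up front and the
   kernel's zeroing work is effectively done twice. */
double *zeros_memset(size_t n)
{
    double *p = malloc(n * sizeof(double));
    if (p != NULL)
        memset(p, 0, n * sizeof(double));
    return p;
}

/* Zeroed buffer via calloc(): for large n this is typically just a COW
   mapping of the zero page, so pages are committed only when written.
   (calloc() also checks the n * sizeof(double) multiplication for
   overflow.) */
double *zeros_calloc(size_t n)
{
    return calloc(n, sizeof(double));
}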

SO, we'd like to route our allocations through PyMem_* in order to let
tracemalloc "see" them, but because there is no PyMem_*Calloc, doing
this would force us to give up on the calloc() optimizations.
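
To make the gap concrete, here's roughly what routing a zeroed buffer
through the raw allocator has to look like today (just a sketch),
since there's a PyMem_RawMalloc() but no calloc variant:

#include <Python.h>
#include <string.h>

/* A zeroed buffer that tracemalloc can see.  The memset() is the part
   we'd like to avoid: a PyMem_RawCalloc() (which does not exist yet)
   would let the allocator keep the calloc() tricks described above. */
static void *raw_zeros(size_t nbytes)
{
    void *buf = PyMem_RawMalloc(nbytes);
    if (buf != NULL)
        memset(buf, 0, nbytes);
    return buf;
}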

The obvious solution is to add a PyMem_*Calloc to the API. Would this
be possible? Unfortunately it would require adding a new field to the
PyMemAllocator struct, which would be an ABI/API break; PyMemAllocator
is exposed directly in the C API and passed by value:
  https://docs.python.org/3.5/c-api/memory.html#c.PyMemAllocator
(Too bad we didn't notice this a few months ago before 3.4 was
released :-(.) I guess we could just rename the struct in 3.5, to
force people to update their code. (I guess there aren't too many
people who would have to update their code.)
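
For reference, here's the 3.4 struct as defined by PEP 445, plus the
kind of slot the proposal would add (the calloc signature is just my
guess at the obvious shape, not anything that exists today):

typedef struct {
    void *ctx;
    void* (*malloc)(void *ctx, size_t size);
    void* (*realloc)(void *ctx, void *ptr, size_t new_size);
    void (*free)(void *ctx, void *ptr);
    /* proposed addition: */
    void* (*calloc)(void *ctx, size_t nelem, size_t elsize);
} PyMemAllocator;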

Thoughts?

-n

[1] 
http://scikit-learn.org/stable/developers/performance.html#memory-usage-profiling
[2] https://github.com/numpy/numpy/pull/309

-- 
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org


[Python-Dev] [numpy wishlist] PyMem_*Calloc

2014-04-16 Thread Julian Taylor
Hi,
In NumPy what we want is the tracing, not the exchangeable allocators.
I don't think it is a good idea for the core of a whole stack of
C-extension-based modules to replace the default allocator, or to
allow other modules to replace the allocator NumPy uses.

I think it would be more useful if Python provided functions to
register memory allocations and frees, with the tracemalloc module
installing handlers for these registration functions.
If no tracer is registered, the functions just return immediately.
That way tracemalloc can be used with arbitrary allocators, as long as
they register their allocations with Python.

For example, a hugepage allocator: you would not want to use it as the
default allocator for all Python objects, but you may still want to
trace its usage:

my_hugetlb_alloc(size)
{
    p = mmap('hugepagefs', ..., MAP_HUGETLB);
    PyMem_Register_Alloc(p, size, __func__, __LINE__);
    return p;
}

my_hugetlb_free(p)
{
    PyMem_Register_Free(p, __func__, __LINE__);
    munmap(p, ...);
}

Normally these register calls are no-ops, but if tracemalloc has
registered tracers the memory is tracked; e.g. the tracemalloc module
would do something like this in start():

    tracercontext.register_alloc = trace_alloc;
    tracercontext.register_free = trace_free;
    tracercontext.data = mycontext;
    PyMem_SetTracer(&tracercontext);

Regards,
Julian Taylor


Re: [Python-Dev] [numpy wishlist] PyMem_*Calloc

2014-04-14 Thread Benjamin Peterson
On Sun, Apr 13, 2014, at 22:39, Nathaniel Smith wrote:
> The obvious solution is to add a PyMem_*Calloc to the API. Would this
> be possible? Unfortunately it would require adding a new field to the
> PyMemAllocator struct, which would be an ABI/API break; PyMemAllocator
> is exposed directly in the C API and passed by value:
>   https://docs.python.org/3.5/c-api/memory.html#c.PyMemAllocator
> (Too bad we didn't notice this a few months ago before 3.4 was
> released :-(.) I guess we could just rename the struct in 3.5, to
> force people to update their code. (I guess there aren't too many
> people who would have to update their code.)

Well, the allocator API is not part of the stable ABI, so we can change
it if we want.

> 
> Thoughts?

I think the request is completely reasonable.


Re: [Python-Dev] [numpy wishlist] PyMem_*Calloc

2014-04-14 Thread Ethan Furman

On 04/14/2014 08:36 AM, Benjamin Peterson wrote:
> On Sun, Apr 13, 2014, at 22:39, Nathaniel Smith wrote:
>> SO, we'd like to route our allocations through PyMem_* in order to let
>> tracemalloc "see" them, but because there is no PyMem_*Calloc, doing
>> this would force us to give up on the calloc() optimizations.
>
> Well, the allocator API is not part of the stable ABI, so we can change
> it if we want.
>
>> Thoughts?
>
> I think the request is completely reasonable.

+1

--
~Ethan~


Re: [Python-Dev] [numpy wishlist] PyMem_*Calloc

2014-04-15 Thread Charles-François Natali
Indeed, that's very reasonable.

Please open an issue on the tracker!


Re: [Python-Dev] [numpy wishlist] PyMem_*Calloc

2014-04-15 Thread Victor Stinner
  Hi,

2014-04-14 1:39 GMT-04:00 Nathaniel Smith :
> The new tracemalloc infrastructure in python 3.4 is super-interesting
> to numerical folks, because we really like memory profiling.

Cool, thanks :-)

> calloc() is more awesome than malloc()+memset() (...)

I had a discussion with someone about tracemalloc and numpy at Pycon
-- was it you? After that discussion, I realized that calloc() exists
because the operating system can implement it very efficiently (as you
described).

> SO, we'd like to route our allocations through PyMem_* in order to let
> tracemalloc "see" them, but because there is no PyMem_*Calloc, doing
> this would force us to give up on the calloc() optimizations.

It would also be useful in Python because in many places, Python uses
memset() to fill memory with zeros.

> The obvious solution is to add a PyMem_*Calloc to the API. Would this
> be possible? Unfortunately it would require adding a new field to the
> PyMemAllocator struct, which would be an ABI/API break; PyMemAllocator
> is exposed directly in the C API and passed by value:
>   https://docs.python.org/3.5/c-api/memory.html#c.PyMemAllocator

I don't want to change the structure in Python 3.4, but I'm interested
in implementing the change in Python 3.5.

Please open an issue and add me to the nosy list.

For Python 3.4, you can maybe add a compilation flag to use the Python
allocators while reimplementing calloc() on top of them, which will be
slower, as you explained.

Victor


Re: [Python-Dev] [numpy wishlist] PyMem_*Calloc

2014-04-15 Thread Nathaniel Smith
On Tue, Apr 15, 2014 at 9:31 AM, Charles-François Natali
 wrote:
> Indeed, that's very reasonable.
>
> Please open an issue on the tracker!

Done!

http://bugs.python.org/issue21233

I'll ping numpy-discussion and see if I can convince someone to do the work ;-).

-n

-- 
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org


Re: [Python-Dev] [numpy wishlist] PyMem_*Calloc

2014-04-16 Thread Victor Stinner
Hi,

2014-04-16 7:51 GMT-04:00 Julian Taylor :
> In NumPy what we want is the tracing, not the exchangeable allocators.

Did you read PEP 445? Using the new malloc API you can in fact have
both: install new allocators and set up hooks on allocators.
http://legacy.python.org/dev/peps/pep-0445/

PEP 445 has been implemented in Python 3.4 and we don't plan to
rewrite it, so it's probably better to try to understand how it was
designed and why we chose this design.
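
To make the hook part concrete, here is a minimal sketch wrapping the
"raw" domain -- this is the pattern tracemalloc itself uses (error
handling and the other two domains omitted):

#include <Python.h>

static PyMemAllocator orig_raw;

static void *hook_malloc(void *ctx, size_t size)
{
    void *ptr = orig_raw.malloc(orig_raw.ctx, size);
    /* ... record (ptr, size) for tracing ... */
    return ptr;
}

static void *hook_realloc(void *ctx, void *ptr, size_t new_size)
{
    void *new_ptr = orig_raw.realloc(orig_raw.ctx, ptr, new_size);
    /* ... update the trace: ptr -> new_ptr ... */
    return new_ptr;
}

static void hook_free(void *ctx, void *ptr)
{
    /* ... record the free ... */
    orig_raw.free(orig_raw.ctx, ptr);
}

static void install_raw_hook(void)
{
    PyMemAllocator hook;
    hook.ctx = NULL;
    hook.malloc = hook_malloc;
    hook.realloc = hook_realloc;
    hook.free = hook_free;

    /* Keep the previous allocator and delegate to it: hooks chain. */
    PyMem_GetAllocator(PYMEM_DOMAIN_RAW, &orig_raw);
    PyMem_SetAllocator(PYMEM_DOMAIN_RAW, &hook);
}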

See the talk I just gave at Pycon Montreal for more information on how
tracemalloc works. Slides:
https://raw.githubusercontent.com/haypo/conf/master/2014-Pycon-Montreal/tracemalloc.pdf

Video:
http://pyvideo.org/video/2698/track-memory-leaks-in-python

> my_hugetlb_alloc(size)
> p = mmap('hugepagefs', ..., MAP_HUGETLB);
> PyMem_Register_Alloc(p, size, __func__, __line__);
> return p
>
> my_hugetlb_free(p);
> PyMem_Register_Free(p, __func__, __line__);
> munmap(p, ...);

This is exactly how tracemalloc works. The advantage of PEP 445 is
that you have zero overhead when tracemalloc is disabled: there is no
need to check whether a trace function is present or not.

You can also chain multiple hooks.

See also the calloc issue which was written for NumPy:
http://bugs.python.org/issue21233

Victor


Re: [Python-Dev] [numpy wishlist] PyMem_*Calloc

2014-04-17 Thread Nathaniel Smith
On Wed, Apr 16, 2014 at 12:51 PM, Julian Taylor
 wrote:
> Hi,
> In NumPy what we want is the tracing, not the exchangeable allocators.
> I don't think it is a good idea for the core of a whole stack of
> C-extension-based modules to replace the default allocator, or to
> allow other modules to replace the allocator NumPy uses.

I don't think modules are ever supposed to replace the underlying
allocator itself -- and it'd be very difficult to do this safely,
since by the time any modules are imported there are already active
allocations floating around. I think the allocator replacement
functionality is designed to be used by applications embedding Python,
which can set up a special allocator before the interpreter starts.

I'm not sure exactly why one would need to swap out malloc and friends
for something else, so I can't really judge, but it does at least seem
plausible that if someone takes the trouble to swap out the allocator
like this, then numpy should respect that and use the new allocator.

-n

-- 
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org


Re: [Python-Dev] [numpy wishlist] PyMem_*Calloc

2014-04-17 Thread Nathaniel Smith
On Wed, Apr 16, 2014 at 7:35 PM, Victor Stinner
 wrote:
> Hi,
>
> 2014-04-16 7:51 GMT-04:00 Julian Taylor :
>> In NumPy what we want is the tracing, not the exchangeable allocators.
>
> Did you read the PEP 445? Using the new malloc API, in fact you can
> have both: install new allocators and set up hooks on allocators.
> http://legacy.python.org/dev/peps/pep-0445/

The context here is that there's been some followup discussion on the
numpy list about whether there are cases where we need even more
exotic memory allocators than calloc(), and what to do about it if so.

(Thread: http://mail.scipy.org/pipermail/numpy-discussion/2014-April/069935.html
)

One case that has come up is when efficient use of SIMD instructions
requires better-than-default alignment (e.g. malloc() usually gives
something like 8 or 16 byte alignment, but if you're using an
instruction that operates on 32 bytes at once you might need your
array to have 32 byte alignment). Most (all?) OSes provide an extended
version of malloc that allows one to request more alignment
(posix_memalign on POSIX, _aligned_malloc on Windows), and C11
standardizes this as aligned_alloc. An important feature of
posix_memalign and aligned_alloc is that they allocate from the same
heap that malloc() does, i.e., when you're done with the aligned
memory you just call free() -- there's no such thing as
aligned_free(). This means that if your program uses these functions,
then swapping out malloc/free without also swapping out the aligned
allocator will produce undesirable results.
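
For example, a minimal POSIX-only sketch of the pattern (illustration
only, not something numpy does today):

#include <stdlib.h>

/* A 32-byte-aligned buffer suitable for 256-bit SIMD loads/stores.
   Note that the result is released with plain free(). */
static void *alloc_aligned32(size_t nbytes)
{
    void *p = NULL;
    if (posix_memalign(&p, 32, nbytes) != 0)
        return NULL;
    return p;
}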

Numpy does not currently use aligned allocation, and it's not clear
how important it is: on older x86 it matters, but not so much on
current CPUs; when the next round of x86 SIMD instructions is released
next year it might matter again; apparently on popular IBM
supercomputers it matters (though less on newer versions) [1,2]; and
who knows what will happen with ARM. It's a bit of a mess, but if
we're messing about with APIs it seems worth thinking about.

[1] http://mail.scipy.org/pipermail/numpy-discussion/2014-April/069965.html
[2] http://mail.scipy.org/pipermail/numpy-discussion/2014-April/069967.html

A second possible use case is:

>> my_hugetlb_alloc(size)
>> p = mmap('hugepagefs', ..., MAP_HUGETLB);
>> PyMem_Register_Alloc(p, size, __func__, __line__);
>> return p
>>
>> my_hugetlb_free(p);
>> PyMem_Register_Free(p, __func__, __line__);
>> munmap(p, ...);
>
> This is exactly how tracemalloc works. The advantage of the PEP 445 is
> that you have a null overhead when tracemalloc is disabled. There is
> no need to check if a trace function is present or not.

I think the key thing about this example is that you would *never*
want to use MAP_HUGETLB as a generic replacement for malloc(). Huge
pages can have all kinds of weird quirky limitations, and are
certainly unsuited for small allocations. BUT they can provide huge
speed wins if used for certain specific allocations in certain
programs. (In case anyone needs a reminder of what "huge pages" even
are: http://lwn.net/Articles/374424/)
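
For concreteness, on Linux such an allocation is just an anonymous
mmap() with the MAP_HUGETLB flag. A sketch (real code would also round
the size up to a multiple of the huge page size and remember it for
munmap()):

#include <sys/mman.h>

/* 'size' bytes backed by huge pages; unmap later with munmap(p, size). */
static void *hugetlb_alloc(size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}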

If I wrote a Python library to make it easy to use huge pages with
numpy, then I might well want the allocations I was making to be
visible to tracemalloc, even though they would not be going through
malloc/free.

(For that matter -- should calls to os.mmap be calling some
tracemalloc hook in general? There are lots of cases where mmap is
really doing memory allocation -- it's very useful for shared memory
and stuff too.)

---

My current impression is something like:

- From the bug report discussion it sounds like calloc() is useful
even in core Python, so it makes sense to go ahead with that
regardless.
- Now that aligned_alloc has been standardized, it might make sense to
add it to the PyMemAllocator struct too.
- And it might also make sense to have an API by which a Python
library can say to tracemalloc: "hey FYI I just allocated something
using my favorite weird exotic method", like in the huge pages
example. This is a fully generic mechanism, so it could act as a kind
of "safety valve" for future weirdnesses.

All numpy *needs* to support its current and immediately foreseeable
usage is calloc(). But I'm a bit nervous about getting trapped -- if
the PyMem_* machinery implements calloc(), and we switch to using it
and advertise tracemalloc support to our users, and then later it
turns out that we need aligned_alloc or similar, then we'll be stuck
unless and until we can get at least one of these other changes into
CPython upstream, and that will suck for all of us.

-- 
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org