On Wed, Jul 17, 2013 at 10:57 AM, Frédéric Bastien <[email protected]> wrote:
> On Wed, Jul 17, 2013 at 10:39 AM, Nathaniel Smith <[email protected]> wrote:
>> On Tue, Jul 16, 2013 at 7:53 PM, Frédéric Bastien <[email protected]> wrote:
>>> Hi,
>>>
>>> On Tue, Jul 16, 2013 at 11:55 AM, Nathaniel Smith <[email protected]> wrote:
>>>> On Tue, Jul 16, 2013 at 2:34 PM, Arink Verma <[email protected]> wrote:
>>>>> > Each ndarray does two mallocs, for the obj and buffer. These could be
>>>>> > combined into 1 - just allocate the total size and do some pointer
>>>>> > arithmetic, then set OWNDATA to false.
>>>>> So, that two mallocs has been mentioned in project introduction. I got
>>>>> that wrong.
>>>>
>>>> On further thought/reading the code, it appears to be more complicated
>>>> than that, actually.
>>>>
>>>> It looks like (for a non-scalar array) we have 2 calls to PyMem_Malloc:
>>>> 1 for the array object itself, and one for the shapes + strides. And one
>>>> call to regular-old malloc: for the data buffer.
>>>>
>>>> (Mysteriously, shapes + strides together have 2*ndim elements, but to
>>>> hold them we allocate a memory region sized to hold 3*ndim elements. I'm
>>>> not sure why.)
>>>>
>>>> And contrary to what I said earlier, this is about as optimized as it
>>>> can be without breaking ABI. We need at least 2 calls to
>>>> malloc/PyMem_Malloc, because the shapes+strides may need to be resized
>>>> without affecting the much larger data area. But it's tempting to
>>>> allocate the array object and the data buffer in a single memory region,
>>>> like I suggested earlier. And this would ALMOST work. But, it turns out
>>>> there is code out there which assumes (whether wisely or not) that you
>>>> can swap around which data buffer a given PyArrayObject refers to (hi
>>>> Theano!). And supporting this means that data buffers and PyArrayObjects
>>>> need to be in separate memory regions.
>>> Are you sure that Theano "swaps" the data ptr of an ndarray? When we
>>> play with that, it is on a newly created ndarray. So a node in our graph
>>> won't change the input ndarray structure. It will create a new ndarray
>>> structure with new shape/strides and pass a data ptr, and we flag the new
>>> ndarray with own_data correctly, to my knowledge.
>>>
>>> If Theano poses a problem here, I'll suggest that I fix Theano. But
>>> currently I don't see the problem. So if this makes you change your mind
>>> about this optimization, tell me. I don't want Theano to prevent
>>> optimization in NumPy.
>>
>> It's entirely possible I misunderstood, so let's see if we can work it
>> out. I know that you want to assign to the ->data pointer in a
>> PyArrayObject, right? That's what caused some trouble with the 1.7 API
>> deprecations, which were trying to prevent direct access to this
>> field? Creating a new array given a pointer to a memory region is no
>> problem, and obviously will be supported regardless of any
>> optimizations. But if that's all you were doing then you shouldn't
>> have run into the deprecation problem. Or maybe I'm misremembering!
>
> What is currently done in only one place is to create a new PyArrayObject
> with a given ptr, so NumPy doesn't do the allocation. We later change that
> ptr to another one.
>
> It is the change to the ptr of the just-created PyArrayObject that caused
> problems with the interface deprecation. I fixed all the other problems
> related to the deprecation (mostly just renames of functions/macros), but
> I haven't fixed this one yet. I would need to change the logic to compute
> the final ptr before creating the PyArrayObject, and create it with the
> final data ptr. But in all cases, NumPy didn't allocate the data memory
> for this object, so this case doesn't block your optimization.
> One thing on our optimization "wish list" is to reuse allocated
> PyArrayObjects between Theano function calls for intermediate results (so
> completely under Theano's control). This could be useful in particular for
> reshape/transpose/subtensor. Those functions are pretty fast, and from
> memory I already found the allocation time to be significant. But in those
> cases it is on PyArrayObjects that are views, so the metadata and the data
> would be in different memory regions in all cases.
>
> The other case on the optimization "wish list" is to reuse the
> PyArrayObject when the shape isn't the right one (but the number of
> dimensions is the same). If we do that for operations like addition, we
> will need to use PyArray_Resize(). This will be done on PyArrayObjects
> whose data memory was allocated by NumPy. So if you do one memory
> allocation for metadata and data, just make sure that PyArray_Resize()
> will handle that correctly.
>
> On the usefulness of doing only one memory allocation: in our old gpu
> ndarray, we were doing two allocations on the GPU, one for metadata and
> one for data. I removed this, as it was a bottleneck. Allocation on the
> CPU is faster than on the GPU, but it is still something that is slow,
> except if you reuse memory. Does PyMem_Malloc reuse previous small
> allocations?
>
> For those that read all this, the conclusion is that Theano shouldn't
> block this optimization. If you optimize the allocation of new
> PyArrayObjects, there will be less incentive to do the "wish list"
> optimization.
>
> One last thing to keep in mind is that you should keep the data segment
> aligned. I would argue that alignment on the datatype size isn't enough,
> so I would suggest the cache line size or something like that, but I
> don't have numbers to base this on. This would also help in the case of a
> resize that changes the number of dimensions.
There is a similar thing done in f2py, which is still keeping it from being
current with the 1.7 macro-replacement-by-functions work. I'd like to add a
'swap'-type function and would welcome discussion/implementation of such.

Chuck
_______________________________________________
NumPy-Discussion mailing list
[email protected]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
