2017-03-13 18:11 GMT+01:00 Julian Taylor <jtaylor.deb...@googlemail.com>:

> On 13.03.2017 16:21, Anne Archibald wrote:
> >
> >
> > On Mon, Mar 13, 2017 at 12:21 PM Julian Taylor
> > <jtaylor.deb...@googlemail.com>
> > wrote:
> >
> >     Should it be agreed that caching is worthwhile, I would propose a
> >     very simple implementation. We only really need to cache a small
> >     handful of array data pointers for the fast allocate/deallocate
> >     cycles that appear in common numpy usage.
> >     For example, a small list of maybe 4 pointers storing the 4
> >     largest recent deallocations. New allocations just pick the first
> >     memory block of sufficient size.
> >     The cache would only be active on systems that support MADV_FREE
> >     (which is Linux 4.5 and probably BSD too).
> >
> >     So what do you think of this idea?
> >
> >
> > This is an interesting thought, and potentially a nontrivial speedup
> > with zero user effort. But coming up with an appropriate caching policy
> > is going to be tricky. The thing is, for each array, numpy grabs a block
> > of "the right size", and that size can easily vary by orders of magnitude,
> > even within the temporaries of a single expression as a result of
> > broadcasting. So simply giving each new array the smallest cached block
> > that will fit could easily result in small arrays in giant allocated
> > blocks, wasting non-reclaimable memory. So really you want to recycle
> > blocks of the same size, or nearly so, which argues for a fairly large
> > cache, with smart indexing of some kind.
> >
>
> The nice thing about MADV_FREE is that we don't need any clever cache:
> the same process that marked the pages free can reclaim them in a later
> allocation; at least that is what my testing indicates.
> So a small allocation that receives a huge memory block does not waste
> memory, as the unused upper part will be reclaimed when needed, either
> by numpy itself doing another allocation or by a different program on
> the system.
>

Well, what you say makes a lot of sense to me, so if you have tested
that, then I'd say this is worth a PR so we can see how it works on
different workloads.
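
To make sure I understand it, here is a rough C sketch of the proposed
policy. All names are mine (not actual numpy internals), error handling
is omitted, and blocks come from mmap so they are page-aligned, which
madvise() requires:

#include <stddef.h>
#include <sys/mman.h>

#define NCACHE 4

typedef struct { void *ptr; size_t size; } cached_block;
static cached_block cache[NCACHE];      /* ptr == NULL marks an empty slot */

/* On deallocation: keep the block if it is larger than the smallest
   cached one, and mark its pages MADV_FREE so the kernel may reclaim
   them under memory pressure while the mapping stays valid. */
static void cached_free(void *ptr, size_t size)
{
    int victim = 0;                     /* slot holding the smallest block */
    for (int i = 1; i < NCACHE; i++)
        if (cache[i].size < cache[victim].size)
            victim = i;

    if (size <= cache[victim].size) {   /* smaller than everything cached */
        munmap(ptr, size);
        return;
    }
    if (cache[victim].ptr != NULL)
        munmap(cache[victim].ptr, cache[victim].size);

    madvise(ptr, size, MADV_FREE);      /* Linux >= 4.5 */
    cache[victim].ptr = ptr;
    cache[victim].size = size;
}

/* On allocation: reuse the first cached block of sufficient size;
   touching its pages again simply faults fresh ones back in. */
static void *cached_alloc(size_t size)
{
    for (int i = 0; i < NCACHE; i++) {
        if (cache[i].ptr != NULL && cache[i].size >= size) {
            void *p = cache[i].ptr;
            cache[i].ptr = NULL;
            cache[i].size = 0;
            return p;
        }
    }
    return mmap(NULL, size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}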


>
> An issue that does arise, though, is that this memory is not available
> to the page cache used for caching on-disk data. Too large a cache
> might then be detrimental to IO-heavy workloads that rely on the page
> cache.
>

Yeah.  Also, memory-mapped arrays use the page cache intensively, so we
should test this use case and see how the caching affects memory map
performance.
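
A crude harness for that test could look like the sketch below: map a
big file read-only and time a full pass over it, once with the
allocation cache active and once without (the file name is just a
placeholder, and error handling is omitted):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    int fd = open("big_array.dat", O_RDONLY);   /* placeholder file */
    struct stat st;
    fstat(fd, &st);

    unsigned char *data = mmap(NULL, st.st_size, PROT_READ,
                               MAP_PRIVATE, fd, 0);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    long page = sysconf(_SC_PAGESIZE);
    unsigned long sum = 0;
    for (off_t i = 0; i < st.st_size; i += page)  /* touch every page */
        sum += data[i];

    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("sum=%lu, %.3f s\n", sum,
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);

    munmap(data, st.st_size);
    close(fd);
    return 0;
}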


> So we might want to cap it to some max size, provide an explicit on/off
> switch and/or have numpy IO functions clear the cache.
>

Definitely, dynamically allowing this feature to be disabled would be
desirable.  That would provide an easy path for testing how it affects
performance.  Would that be feasible?
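
For instance, a one-time environment variable check could gate the whole
cache; the variable name below is just a hypothetical placeholder, not
an existing numpy knob:

#include <stdlib.h>

static int cache_enabled = -1;          /* -1 means not yet initialized */

/* Consult the environment once; any value other than "0" disables the
   cache (hypothetical switch, shown only to illustrate the idea). */
static int use_cache(void)
{
    if (cache_enabled < 0) {
        const char *env = getenv("NPY_DISABLE_ALLOC_CACHE");
        cache_enabled = (env == NULL || env[0] == '0');
    }
    return cache_enabled;
}

Then cached_alloc()/cached_free() from the earlier sketch would fall
back to plain mmap()/munmap() whenever use_cache() returns 0.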


Francesc