Eric Wieser wrote
>> Yes, sorry, had been a while since I had looked it up:
>>
>> https://docs.python.org/3/c-api/memory.html#c.PyMemAllocatorEx
>
> That `PyMemAllocatorEx` looks almost exactly like one of the two variants
> I was proposing. Is there a reason for wanting to define our own structure
> vs just using that one?
> I think the NEP should at least offer a brief comparison to that
> structure, even if we ultimately end up not using it.

Agreed.
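For reference, the CPython struct in question is declared roughly as below
(per the docs linked above). The second struct is only a sketch of Eric's
wrapping suggestion, with a made-up typedef name; it is not the layout the
NEP itself proposes.

```
/* PyMemAllocatorEx, as documented in the CPython C API (link above). */
typedef struct {
    /* user context passed as the first argument to the 4 functions */
    void *ctx;

    /* allocate a memory block */
    void* (*malloc)(void *ctx, size_t size);

    /* allocate a memory block initialized by zeros */
    void* (*calloc)(void *ctx, size_t nelem, size_t elsize);

    /* allocate or resize a memory block */
    void* (*realloc)(void *ctx, void *ptr, size_t new_size);

    /* release a memory block */
    void (*free)(void *ctx, void *ptr);
} PyMemAllocatorEx;

/* Sketch of the suggested reuse: wrap the CPython struct and attach a
 * name.  The typedef name here is made up for illustration. */
typedef struct {
    PyMemAllocatorEx allocator;
    char const *name;
} np_allocator_handler_sketch;
```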
Eric Wieser wrote
>> But right now the proposal says this is static, and I honestly don't
>> see much reason for it to be freeable? The current use-cases `cupy` or
>> `pnumpy` don't seem to need it.
>
> I don't know much about either of these use cases, so the following is
> speculative.
> In cupy, presumably the application is to tie allocation to a specific GPU
> device.
> Presumably then, somewhere in the python code there is a handle to a GPU
> object, through which the allocators operate.
> If that handle is stored in the allocator, and the allocator is freeable,
> then it is possible to write code that automatically releases the GPU
> handle after the allocator has been restored to the default and the last
> array using it is cleaned up.
>
> If that cupy use-case seems somewhat plausible, then I think we should go
> with the PyObject approach.
> If it doesn't seem plausible, then I think the `ctx` approach is
> acceptable, and we should consider declaring our struct
> ```struct { PyMemAllocatorEx allocator; char const *name; }``` to reuse
> the existing python API unless there's a reason not to.

Coming as a CuPy contributor here. The discussion of adopting this NEP is
not yet finalized in CuPy, so I am only speaking to the potential usage I
have in mind.

The original idea of using a custom NumPy allocator in CuPy (or any GPU
library) is to allocate pinned / page-locked memory, which lives on the
host (CPU). The point is to exploit the fact that device-to-host transfers
are faster when pinned memory is used. So, if I am calling
arr_cpu = cupy.asnumpy(arr_gpu) to create a NumPy array via a D2H transfer,
and I know arr_cpu's buffer is going to be reused several times, then it is
better for it to be backed by pinned memory from the beginning. While there
are tricks to achieve this, such a usage pattern can be quite common in
user code, so it is much easier if the allocator is configurable, to avoid
repeating boilerplate.

An interesting note: this new custom allocator could also be used to
allocate managed/unified memory from CUDA. This memory lives in a unified
address space, so both CPU and GPU can access it. I do not have much to say
about that use case, however.

Now, I am not fully sure we need `void* ctx`, or even to make it a
`PyObject`. My understanding (correct me if I am wrong!) is that the
allocator state is considered internal. Imagine I set `alloc` in
`PyDataMem_Handler` to `alloc_my_mempool`, which has access to the
internals of a memory pool class that manages a pool of pinned memory.
Then whatever information is needed can simply be kept inside my mempool
(alignment, pool size, etc.); a rough sketch is in the P.S. below. I could
implement the pool as a C++ class and expose the alloc/free/etc. member
functions to C with some hacks. If using Cython, I suppose it is less hacky
to expose a method of a cdef class. On the other hand, for pure C code,
life is probably easier if `ctx` is there. One way or another, someone must
keep a unique instance of that struct or class alive, so I do not have a
strong opinion.

Best,
Leo
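P.S. To make the "internal state" point above a bit more concrete, here is
a minimal sketch of pinned-memory allocator functions that keep their state
in file-level statics instead of reaching it through a `ctx` pointer. The
handler struct and its field names are hypothetical (that layout is exactly
what is being discussed in this thread), and the "pool" is reduced to a
trivial counter; the only real API calls are the CUDA runtime ones
(`cudaHostAlloc` / `cudaFreeHost`).

```
/* Sketch only, not CuPy code: the handler struct below is made up for
 * illustration; its fields are not taken from any final NumPy header. */
#include <stddef.h>
#include <string.h>
#include <cuda_runtime.h>

/* "Internal" allocator state: a real implementation would keep its pinned
 * memory pool here; this sketch just counts live allocations. */
static struct {
    size_t live_allocations;
} pool_state;

static void *pinned_alloc(size_t size)
{
    void *p = NULL;
    /* cudaHostAlloc returns page-locked (pinned) host memory, which makes
     * device<->host copies faster. */
    if (cudaHostAlloc(&p, size, cudaHostAllocDefault) != cudaSuccess)
        return NULL;
    pool_state.live_allocations += 1;
    return p;
}

static void *pinned_zeroed_alloc(size_t nelem, size_t elsize)
{
    void *p = pinned_alloc(nelem * elsize);
    if (p != NULL)
        memset(p, 0, nelem * elsize);
    return p;
}

static void pinned_free(void *p, size_t size)
{
    (void)size;
    if (p == NULL)
        return;
    cudaFreeHost(p);
    pool_state.live_allocations -= 1;
}

/* Hypothetical handler shape, mirroring what this thread discusses:
 * a name plus function pointers, and no ctx. */
typedef struct {
    char const *name;
    void *(*alloc)(size_t size);
    void *(*zeroed_alloc)(size_t nelem, size_t elsize);
    void (*free)(void *ptr, size_t size);
} handler_sketch;

static const handler_sketch pinned_handler = {
    "cuda_pinned_allocator",
    pinned_alloc,
    pinned_zeroed_alloc,
    pinned_free,
};
```

NumPy would then be pointed at such a handler through whatever registration
function the NEP ends up providing; nothing in the functions themselves
needs a `ctx` to find the pool.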