On 05/15/2012 12:42 PM, mark florisson wrote:
> On 14 May 2012 21:36, Dag Sverre Seljebotn <d.s.seljeb...@astro.uio.no> wrote:
>> On 05/14/2012 06:31 PM, mark florisson wrote:
>>> On 12 May 2012 22:55, Dag Sverre Seljebotn <d.s.seljeb...@astro.uio.no> wrote:
>>>> On 05/11/2012 03:37 PM, mark florisson wrote:
>>>>> On 11 May 2012 12:13, Dag Sverre Seljebotn <d.s.seljeb...@astro.uio.no> wrote:
>>>>>>
>>>>>> (NumPy devs: I know, I get too many ideas. But this time I *really* believe in it, I think this is going to be *huge*. And if Mark F. likes it, it's not going to be without manpower; and as his mentor I'd pitch in too here and there.)
>>>>>>
>>>>>> (Mark F.: I believe this is *very* relevant to your GSoC. I certainly don't want to micro-manage your GSoC, just have your take.)
>>>>>>
>>>>>> Travis, thank you very much for those good words in the "NA-mask interactions..." thread. It put most of my concerns away. If anybody is leaning towards opaqueness because of its OOP purity, I want to refer to C++ and its walled garden of ideological purity -- it has, what, 3-4 different OOP array libraries, none of which is able to out-compete the others. Meanwhile the rest of the world happily cooperates using pointers, strides, CSR and CSC.
>>>>>>
>>>>>> Now, there are limits to what you can do with strides and pointers. No one's denying the need for more. In my mind that's an API where you can do fetch_block and put_block of cache-sized, N-dimensional blocks on an array; but it might be something slightly different.
>>>>>>
>>>>>> Here's what I'm asking: DO NOT simply keep extending ndarray and the NumPy C API to deal with this issue.
>>>>>>
>>>>>> What we need is duck-typing/polymorphism at the C level. If you keep extending ndarray and the NumPy C API, what we'll have is a one-to-many relationship: one provider of array technology, multiple consumers (with hooks, I'm sure, but all implementations of the hook concept in the NumPy world I've seen so far are a total disaster!).
>>>>>>
>>>>>> What I think we need instead is something like PEP 3118 for the "abstract" array that is only available block-wise with getters and setters. On the Cython list we've decided that what we want for CEP 1000 (for boxing callbacks etc.) is to extend PyTypeObject with our own fields; we could create CEP 1001 to solve this issue and make any Python object an exporter of "block-getter/setter-arrays" (better name needed).
>>>>>>
>>>>>> What would be exported is (of course) a simple vtable:
>>>>>>
>>>>>> typedef struct {
>>>>>>     int (*get_block)(void *ctx, ssize_t *upper_left,
>>>>>>                      ssize_t *lower_right, ...);
>>>>>>     ...
>>>>>> } block_getter_setter_array_vtable;
>>>>>>
>>>>>> Let's please discuss the details *after* the fundamentals. But the reason I put void* there instead of PyObject* is that I hope this could be used beyond the Python world (say, Python<->Julia); the void* would be handed to you at the time you receive the vtable (however we handle that).
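>>>>>>
>>>>>> Purely to make that concrete -- every name and signature below is provisional, and put_block is just the obvious mirror of the get_block/fetch_block idea above:
>>>>>>
>>>>>> /* Sketch only, expanding the struct above; ctx is the opaque
>>>>>>    pointer handed out together with the vtable. */
>>>>>> typedef struct {
>>>>>>     int ndim;
>>>>>>     ssize_t itemsize;
>>>>>>     /* Copy the block [lo[i], hi[i]) along each axis i into
>>>>>>        caller-provided memory; return 0 on success. */
>>>>>>     int (*get_block)(void *ctx, const ssize_t *lo,
>>>>>>                      const ssize_t *hi, void *out);
>>>>>>     /* Write caller-provided memory back into that block. */
>>>>>>     int (*put_block)(void *ctx, const ssize_t *lo,
>>>>>>                      const ssize_t *hi, const void *data);
>>>>>> } block_getter_setter_array_vtable;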
>>>>>
>>>>> I suppose it would also be useful to have some way of predicting the output format polymorphically for the caller. E.g. dense * block_diagonal results in block diagonal, but dense + block_diagonal results in dense, etc. It might be useful for the caller to know whether it needs to allocate a sparse, dense or block-structured array. Or maybe the polymorphic function could even do the allocation. This needs to happen recursively of course, to avoid intermediate temporaries. The compiler could easily handle that, and so could numpy when it gets lazy evaluation.
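>>>>>
>>>>> Something as simple as this might already cover the common cases -- names entirely made up, just to illustrate: ops that preserve zeros (like *) keep the sparser operand's structure (pattern intersection), while other ops (like +) densify toward the denser operand (pattern union):
>>>>>
>>>>> /* Toy ordering: from dense (0) towards sparser layouts. */
>>>>> typedef enum { FMT_DENSE, FMT_CSR, FMT_BLOCK_DIAGONAL } fmt_t;
>>>>>
>>>>> static fmt_t predict_format(fmt_t a, fmt_t b, int preserves_zeros)
>>>>> {
>>>>>     fmt_t sparser = a > b ? a : b;
>>>>>     fmt_t denser  = a > b ? b : a;
>>>>>     return preserves_zeros ? sparser : denser;
>>>>> }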
>>>>
>>>> Ah. But that depends on the computation to be performed too: a) elementwise, b) axis-wise reductions, c) linear algebra...
>>>>
>>>> In my oomatrix code (please don't look at it, it's shameful) I do this using multiple dispatch.
>>>>
>>>> I'd rather ignore this for as long as we can, only implementing "a[:] = ..." -- I can't see how decisions here would trickle down to the API that's used in the kernel; it's more like a pre-phase, and better treated orthogonally.
>>>>
>>>>> I think if the heavy lifting of allocating output arrays and exporting these arrays works in numpy, then support in Cython could use that (I can already hear certain people object to more complicated array stuff in Cython :). Even better here would be an external project that each of our projects could use (I still think the nditer sorting functionality of arrays should be numpy-agnostic and externally available).
>>>>
>>>> I agree with the separate project idea. It's trivial for NumPy to incorporate that as one of its methods for exporting arrays, and I don't think it makes sense to either build it into Cython, or outright depend on NumPy.
>>>>
>>>> Here's what I'd like (working title: NumBridge?).
>>>>
>>>> - Mission: be the "double* + shape + strides" in a world where that is no longer enough, by providing tight, focused APIs/ABIs that are usable across C/Fortran/Python. I basically want something I can quickly acquire from a NumPy array, then pass it into my C code without dragging along all the cruft that I don't need.
>>>> - Written in pure C + specs, usable without Python.
>>>> - PEP 3118 "done right": basically semi-standardize the internal Cython memoryview ABI and get something that's passable on the stack.
>>>> - A block get/put API.
>>>> - Iterator APIs.
>>>> - Utility code for exporters and clients (iteration code, axis reordering, etc.)
>>>>
>>>> Is the scope of that insane, or is it at least worth a shot to see how bad it is? Beyond figuring out a small subset that can be done first, and whether performance considerations must be taken or not, there are two complicating factors: pluggable dtypes, and memory management. Perhaps you could come to Oslo for a couple of days to brainstorm...
>>>>
>>>> Dag
>>>
>>> The blocks are a good idea, but I think they get fairly complicated for complicated matrix layouts. It would be nice to have something that is reasonably efficient for at least most of the array storage mechanisms. I'm going to do a little brain dump below, let's see if anything is useful :)
>>>
>>> What if we basically take the CSR format and make it a little simpler, easier to handle, and better suited for other layouts. Basically, keep shape/strides for dense arrays, and for blocked arrays just "extend" your number of dimensions, i.e. a 2D blocked array becomes a 4D array, something like this:
>>>
>>> >>> a = np.arange(4).repeat(4).reshape(4, 4)
>>> >>> a
>>> array([[0, 0, 0, 0],
>>>        [1, 1, 1, 1],
>>>        [2, 2, 2, 2],
>>>        [3, 3, 3, 3]])
>>> >>> a.shape = (2, 2, 2, 2)
>>> >>> itemsize = a.dtype.itemsize
>>> >>> a.strides = (8 * itemsize, 2 * itemsize, 4 * itemsize, 1 * itemsize)
>>> >>> a
>>> array([[[[0, 0],
>>>          [1, 1]],
>>>
>>>         [[0, 0],
>>>          [1, 1]]],
>>>
>>>
>>>        [[[2, 2],
>>>          [3, 3]],
>>>
>>>         [[2, 2],
>>>          [3, 3]]]])
>>> >>> print a.flatten()
>>> [0 0 1 1 0 0 1 1 2 2 3 3 2 2 3 3]
>>>
>>> Now, given some buffer flag (PyBUF_Sparse or something), basically use suboffsets with indirect dimensions, where only ever the last dimension is a row of contiguous memory (the entire thing may be contiguous, but the point is that you don't care). This row may
>>>
>>> - be variable sized
>>> - have either a single "column start" (e.g. diagonal, band/tri-diagonal, block diagonal, etc), OR
>>> - a list of column indices, one per element in the row (like in the CSR format)
>>>
>>> The length of each innermost row is then determined by looking at, in order (a small C sketch follows below):
>>>
>>> - the extent as specified in the shape list
>>> - if -1, and some flag is set, the extent is determined like CSR, i.e. (uintptr_t) row[i + 1] - (uintptr_t) row[i]
>>>   -> maybe instead of pointers indices are better, for serialization, GPUs, etc
>>> - otherwise, use either a function pointer or perhaps a list of extents
>>>
>>> All these details will obviously be abstracted, allowing for easy iteration, but it can also be used by ufuncs operating on contiguous rows (often, since the actual storage is contiguous and stored in a 1D array, some flag could indicate contiguity as an optimization for unary ufuncs and flat iteration). Tiled nditer-ation could also work here without too big a hassle.
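>>>
>>> In C the lookup could be roughly the following (names made up; using indices rather than raw pointers, per the note above):
>>>
>>> /* Extent of innermost row i, following the rules above (sketch).
>>>    shape[d] >= 0            : fixed extent from the shape list
>>>    shape[d] == -1, csr_flag : difference of consecutive row starts
>>>    otherwise                : per-row extent table (or a callback) */
>>> static ssize_t row_extent(const ssize_t *shape, int d, int csr_flag,
>>>                           const ssize_t *row_start,
>>>                           const ssize_t *extents, ssize_t i)
>>> {
>>>     if (shape[d] >= 0)
>>>         return shape[d];
>>>     if (csr_flag)
>>>         return row_start[i + 1] - row_start[i];
>>>     return extents[i];
>>> }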
>>> When you slice, you add to the suboffset and manipulate the single extent or entire list of extents in that dimension, and the flag that determines the length via pointer subtraction should be cleared (this should probably all happen through vtable functions).
>>>
>>> An exporter would also be free to use different malloced pointers, allowing variable-sized array support with append/pop functionality in multiple dimensions (if there are no active buffer views).
>>>
>>> Random access in the case where a column start is provided is still constant time, and done through a[i][j][k - colstart], unless you have discontiguous rows, in which case you are allowed a logarithmic search (if the size exceeds some threshold). I see scipy.sparse does a linear search, which is pretty slow in pathological cases:
>>>
>>> from scipy import sparse
>>> a = np.random.random((1, 4000000))
>>> b = sparse.csr_matrix(a)
>>> %timeit a[0, 1000]
>>> %timeit b[0, 1000]
>>>
>>> 10000000 loops, best of 3: 190 ns per loop
>>> 10 loops, best of 3: 29.3 ms per loop
>>
>> Heh. That is *extremely* pathological though, nobody does that in real code :-)
>>
>> Here's an idea I had yesterday: to get an ND sparse array with good spatial locality (for local operations etc.), you could map the elements to a volume-filling fractal and then use a hash-map with linear probing. I bet it either doesn't work or has been done already :-)
>>
>>> Now, depending on the density and operation, the caller may have some idea of how to allocate an output array. I'm not sure how to handle "insertions" of new elements, but I presume through vtable put/insert functions. I'm also not sure how to fit this in with linear algebra functionality, other than providing conversions of the view.
>>>
>>> I think a disadvantage of this scheme is that you can't reorder your axes anymore, and many operations that are easy in the dense case are suddenly harder, e.g. this scheme does not allow you to go from a csr-like format into csc. But I think what this gives is reasonable generality to allow easy use in C/Fortran/Cython compiled code/numpy ufunc invocation, as well as allowing efficient-ish storage for various kinds of arrays.
>>>
>>> Does this make any sense?
>>
>> I'll admit I didn't go through the finer details of your idea; let's deal with the below first and then I can re-read your post later.
>>
>> What I'm thinking is that this seems interesting, but perhaps it lowers the versatility so much that it's not really worth considering, for the GSoC at least. If the impact isn't high enough, my hunch is that one may as well not bother and just do strided arrays.
>>
>> I actually believe that the *likely* outcome of this discussion is to stick to the original plan and focus on expressions with strided arrays. But let's see.
>>
>> I'm not sure if what brought me to the blocked case was really block-sparse arrays or diagonal arrays. Let's see...:
>>
>> 1) Memory conservation. The array is NOT element-sparse; it's just that you don't want to hold it all in memory at once.
>>
>> Examples:
>>
>> - "a_array *= np.random.normal(size=a_array.shape)". The right-hand side can be generated on the fly (making it return the same data for each block every time is non-trivial but can be done). If a_array takes up a good chunk of RAM, there might not even be enough memory for the right-hand side except block-by-block. (That saves bandwidth too.) A block-wise version is sketched right after these examples.
>>
>> - You want to stream from one file to another file, neither of which will fit in RAM. Here we really don't care about speed, just code reuse... it's irritating to have to manually block in such a situation.
>>
>> - You want to play with a new fancy array format (a blocked format that's faster for some linear algebra, say). But then you need to call another C library that takes data+shape+strides+dtype. Then you need to make a copy, which you'd rather avoid -- so an alternative is to base that C library on the block API so that it can interface with your fancy new format (and anything else).
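>>
>> Block-wise, the first example could look roughly like this (a sketch only: 2-D, doubles, reusing the provisional get_block/put_block vtable from earlier in the thread; gen_rhs is a made-up stand-in for the on-the-fly right-hand side):
>>
>> #define BLOCK 64
>> #define MIN(x, y) ((x) < (y) ? (x) : (y))
>>
>> /* "a_array *= rhs", one cache-sized block at a time. */
>> static void blocked_inplace_mul(block_getter_setter_array_vtable *vt,
>>                                 void *ctx, const ssize_t *shape,
>>                                 void (*gen_rhs)(const ssize_t *lo,
>>                                                 const ssize_t *hi,
>>                                                 double *out))
>> {
>>     double buf[BLOCK * BLOCK], rhs[BLOCK * BLOCK];
>>     ssize_t lo[2], hi[2], n, k;
>>
>>     for (lo[0] = 0; lo[0] < shape[0]; lo[0] += BLOCK)
>>         for (lo[1] = 0; lo[1] < shape[1]; lo[1] += BLOCK) {
>>             hi[0] = MIN(lo[0] + BLOCK, shape[0]);
>>             hi[1] = MIN(lo[1] + BLOCK, shape[1]);
>>             vt->get_block(ctx, lo, hi, buf);   /* fetch block */
>>             gen_rhs(lo, hi, rhs);              /* generate rhs block */
>>             n = (hi[0] - lo[0]) * (hi[1] - lo[1]);
>>             for (k = 0; k < n; k++)
>>                 buf[k] *= rhs[k];              /* combine in cache */
>>             vt->put_block(ctx, lo, hi, buf);   /* write back */
>>         }
>> }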
>> 2) Bandwidth conservation. Numexpr, Theano, Numba and Cython vectorized expressions all already deal with this problem on one level.
>>
>> However, I also believe there's a higher level in many programs where blocks come into play. The organization of many scientific codes essentially reads "do A on 8 GB of data, then do B on 8 GB of data"; and it's going to be a *long* time before you have the full-program analysis to block that up in every case; the tools above will be able to deal with some almost-trivial cases only.
>>
>> A block API could be used to write "pipeline programs". For instance, consider
>>
>> a_array[:, None] * f(x_array)
>>
>> where f is some rather complicated routine in Fortran that is NOT a ufunc -- it takes all of x_array as input and provides all the output "at once". But at some point it needs to write its output, and if it writes to a block API it could do the multiplication with a_array while the block is in cache anyway and save a memory-bus round-trip. (Provided the output isn't needed as scratch space.)
>>
>> Summing up:
>>
>> Vectorized expressions on contiguous (or strided) memory in Cython are pretty nice by themselves; they would bring us up to the current state of the art of static compilation (in Fortran compilers).
>>
>> But my hunch is your sparse-block proposal doesn't add enough flexibility to be worth the pain. Many of the cases above can't be covered with it.
>>
>> If it's possible to identify a little nugget API that is forward-compatible with the above use cases (it wouldn't solve them, perhaps, but would allow them to be solved with some extra supporting code), then it might be worth it to a) implement it in NumPy, b) support it for Cython vectorized expressions, and use that to support block transposition.
>>
>> But I'm starting to feel that we can't really know how that little nugget API should look until the above has been explored in a little more depth; and then we are easily talking a scope too large without tying it to a larger project (which can't really happen before the GSoC...). For instance, 2) suggests push/pull rather than put/get.
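>>
>> Roughly, the difference (illustration only, made-up names):
>>
>> /* Pull: the consumer asks for blocks whenever it is ready. */
>> int (*get_block)(void *ctx, const ssize_t *lo, const ssize_t *hi,
>>                  void *out);
>>
>> /* Push: the producer hands each block to a consumer callback as it
>>    is computed, so the next pipeline stage can run on the block
>>    while it is still in cache. */
>> typedef int (*block_sink)(void *sink_ctx, const ssize_t *lo,
>>                           const ssize_t *hi, const void *data);
>> int (*push_blocks)(void *ctx, block_sink sink, void *sink_ctx);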
>>
>> Dag
>
> I assumed as much; in any case I was going to start with dense arrays, simply because they are in common use and well-defined at this point. Maybe what we really want is just lazy evaluation that works for different array layouts and storage mechanisms, and a smart JIT that can evaluate our linear algebra and array expressions in an optimal fashion (memory/cache-wise, but also out-of-core-wise). This probably means writing everything in a high-level language, because even if your Fortran routines themselves used your lazy evaluation library, your linear algebra library surely wouldn't :)

My feeling is there should definitely be a very sweet spot somewhere that's not as good as everything-lazily-evaluated/full-program analysis, but still much better than where we are today. Of course, that doesn't mean that you should bother about it :-)

> Anyway, I'm going to be pragmatic and investigate how much of Theano we can reuse in Cython now.

Sounds good. At least now the larger world of blocking is in the back of our minds as things proceed.

Dag