On 05/14/2012 06:31 PM, mark florisson wrote:
> On 12 May 2012 22:55, Dag Sverre Seljebotn <d.s.seljeb...@astro.uio.no> wrote:
>> On 05/11/2012 03:37 PM, mark florisson wrote:
>>> On 11 May 2012 12:13, Dag Sverre Seljebotn <d.s.seljeb...@astro.uio.no> wrote:
>>>>
>>>> (NumPy devs: I know, I get too many ideas. But this time I *really* believe in it, I think this is going to be *huge*. And if Mark F. likes it it's not going to be without manpower; and as his mentor I'd pitch in too here and there.)
>>>>
>>>> (Mark F.: I believe this is *very* relevant to your GSoC. I certainly don't want to micro-manage your GSoC, just have your take.)
>>>>
>>>> Travis, thank you very much for those good words in the "NA-mask interactions..." thread. It put most of my concerns away. If anybody is leaning towards opaqueness because of its OOP purity, I want to refer to C++ and its walled garden of ideological purity -- it has, what, 3-4 different OOP array libraries, none of which has been able to out-compete the others. Meanwhile the rest of the world happily cooperates using pointers, strides, CSR and CSC.
>>>>
>>>> Now, there are limits to what you can do with strides and pointers. No one's denying the need for more. In my mind that's an API where you can do fetch_block and put_block of cache-sized, N-dimensional blocks on an array; but it might be something slightly different.
>>>>
>>>> Here's what I'm asking: DO NOT simply keep extending ndarray and the NumPy C API to deal with this issue.
>>>>
>>>> What we need is duck-typing/polymorphism at the C level. If you keep extending ndarray and the NumPy C API, what we'll have is a one-to-many relationship: one provider of array technology, multiple consumers (with hooks, I'm sure, but all implementations of the hook concept in the NumPy world I've seen so far are a total disaster!).
>>>>
>>>> What I think we need instead is something like PEP 3118 for the "abstract" array that is only available block-wise with getters and setters. On the Cython list we've decided that what we want for CEP 1000 (for boxing callbacks etc.) is to extend PyTypeObject with our own fields; we could create CEP 1001 to solve this issue and make any Python object an exporter of "block-getter/setter-arrays" (better name needed).
>>>>
>>>> What would be exported is (of course) a simple vtable:
>>>>
>>>> typedef struct {
>>>>     int (*get_block)(void *ctx, ssize_t *upper_left, ssize_t *lower_right, ...);
>>>>     ...
>>>> } block_getter_setter_array_vtable;
>>>>
>>>> Let's please discuss the details *after* the fundamentals. But the reason I put void* there instead of PyObject* is that I hope this could be used beyond the Python world (say, Python <-> Julia); the void* would be handed to you at the time you receive the vtable (however we handle that).
>>>
>>> I suppose it would also be useful to have some way of predicting the output format polymorphically for the caller. E.g. dense * block_diagonal results in block diagonal, but dense + block_diagonal results in dense, etc. It might be useful for the caller to know whether it needs to allocate a sparse, dense or block-structured array. Or maybe the polymorphic function could even do the allocation. This needs to happen recursively of course, to avoid intermediate temporaries. The compiler could easily handle that, and so could numpy when it gets lazy evaluation.
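A toy version of that kind of result-format table, in Python -- the format tags and rules below are made up, it's only meant to pin down what "predicting the output format" would buy the caller:

    # Hypothetical storage-format tags; a real system would dispatch on types.
    DENSE, BLOCK_DIAG = "dense", "block_diag"

    # (format_a, format_b, op) -> result format.  Illustrative rules only:
    # elementwise multiplication by a block-diagonal array preserves its
    # sparsity pattern, while addition of a dense array destroys it.
    RESULT_FORMAT = {
        (DENSE, BLOCK_DIAG, "mul"): BLOCK_DIAG,
        (BLOCK_DIAG, DENSE, "mul"): BLOCK_DIAG,
        (DENSE, BLOCK_DIAG, "add"): DENSE,
        (BLOCK_DIAG, DENSE, "add"): DENSE,
    }

    def predict_format(fmt_a, fmt_b, op):
        # Lets the caller allocate the right kind of output up front;
        # fall back to dense when no rule matches.
        return RESULT_FORMAT.get((fmt_a, fmt_b, op), DENSE)

    assert predict_format(DENSE, BLOCK_DIAG, "mul") == BLOCK_DIAG
    assert predict_format(DENSE, BLOCK_DIAG, "add") == DENSE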
>>
>> Ah. But that depends on the computation to be performed, too; a) elementwise, b) axis-wise reductions, c) linear algebra...
>>
>> In my oomatrix code (please don't look at it, it's shameful) I do this using multiple dispatch.
>>
>> I'd rather ignore this for as long as we can, only implementing "a[:] = ..." -- I can't see how decisions here would trickle down to the API that's used in the kernel; it's more like a pre-phase, and better treated orthogonally.
>>
>>> I think if the heavy lifting of allocating output arrays and exporting these arrays work in numpy, then support in Cython could use that (I can already hear certain people object to more complicated array stuff in Cython :). Even better here would be an external project that each of our projects could use (I still think the nditer sorting functionality of arrays should be numpy-agnostic and externally available).
>>
>> I agree with the separate project idea. It's trivial for NumPy to incorporate that as one of its methods for exporting arrays, and I don't think it makes sense to either build it into Cython, or outright depend on NumPy.
>>
>> Here's what I'd like (working title: NumBridge?).
>>
>>  - Mission: Be the "double* + shape + strides" in a world where that is no longer enough, by providing tight, focused APIs/ABIs that are usable across C/Fortran/Python. I basically want something I can quickly acquire from a NumPy array, then pass it into my C code without dragging along all the cruft that I don't need.
>>
>>  - Written in pure C + specs, usable without Python
>>
>>  - PEP 3118 "done right": basically semi-standardize the internal Cython memoryview ABI and get something that's passable on the stack
>>
>>  - Block get/put API (sketched below)
>>
>>  - Iterator APIs
>>
>>  - Utility code for exporters and clients (iteration code, axis reordering, etc.)
>>
>> Is the scope of that insane, or is it at least worth a shot to see how bad it is? Beyond figuring out a small subset that can be done first, and whether performance considerations must be taken into account or not, there are two complicating factors: pluggable dtypes and memory management. Perhaps you could come to Oslo for a couple of days to brainstorm...
>>
>> Dag
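To make the block get/put point concrete, here's a rough Python-level analogue of what such a vtable would expose. In reality this would be C function pointers plus the void* ctx, and every name here (BlockExporter, get_block, put_block, blockshape) is a placeholder:

    import numpy as np

    class BlockExporter:
        """Toy exporter: serves cache-sized N-d blocks out of a NumPy array.

        Stand-in for the C vtable {get_block, put_block}; the array held
        here plays the role of the void* ctx handed over with the vtable.
        """
        def __init__(self, arr, blockshape):
            self.arr = arr
            self.blockshape = blockshape

        def get_block(self, upper_left):
            ix = tuple(slice(i, i + b)
                       for i, b in zip(upper_left, self.blockshape))
            return self.arr[ix].copy()   # exporter may hand out any buffer

        def put_block(self, upper_left, block):
            ix = tuple(slice(i, i + b)
                       for i, b in zip(upper_left, self.blockshape))
            self.arr[ix] = block

    # A consumer that never sees the exporter's memory layout:
    a = BlockExporter(np.arange(16.).reshape(4, 4), blockshape=(2, 2))
    blk = a.get_block((0, 2))
    a.put_block((0, 2), 2 * blk)     # double one block in place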
>
> The blocks are a good idea, but I think fairly complicated for complicated matrix layouts. It would be nice to have something that is reasonably efficient for at least most of the array storage mechanisms. I'm going to do a little brain dump below, let's see if anything is useful :)
>
> What if we basically take the CSR format and make it a little simpler, easier to handle, and better suited for other layouts? Basically, keep shape/strides for dense arrays, and for blocked arrays just "extend" your number of dimensions, i.e. a 2D blocked array becomes a 4D array, something like this:
>
> >>> a = np.arange(4).repeat(4).reshape(4, 4)
> >>> a
> array([[0, 0, 0, 0],
>        [1, 1, 1, 1],
>        [2, 2, 2, 2],
>        [3, 3, 3, 3]])
>
> >>> a.shape = (2, 2, 2, 2)
> >>> itemsize = a.dtype.itemsize
> >>> a.strides = (8 * itemsize, 2 * itemsize, 4 * itemsize, 1 * itemsize)
> >>> a
> array([[[[0, 0],
>          [1, 1]],
>
>         [[0, 0],
>          [1, 1]]],
>
>
>        [[[2, 2],
>          [3, 3]],
>
>         [[2, 2],
>          [3, 3]]]])
>
> >>> print a.flatten()
> [0 0 1 1 0 0 1 1 2 2 3 3 2 2 3 3]
>
> Now, given some buffer flag (PyBUF_Sparse or something), use basically suboffsets with indirect dimensions, where only ever the last dimension is a row of contiguous memory (the entire thing may be contiguous, but the point is that you don't care). This row may
>
>  - be variable sized,
>  - have a single "column start" (e.g. diagonal, band/tri-diagonal, block diagonal, etc.), OR
>  - have a list of column indices of the same length as the row (like in the CSR format)
>
> The length of each innermost row is then determined by looking at, in order:
>  - the extent as specified in the shape list
>  - if -1, and some flag is set, the extent is determined like CSR, i.e. (uintptr_t) row[i + 1] - (uintptr_t) row[i]
>    -> maybe indices are better than pointers here, for serialization, GPUs, etc.
>  - otherwise, use either a function pointer or perhaps a list of extents
>
> All these details will obviously be abstracted, allowing for easy iteration, but it can also be used by ufuncs operating on contiguous rows (often, since the actual storage is contiguous and stored in a 1D array, some flag could indicate contiguity as an optimization for unary ufuncs and flat iteration). Tiled nditer-ation could also work here without too big a hassle.
>
> When you slice, you add to the suboffset and manipulate the single extent or entire list of extents in that dimension, and the flag to determine the length using the pointer subtraction should be cleared (this should probably all happen through vtable functions).
>
> An exporter would also be free to use different malloced pointers, allowing variable-sized array support with append/pop functionality in multiple dimensions (if there are no active buffer views).
>
> Random access in the case where a column start is provided is still constant time, and done through a[i][j][k - colstart], unless you have discontiguous rows, in which case you are allowed a logarithmic search (if the size exceeds some threshold). I see scipy.sparse does a linear search, which is pretty slow in pathological cases:
>
> from scipy import sparse
> a = np.random.random((1, 4000000))
> b = sparse.csr_matrix(a)
> %timeit a[0, 1000]
> %timeit b[0, 1000]
>
> 10000000 loops, best of 3: 190 ns per loop
> 10 loops, best of 3: 29.3 ms per loop

Heh. That is *extremely* pathological though, nobody does that in real code :-)

Here's an idea I had yesterday: to get an ND sparse array with good spatial locality (for local operations etc.), you could map the elements to a volume-filling fractal and then use a hash map with linear probing. I bet it either doesn't work or has been done already :-)
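For what it's worth, the simplest concrete instance of that mapping is probably a Z-order (Morton) key: interleave the bits of the indices and hash on the result, so elements that are close in index space tend to stay close in the table. A minimal sketch (morton_key is a made-up helper, 2D only):

    def morton_key(i, j, bits=16):
        """Interleave the bits of (i, j) -> position on the Z-order curve.

        Nearby (i, j) pairs get nearby keys, which is the spatial locality
        we'd want before hashing with linear probing.
        """
        key = 0
        for b in range(bits):
            key |= ((i >> b) & 1) << (2 * b + 1)
            key |= ((j >> b) & 1) << (2 * b)
        return key

    # The four cells of a 2x2 block map to four consecutive keys:
    assert sorted(morton_key(i, j) for i in (0, 1) for j in (0, 1)) == [0, 1, 2, 3]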
>
> Now, depending on the density and operation, the caller may have some idea of how to allocate an output array. I'm not sure how to handle "insertions" of new elements, but I presume through vtable put/insert functions. I'm also not sure how to fit this in with linear algebra functionality, other than providing conversions of the view.
>
> I think a disadvantage of this scheme is that you can't reorder your axes anymore, and many operations that are easy in the dense case are suddenly harder, e.g. this scheme does not allow you to go from a csr-like format into csc. But I think what this gives is reasonable generality to allow easy use in C/Fortran/Cython compiled code/numpy ufunc invocation, as well as allowing efficient-ish storage for various kinds of arrays.
>
> Does this make any sense?

I'll admit I didn't go through the finer details of your idea; let's deal with the below first and then I can re-read your post later.

What I'm thinking is that this seems interesting, but perhaps it lowers the versatility so much that it's not really worth considering, for the GSoC at least. If the impact isn't high enough, my hunch is that one may as well not bother and just do strided arrays. I actually believe that the *likely* outcome of this discussion is to stick to the original plan and focus on expressions with strided arrays. But let's see.

I'm not sure if what brought me to the blocked case was really block-sparse arrays or diagonal arrays. Let's see...:

1) Memory conservation. The array would NOT be element-sparse; it's just that you don't want to hold it all in memory at once. Examples:

 - "a_array *= np.random.normal(size=a_array.shape)". The right-hand side can be generated on the fly (making it return the same data for each block every time is non-trivial but can be done). If a_array takes up a good chunk of RAM, there might not even be enough memory for the right-hand side except for block-by-block. (That saves bandwidth too. There's a sketch of this loop below.)

 - You want to stream from one file to another file, neither of which will fit in RAM. Here we really don't care about speed, just code reuse... it's irritating to have to manually block in such a situation.

 - You want to play with a new fancy array format (a blocked format that's faster for some linear algebra, say). But then you need to call another C library that takes data + shape + strides + dtype. Then you need to make a copy, which you'd rather avoid -- so an alternative is to make that C library be based on the block API so that it can interface with your fancy new format (and anything else).
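To spell out the first example: the block API would essentially hide a loop like the following sketch (blockwise_inplace_mul_normal is a made-up helper; per-block seeding is just one way to make the right-hand side reproducible block by block):

    import numpy as np

    def blockwise_inplace_mul_normal(a, blocklen=2**16, seed=1234):
        """a *= np.random.normal(size=a.shape), one block at a time.

        Only one block of the right-hand side ever exists in memory;
        seeding per block makes the stream replayable block by block.
        """
        flat = a.reshape(-1)      # assume a is contiguous for this sketch
        for k, start in enumerate(range(0, flat.size, blocklen)):
            stop = min(start + blocklen, flat.size)
            rng = np.random.RandomState(seed + k)   # per-block, replayable
            flat[start:stop] *= rng.normal(size=stop - start)

    a = np.ones((1000, 1000))
    blockwise_inplace_mul_normal(a)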
2) Bandwidth conservation. Numexpr, Theano, Numba and Cython vectorized expressions all already deal with this problem on one level. However, I also believe there's a higher level in many programs where blocks come into play. The organization of many scientific codes essentially reads "do A on 8 GB of data, then do B on 8 GB of data"; and it's going to be a *long* time before you have full-program analysis to block that up in every case; the tools above will be able to deal with some almost-trivial cases only. A block API could be used to write "pipeline programs".

For instance, consider a_array[:, None] * f(x_array), where f is some rather complicated routine in Fortran that is NOT a ufunc -- it takes all of x_array as input and provides all the output "at once"; but at some point it needs to do the writing of the output, and if writing to an API it could do the multiplication with a_array while the block is in cache anyway and save a memory bus round-trip. (Provided the output isn't needed as scratch space.)

Summing up: vectorized expressions on contiguous (or strided) memory in Cython are pretty nice by themselves; they would bring us up to the current state of the art of static compilation (in Fortran compilers). But my hunch is your sparse-block proposal doesn't add enough flexibility to be worth the pain. Many of the cases above can't be covered with it.

If it's possible to identify a little nugget API that is forward-compatible with the above use cases (it wouldn't solve them, perhaps, but would allow them to be solved with some extra supporting code), then it might be worth it to a) implement it in NumPy, b) support it for Cython vectorized expressions, and use that to support block transposition. But I'm starting to feel that we can't really know how that little nugget API should look until the above has been explored in a little more depth; and then we are easily talking about too large a scope without tying it to a larger project (which can't really happen before the GSoC...). For instance, 2) suggests push/pull rather than put/get.

Dag
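P.S. To illustrate the push style on that last example, here's a toy 1D version: f pushes its output blocks to a sink callback instead of materializing them, and the sink fuses in the multiplication while each block is still in cache. All names are made up, and np.exp stands in for the complicated Fortran routine:

    import numpy as np

    def f_push(x, push_block, blocklen=4096):
        """Stand-in for the Fortran routine: computes its output in blocks
        and pushes each one to the consumer instead of materializing it."""
        for start in range(0, x.size, blocklen):
            stop = min(start + blocklen, x.size)
            push_block(start, stop, np.exp(x[start:stop]))

    def fused_scale_sink(a, out):
        """Sink that multiplies by a while the pushed block is still hot."""
        def push_block(start, stop, block):
            out[start:stop] = a[start:stop] * block
        return push_block

    x = np.linspace(0., 1., 10**6)
    a = np.arange(x.size, dtype=float)
    out = np.empty_like(x)
    f_push(x, fused_scale_sink(a, out))
    assert np.allclose(out, a * np.exp(x))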