On 05/14/2012 10:36 PM, Dag Sverre Seljebotn wrote:
> On 05/14/2012 06:31 PM, mark florisson wrote:
>> On 12 May 2012 22:55, Dag Sverre Seljebotn<d.s.seljeb...@astro.uio.no> wrote:
>>> On 05/11/2012 03:37 PM, mark florisson wrote:
>>>> On 11 May 2012 12:13, Dag Sverre Seljebotn<d.s.seljeb...@astro.uio.no> wrote:
>>>>>
>>>>> (NumPy devs: I know, I get too many ideas. But this time I *really* believe in it, I think this is going to be *huge*. And if Mark F. likes it, it's not going to be without manpower; and as his mentor I'd pitch in too here and there.)
>>>>>
>>>>> (Mark F.: I believe this is *very* relevant to your GSoC. I certainly don't want to micro-manage your GSoC, just have your take.)
>>>>>
>>>>> Travis, thank you very much for those good words in the "NA-mask interactions..." thread. It laid most of my concerns to rest. If anybody is leaning towards opaqueness because of its OOP purity, I want to refer to C++ and its walled garden of ideological purity -- it has, what, 3-4 different OOP array libraries, none of which is able to out-compete the others. Meanwhile the rest of the world happily cooperates using pointers, strides, CSR and CSC.
>>>>>
>>>>> Now, there are limits to what you can do with strides and pointers. No one's denying the need for more. In my mind that's an API where you can do fetch_block and put_block of cache-sized, N-dimensional blocks on an array; but it might be something slightly different.
>>>>>
>>>>> Here's what I'm asking: DO NOT simply keep extending ndarray and the NumPy C API to deal with this issue.
>>>>>
>>>>> What we need is duck-typing/polymorphism at the C level. If you keep extending ndarray and the NumPy C API, what we'll have is a one-to-many relationship: one provider of array technology, multiple consumers (with hooks, I'm sure, but all implementations of the hook concept in the NumPy world I've seen so far are a total disaster!).
>>>>>
>>>>> What I think we need instead is something like PEP 3118 for the "abstract" array that is only available block-wise with getters and setters. On the Cython list we've decided that what we want for CEP 1000 (for boxing callbacks etc.) is to extend PyTypeObject with our own fields; we could create CEP 1001 to solve this issue and make any Python object an exporter of "block-getter/setter-arrays" (better name needed).
>>>>>
>>>>> What would be exported is (of course) a simple vtable:
>>>>>
>>>>> typedef struct {
>>>>>     int (*get_block)(void *ctx, ssize_t *upper_left, ssize_t *lower_right, ...);
>>>>>     ...
>>>>> } block_getter_setter_array_vtable;
>>>>>
>>>>> Let's please discuss the details *after* the fundamentals. But the reason I put void* there instead of PyObject* is that I hope this could be used beyond the Python world (say, Python<->Julia); the void* would be handed to you at the time you receive the vtable (however we handle that).
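
[For concreteness, here is one way the vtable sketched above might be fleshed out. This is an illustration only, not something specified in the thread: the field names, the caller-allocated-buffer convention, and the preferred_block_shape hook are all made up.]

    #include <sys/types.h>   /* ssize_t */

    typedef struct {
        /* Copy the block spanning [upper_left, lower_right) into `out`,
           a caller-allocated C-contiguous buffer. Returns 0 on success,
           nonzero on error. */
        int (*get_block)(void *ctx, const ssize_t *upper_left,
                         const ssize_t *lower_right, void *out);

        /* Write a C-contiguous buffer back into the same region. */
        int (*put_block)(void *ctx, const ssize_t *upper_left,
                         const ssize_t *lower_right, const void *in);

        /* Let the consumer pick cache-sized blocks that fit the
           exporter's natural tiling. */
        void (*preferred_block_shape)(void *ctx, ssize_t *block_shape);
    } block_vtable;

[A consumer would receive a (vtable, ctx) pair from the exporter and never look behind the ctx pointer; that opacity is what makes the scheme polymorphic at the C level while staying usable outside Python.]
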
>>>>
>>>> I suppose it would also be useful to have some way of predicting the output format polymorphically for the caller. E.g. dense * block_diagonal results in block diagonal, but dense + block_diagonal results in dense, etc. It might be useful for the caller to know whether it needs to allocate a sparse, dense or block-structured array. Or maybe the polymorphic function could even do the allocation. This needs to happen recursively of course, to avoid intermediate temporaries. The compiler could easily handle that, and so could numpy when it gets lazy evaluation.
>>>
>>> Ah. But that depends on the computation to be performed too: a) elementwise, b) axis-wise reductions, c) linear algebra...
>>>
>>> In my oomatrix code (please don't look at it, it's shameful) I do this using multiple dispatch.
>>>
>>> I'd rather ignore this for as long as we can, only implementing "a[:] = ..." -- I can't see how decisions here would trickle down to the API that's used in the kernel; it's more like a pre-phase, and better treated orthogonally.
>>>
>>>> I think if the heavy lifting of allocating output arrays and exporting these arrays works in numpy, then support in Cython could use that (I can already hear certain people object to more complicated array stuff in Cython :). Even better here would be an external project that each of our projects could use (I still think the nditer sorting functionality of arrays should be numpy-agnostic and externally available).
>>>
>>> I agree with the separate project idea. It's trivial for NumPy to incorporate that as one of its methods for exporting arrays, and I don't think it makes sense to either build it into Cython or outright depend on NumPy.
>>>
>>> Here's what I'd like (working title: NumBridge?).
>>>
>>> - Mission: Be the "double* + shape + strides" in a world where that is no longer enough, by providing tight, focused APIs/ABIs that are usable across C/Fortran/Python.
>>>
>>> I basically want something I can quickly acquire from a NumPy array, then pass into my C code without dragging along all the cruft that I don't need.
>>>
>>> - Written in pure C + specs, usable without Python
>>>
>>> - PEP 3118 "done right": basically semi-standardize the internal Cython memoryview ABI and get something that's passable on the stack
>>>
>>> - Block get/put API
>>>
>>> - Iterator APIs
>>>
>>> - Utility code for exporters and clients (iteration code, axis reordering, etc.)
>>>
>>> Is the scope of that insane, or is it at least worth a shot to see how bad it is? Beyond figuring out a small subset that can be done first, and whether performance considerations must be taken or not, there are two complicating factors: pluggable dtypes and memory management. Perhaps you could come to Oslo for a couple of days to brainstorm...
>>>
>>> Dag
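
[Again purely for illustration, one guess at what "PEP 3118 done right, passable on the stack" could mean concretely. Every name here, the MAX_NDIM bound, and the integer dtype code are invented:]

    #include <sys/types.h>   /* ssize_t */

    enum { NB_MAX_NDIM = 8 };

    /* A strided view small enough to pass by value on the stack. */
    typedef struct {
        void    *data;
        int      ndim;
        int      dtype;                  /* some agreed-on type code */
        ssize_t  shape[NB_MAX_NDIM];
        ssize_t  strides[NB_MAX_NDIM];   /* in bytes, as in NumPy */
    } nb_view;

    /* The element-address computation every consumer agrees on. */
    static inline char *nb_addr(const nb_view *v, const ssize_t *index)
    {
        char *p = (char *)v->data;
        for (int i = 0; i < v->ndim; i++)
            p += index[i] * v->strides[i];
        return p;
    }

[The point of a fixed-size, by-value struct is that a call can pass it with no heap allocation and no per-call acquire/release handshake, which is where Py_buffer gets heavy.]
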
>>
>> The blocks are a good idea, but I think fairly complicated for complicated matrix layouts. It would be nice to have something that is reasonably efficient for at least most of the array storage mechanisms. I'm going to do a little brain dump below, let's see if anything is useful :)
>>
>> What if we basically take the CSR format and make it a little simpler, easier to handle, and better suited for other layouts? Basically, keep shape/strides for dense arrays, and for blocked arrays just "extend" your number of dimensions, i.e. a 2D blocked array becomes a 4D array, something like this:
>>
>>>>> import numpy as np
>>>>> a = np.arange(4).repeat(4).reshape(4, 4)
>>>>> a
>> array([[0, 0, 0, 0],
>>        [1, 1, 1, 1],
>>        [2, 2, 2, 2],
>>        [3, 3, 3, 3]])
>>>>> a.shape = (2, 2, 2, 2)
>>>>> itemsize = a.dtype.itemsize
>>>>> a.strides = (8 * itemsize, 2 * itemsize, 4 * itemsize, 1 * itemsize)
>>>>> a
>> array([[[[0, 0],
>>          [1, 1]],
>>
>>         [[0, 0],
>>          [1, 1]]],
>>
>>
>>        [[[2, 2],
>>          [3, 3]],
>>
>>         [[2, 2],
>>          [3, 3]]]])
>>>>> print a.flatten()
>> [0 0 1 1 0 0 1 1 2 2 3 3 2 2 3 3]
>>
>> Now, given some buffer flag (PyBUF_Sparse or something), use basically suboffsets with indirect dimensions, where only ever the last dimension is a row of contiguous memory (the entire thing may be contiguous, but the point is that you don't care). This row may
>>
>> - be variable sized
>> - have either a single "column start" (e.g. diagonal, band/tri-diagonal, block diagonal, etc), OR
>> - have a list of column indices, the length of the row (like in the CSR format)
>>
>> The length of each innermost row is then determined by looking at, in order:
>> - the extent as specified in the shape list
>> - if -1, and some flag is set, the extent is determined like in CSR, i.e. (uintptr_t) row[i + 1] - (uintptr_t) row[i]
>>   -> maybe instead of pointers, indices are better, for serialization, GPUs, etc.
>> - otherwise, use either a function pointer or perhaps a list of extents
>>
>> All these details will obviously be abstracted, allowing for easy iteration, but it can also be used by ufuncs operating on contiguous rows (often, since the actual storage is contiguous and stored in a 1D array, some flag could indicate contiguity as an optimization for unary ufuncs and flat iteration). Tiled nditer-ation could also work here without too big a hassle.
>>
>> When you slice, you add to the suboffset and manipulate the single extent or entire list of extents in that dimension, and the flag to determine the length using the pointer subtraction should be cleared (this should probably all happen through vtable functions).
>>
>> An exporter would also be free to use different malloced pointers, allowing variable-sized array support with append/pop functionality in multiple dimensions (if there are no active buffer views).
>>
>> Random access in the case where a column start is provided is still constant time, and done through a[i][j][k - colstart], unless you have discontiguous rows, in which case you are allowed a logarithmic search (if the size exceeds some threshold). I see scipy.sparse does a linear search, which is pretty slow in pathological cases:
>>
>> from scipy import sparse
>> a = np.random.random((1, 4000000))
>> b = sparse.csr_matrix(a)
>> %timeit a[0, 1000]
>> %timeit b[0, 1000]
>>
>> 10000000 loops, best of 3: 190 ns per loop
>> 10 loops, best of 3: 29.3 ms per loop
>
> Heh. That is *extremely* pathological though, nobody does that in real code :-)
>
> Here's an idea I had yesterday: to get an ND sparse array with good spatial locality (for local operations etc.), you could map the elements to a volume-filling fractal and then use a hash-map with linear probing. I bet it either doesn't work or has been done already :-)
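
[One standard realization of the "volume-filling fractal" idea is the Z-order (Morton) curve; whether it actually helps here is exactly the open question above. A sketch using the classic bit-interleaving trick -- the function names are arbitrary:]

    #include <stdint.h>

    /* Spread the low 32 bits of x out to the even bit positions of a
       64-bit word: ...b2 b1 b0 -> ...0 b2 0 b1 0 b0. */
    static uint64_t part1by1(uint64_t x)
    {
        x &= 0x00000000FFFFFFFFull;
        x = (x | (x << 16)) & 0x0000FFFF0000FFFFull;
        x = (x | (x << 8))  & 0x00FF00FF00FF00FFull;
        x = (x | (x << 4))  & 0x0F0F0F0F0F0F0F0Full;
        x = (x | (x << 2))  & 0x3333333333333333ull;
        x = (x | (x << 1))  & 0x5555555555555555ull;
        return x;
    }

    /* Interleave (i, j) so that spatially nearby indices get nearby
       codes; the code can then be the key of a hash map with linear
       probing, keeping local neighbourhoods in nearby buckets. */
    static uint64_t morton2d(uint32_t i, uint32_t j)
    {
        return (part1by1(i) << 1) | part1by1(j);
    }

[Generalizing to N dimensions just interleaves every N-th bit instead of every other one.]
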
>>
>> Now, depending on the density and operation, the caller may have some idea of how to allocate an output array. I'm not sure how to handle "insertions" of new elements, but I presume through vtable put/insert functions. I'm also not sure how to fit this in with linear algebra functionality, other than providing conversions of the view.
>>
>> I think a disadvantage of this scheme is that you can't reorder your axes anymore, and many operations that are easy in the dense case are suddenly harder; e.g. this scheme does not allow you to go from a CSR-like format into CSC. But I think what this gives is reasonable generality to allow easy use in C/Fortran/Cython compiled code/numpy ufunc invocation, as well as allowing efficient-ish storage for various kinds of arrays.
>>
>> Does this make any sense?
>
> I'll admit I didn't go through the finer details of your idea; let's deal with the below first and then I can re-read your post later.
>
> What I'm thinking is that this seems interesting, but perhaps it lowers the versatility so much that it's not really worth considering, for the GSoC at least. If the impact isn't high enough, my hunch is that one may as well not bother and just do strided arrays.
>
> I actually believe that the *likely* outcome of this discussion is to stick to the original plan and focus on expressions with strided arrays. But let's see.
>
> I'm not sure if what brought me to the blocked case was really block-sparse arrays or diagonal arrays. Let's see...:
>
> 1) Memory conservation. The array would NOT be element-sparse; it's just that you don't want to hold it all in memory at once.
>
> Examples:
>
> - "a_array *= np.random.normal(size=a_array.shape)". The right-hand side can be generated on the fly (making it return the same data for each block every time is non-trivial but can be done). If a takes up a good chunk of RAM, there might not even be enough memory for the right-hand side except block-by-block. (That saves bandwidth too.)
>
> - You want to stream from one file to another file, neither of which will fit in RAM. Here we really don't care about speed, just code reuse... it's irritating to have to manually block in such a situation.

Actually, s/file/database/g, just to avoid comments about np.memmap.

Dag

> - You want to play with a new fancy array format (a blocked format that's faster for some linear algebra, say). But then you need to call another C library that takes data+shape+stride+dtype. Then you need to make a copy, which you'd rather avoid -- so an alternative is to make that C library be based on the block API so that it can interface with your fancy new format (and anything else).
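
[To make the code-reuse point concrete: with a block API like the one sketched near the top, the manual blocking lives in one generic routine. A sketch only -- it reuses the hypothetical block_vtable from that earlier sketch and assumes ndim <= 8 and a uniform itemsize:]

    #include <stdlib.h>
    #include <sys/types.h>

    /* Stream an N-dimensional array from one block exporter to another
       while holding only one block in memory at a time. */
    static int stream_copy(const block_vtable *src, void *src_ctx,
                           const block_vtable *dst, void *dst_ctx,
                           int ndim, const ssize_t *shape,
                           const ssize_t *block, size_t itemsize)
    {
        ssize_t lo[8], hi[8], idx[8] = {0};
        size_t nbytes = itemsize;
        for (int i = 0; i < ndim; i++) nbytes *= (size_t)block[i];
        void *buf = malloc(nbytes);
        if (!buf) return -1;

        for (;;) {
            for (int i = 0; i < ndim; i++) {
                lo[i] = idx[i];
                hi[i] = idx[i] + block[i];
                if (hi[i] > shape[i]) hi[i] = shape[i];  /* edge blocks */
            }
            if (src->get_block(src_ctx, lo, hi, buf) != 0 ||
                dst->put_block(dst_ctx, lo, hi, buf) != 0) {
                free(buf);
                return -1;
            }
            /* Advance odometer-style over the grid of blocks. */
            int d = ndim - 1;
            while (d >= 0) {
                idx[d] += block[d];
                if (idx[d] < shape[d]) break;
                idx[d] = 0;
                d--;
            }
            if (d < 0) break;   /* every block visited */
        }
        free(buf);
        return 0;
    }
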
>
> 2) Bandwidth conservation. Numexpr, Theano, Numba and Cython-vectorized-expressions all already deal with this problem on one level.
>
> However, I also believe there's a higher level in many programs where blocks come into play. The organization of many scientific codes essentially reads "do A on 8 GB of data, then do B on 8 GB of data"; and it's going to be a *long* time before you have full-program analysis to block that up in every case; the tools above will be able to deal with some almost-trivial cases only.
>
> A block API could be used to write "pipeline programs". For instance, consider
>
> a_array[:, None] * f(x_array)
>
> where f is some rather complicated routine in Fortran that is NOT a ufunc -- it takes all of x_array as input and provides all the output "at once"; but at some point it needs to do the writing of the output, and if writing to an API it could do the multiplication with a_array while the block is in cache anyway and save a memory-bus round trip. (Provided the output isn't needed as scratch space.)
>
> Summing up:
>
> Vectorized expressions on contiguous (or strided) memory in Cython are pretty nice by themselves; they would bring us up to the current state of the art of static compilation (in Fortran compilers).
>
> But my hunch is your sparse-block proposal doesn't add enough flexibility to be worth the pain. Many of the cases above can't be covered with it.
>
> If it's possible to identify a little nugget API that is forward-compatible with the above use cases (it wouldn't solve them, perhaps, but would allow them to be solved with some extra supporting code), then it might be worth it to a) implement it in NumPy, b) support it for Cython vectorized expressions, and use that to support block transposition.
>
> But I'm starting to feel that we can't really know how that little nugget API should look until the above has been explored in a little more depth; and then we are easily talking about too large a scope without tying it to a larger project (which can't really happen before the GSoC...). For instance, 2) suggests push/pull rather than put/get.
>
> Dag
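
[To make the push/pull distinction at the end concrete, a final sketch: instead of the consumer pulling blocks via get_block, a producer like f hands each finished block to a sink callback while it is still cache-hot. Everything here -- the sink signature, mul_ctx, the elided final store -- is invented for illustration:]

    #include <sys/types.h>   /* ssize_t */

    /* A "push" interface: the producer calls the sink once per
       finished 2-D output block, spanning rows/cols [lo, hi). */
    typedef int (*block_sink)(void *sink_ctx, const ssize_t *lo,
                              const ssize_t *hi, double *block);

    /* A routine like the Fortran f above would take a sink instead of
       a preallocated output array: */
    int f(const double *x, ssize_t n, block_sink push, void *sink_ctx);

    /* The consumer's sink fuses the a_array[:, None] * step into the
       write-out, saving one memory-bus round trip per element: */
    typedef struct { const double *a; } mul_ctx;

    static int mul_sink(void *sink_ctx, const ssize_t *lo,
                        const ssize_t *hi, double *block)
    {
        mul_ctx *c = (mul_ctx *)sink_ctx;
        ssize_t ncols = hi[1] - lo[1];
        for (ssize_t i = lo[0]; i < hi[0]; i++)   /* rows of the block */
            for (ssize_t j = 0; j < ncols; j++)
                block[(i - lo[0]) * ncols + j] *= c->a[i];
        /* ...then store the block to its final destination (elided). */
        return 0;
    }
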