On 05/15/2012 12:42 PM, mark florisson wrote:
> On 14 May 2012 21:36, Dag Sverre Seljebotn <d.s.seljeb...@astro.uio.no> wrote:
>> On 05/14/2012 06:31 PM, mark florisson wrote:
>>> On 12 May 2012 22:55, Dag Sverre Seljebotn <d.s.seljeb...@astro.uio.no> wrote:
>>>> On 05/11/2012 03:37 PM, mark florisson wrote:
>>>>> On 11 May 2012 12:13, Dag Sverre Seljebotn <d.s.seljeb...@astro.uio.no> wrote:
>>>>>>
>>>>>> (NumPy devs: I know, I get too many ideas. But this time I *really* believe in it, I think this is going to be *huge*. And if Mark F. likes it, it's not going to be without manpower; and as his mentor I'd pitch in too here and there.)
>>>>>>
>>>>>> (Mark F.: I believe this is *very* relevant to your GSoC. I certainly don't want to micro-manage your GSoC, just have your take.)
>>>>>>
>>>>>> Travis, thank you very much for those good words in the "NA-mask interactions..." thread. It put most of my concerns away. If anybody is leaning towards opaqueness because of its OOP purity, I want to refer to C++ and its walled garden of ideological purity -- it has, what, 3-4 different OOP array libraries, none of which is able to out-compete the others. Meanwhile the rest of the world happily cooperates using pointers, strides, CSR and CSC.
>>>>>>
>>>>>> Now, there are limits to what you can do with strides and pointers. No one's denying the need for more. In my mind that's an API where you can do fetch_block and put_block of cache-sized, N-dimensional blocks on an array; but it might be something slightly different.
>>>>>>
>>>>>> Here's what I'm asking: DO NOT simply keep extending ndarray and the NumPy C API to deal with this issue.
>>>>>>
>>>>>> What we need is duck-typing/polymorphism at the C level. If you keep extending ndarray and the NumPy C API, what we'll have is a one-to-many relationship: one provider of array technology, multiple consumers (with hooks, I'm sure, but all implementations of the hook concept in the NumPy world I've seen so far are a total disaster!).
>>>>>>
>>>>>> What I think we need instead is something like PEP 3118 for the "abstract" array that is only available block-wise with getters and setters. On the Cython list we've decided that what we want for CEP 1000 (for boxing callbacks etc.) is to extend PyTypeObject with our own fields; we could create CEP 1001 to solve this issue and make any Python object an exporter of "block-getter/setter-arrays" (better name needed).
>>>>>>
>>>>>> What would be exported is (of course) a simple vtable:
>>>>>>
>>>>>> typedef struct {
>>>>>>     int (*get_block)(void *ctx, ssize_t *upper_left,
>>>>>>                      ssize_t *lower_right, ...);
>>>>>>     ...
>>>>>> } block_getter_setter_array_vtable;
>>>>>>
>>>>>> Let's please discuss the details *after* the fundamentals. But the reason I put void* there instead of PyObject* is that I hope this could be used beyond the Python world (say, Python<->Julia); the void* would be handed to you at the time you receive the vtable (however we handle that).
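>>>>>>
>>>>>> Purely to make that concrete -- every name and signature below is provisional, and put_block is just the obvious mirror of the get_block/fetch_block idea above:
>>>>>>
>>>>>> /* Sketch only, expanding the struct above; ctx is the opaque
>>>>>>    pointer handed out together with the vtable. */
>>>>>> typedef struct {
>>>>>>     int ndim;
>>>>>>     ssize_t itemsize;
>>>>>>     /* Copy the block [lo[i], hi[i]) along each axis i into
>>>>>>        caller-provided memory; return 0 on success. */
>>>>>>     int (*get_block)(void *ctx, const ssize_t *lo,
>>>>>>                      const ssize_t *hi, void *out);
>>>>>>     /* Write caller-provided memory back into that block. */
>>>>>>     int (*put_block)(void *ctx, const ssize_t *lo,
>>>>>>                      const ssize_t *hi, const void *data);
>>>>>> } block_getter_setter_array_vtable;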
>>>>>
>>>>> I suppose it would also be useful to have some way of predicting the output format polymorphically for the caller. E.g. dense * block_diagonal results in block diagonal, but dense + block_diagonal results in dense, etc. It might be useful for the caller to know whether it needs to allocate a sparse, dense or block-structured array. Or maybe the polymorphic function could even do the allocation. This needs to happen recursively of course, to avoid intermediate temporaries. The compiler could easily handle that, and so could numpy when it gets lazy evaluation.
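>>>>>
>>>>> Something as simple as this might already cover the common cases -- names entirely made up, just to illustrate: ops that preserve zeros (like *) keep the sparser operand's structure (pattern intersection), while other ops (like +) densify toward the denser operand (pattern union):
>>>>>
>>>>> /* Toy ordering: from dense (0) towards sparser layouts. */
>>>>> typedef enum { FMT_DENSE, FMT_CSR, FMT_BLOCK_DIAGONAL } fmt_t;
>>>>>
>>>>> static fmt_t predict_format(fmt_t a, fmt_t b, int preserves_zeros)
>>>>> {
>>>>>     fmt_t sparser = a > b ? a : b;
>>>>>     fmt_t denser  = a > b ? b : a;
>>>>>     return preserves_zeros ? sparser : denser;
>>>>> }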
>>>>
>>>> Ah. But that depends on the computation to be performed too: a) elementwise, b) axis-wise reductions, c) linear algebra...
>>>>
>>>> In my oomatrix code (please don't look at it, it's shameful) I do this using multiple dispatch.
>>>>
>>>> I'd rather ignore this for as long as we can, only implementing "a[:] = ..." -- I can't see how decisions here would trickle down to the API that's used in the kernel; it's more like a pre-phase, and better treated orthogonally.
>>>>
>>>>> I think if the heavy lifting of allocating output arrays and exporting these arrays works in numpy, then support in Cython could use that (I can already hear certain people object to more complicated array stuff in Cython :). Even better here would be an external project that each of our projects could use (I still think the nditer sorting functionality of arrays should be numpy-agnostic and externally available).
>>>>
>>>> I agree with the separate project idea. It's trivial for NumPy to incorporate that as one of its methods for exporting arrays, and I don't think it makes sense to either build it into Cython, or outright depend on NumPy.
>>>>
>>>> Here's what I'd like (working title: NumBridge?).
>>>>
>>>> - Mission: be the "double* + shape + strides" in a world where that is no longer enough, by providing tight, focused APIs/ABIs that are usable across C/Fortran/Python. I basically want something I can quickly acquire from a NumPy array, then pass it into my C code without dragging along all the cruft that I don't need.
>>>> - Written in pure C + specs, usable without Python.
>>>> - PEP 3118 "done right": basically semi-standardize the internal Cython memoryview ABI and get something that's passable on the stack.
>>>> - A block get/put API.
>>>> - Iterator APIs.
>>>> - Utility code for exporters and clients (iteration code, axis reordering, etc.)
>>>>
>>>> Is the scope of that insane, or is it at least worth a shot to see how bad it is? Beyond figuring out a small subset that can be done first, and whether performance considerations must be taken or not, there are two complicating factors: pluggable dtypes, and memory management. Perhaps you could come to Oslo for a couple of days to brainstorm...
>>>>
>>>> Dag
>>>
>>> The blocks are a good idea, but I think they get fairly complicated for complicated matrix layouts. It would be nice to have something that is reasonably efficient for at least most of the array storage mechanisms. I'm going to do a little brain dump below, let's see if anything is useful :)
>>>
>>> What if we basically take the CSR format and make it a little simpler, easier to handle, and better suited for other layouts. Basically, keep shape/strides for dense arrays, and for blocked arrays just "extend" your number of dimensions, i.e. a 2D blocked array becomes a 4D array, something like this:
>>>
>>> >>> a = np.arange(4).repeat(4).reshape(4, 4)
>>> >>> a
>>> array([[0, 0, 0, 0],
>>>        [1, 1, 1, 1],
>>>        [2, 2, 2, 2],
>>>        [3, 3, 3, 3]])
>>> >>> a.shape = (2, 2, 2, 2)
>>> >>> itemsize = a.dtype.itemsize
>>> >>> a.strides = (8 * itemsize, 2 * itemsize, 4 * itemsize, 1 * itemsize)
>>> >>> a
>>> array([[[[0, 0],
>>>          [1, 1]],
>>>
>>>         [[0, 0],
>>>          [1, 1]]],
>>>
>>>
>>>        [[[2, 2],
>>>          [3, 3]],
>>>
>>>         [[2, 2],
>>>          [3, 3]]]])
>>> >>> print a.flatten()
>>> [0 0 1 1 0 0 1 1 2 2 3 3 2 2 3 3]
>>>
>>> Now, given some buffer flag (PyBUF_Sparse or something), basically use suboffsets with indirect dimensions, where only ever the last dimension is a row of contiguous memory (the entire thing may be contiguous, but the point is that you don't care). This row may
>>>
>>> - be variable sized
>>> - have either a single "column start" (e.g. diagonal, band/tri-diagonal, block diagonal, etc), OR
>>> - a list of column indices, one per element in the row (like in the CSR format)
>>>
>>> The length of each innermost row is then determined by looking at, in order (a small C sketch follows below):
>>>
>>> - the extent as specified in the shape list
>>> - if -1, and some flag is set, the extent is determined like CSR, i.e. (uintptr_t) row[i + 1] - (uintptr_t) row[i]
>>>   -> maybe instead of pointers indices are better, for serialization, GPUs, etc
>>> - otherwise, use either a function pointer or perhaps a list of extents
>>>
>>> All these details will obviously be abstracted, allowing for easy iteration, but it can also be used by ufuncs operating on contiguous rows (often, since the actual storage is contiguous and stored in a 1D array, some flag could indicate contiguity as an optimization for unary ufuncs and flat iteration). Tiled nditer-ation could also work here without too big a hassle.
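>>>
>>> In C the lookup could be roughly the following (names made up; using indices rather than raw pointers, per the note above):
>>>
>>> /* Extent of innermost row i, following the rules above (sketch).
>>>    shape[d] >= 0            : fixed extent from the shape list
>>>    shape[d] == -1, csr_flag : difference of consecutive row starts
>>>    otherwise                : per-row extent table (or a callback) */
>>> static ssize_t row_extent(const ssize_t *shape, int d, int csr_flag,
>>>                           const ssize_t *row_start,
>>>                           const ssize_t *extents, ssize_t i)
>>> {
>>>     if (shape[d] >= 0)
>>>         return shape[d];
>>>     if (csr_flag)
>>>         return row_start[i + 1] - row_start[i];
>>>     return extents[i];
>>> }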
>>> When you slice, you add to the suboffset and manipulate the single extent or entire list of extents in that dimension, and the flag that determines the length via pointer subtraction should be cleared (this should probably all happen through vtable functions).
>>>
>>> An exporter would also be free to use different malloced pointers, allowing variable-sized array support with append/pop functionality in multiple dimensions (if there are no active buffer views).
>>>
>>> Random access in the case where a column start is provided is still constant time, and done through a[i][j][k - colstart], unless you have discontiguous rows, in which case you are allowed a logarithmic search (if the size exceeds some threshold). I see scipy.sparse does a linear search, which is pretty slow in pathological cases:
>>>
>>> from scipy import sparse
>>> a = np.random.random((1, 4000000))
>>> b = sparse.csr_matrix(a)
>>> %timeit a[0, 1000]
>>> %timeit b[0, 1000]
>>>
>>> 10000000 loops, best of 3: 190 ns per loop
>>> 10 loops, best of 3: 29.3 ms per loop
>>
>> Heh. That is *extremely* pathological though, nobody does that in real code :-)
>>
>> Here's an idea I had yesterday: to get an ND sparse array with good spatial locality (for local operations etc.), you could map the elements to a volume-filling fractal and then use a hash-map with linear probing. I bet it either doesn't work or has been done already :-)
>>
>>> Now, depending on the density and operation, the caller may have some idea of how to allocate an output array. I'm not sure how to handle "insertions" of new elements, but I presume through vtable put/insert functions. I'm also not sure how to fit this in with linear algebra functionality, other than providing conversions of the view.
>>>
>>> I think a disadvantage of this scheme is that you can't reorder your axes anymore, and many operations that are easy in the dense case are suddenly harder, e.g. this scheme does not allow you to go from a csr-like format into csc. But I think what this gives is reasonable generality to allow easy use in C/Fortran/Cython compiled code/numpy ufunc invocation, as well as allowing efficient-ish storage for various kinds of arrays.
>>>
>>> Does this make any sense?
>>
>> I'll admit I didn't go through the finer details of your idea; let's deal with the below first and then I can re-read your post later.
>>
>> What I'm thinking is that this seems interesting, but perhaps it lowers the versatility so much that it's not really worth considering, for the GSoC at least. If the impact isn't high enough, my hunch is that one may as well not bother and just do strided arrays.
>>
>> I actually believe that the *likely* outcome of this discussion is to stick to the original plan and focus on expressions with strided arrays. But let's see.
>>
>> I'm not sure if what brought me to the blocked case was really block-sparse arrays or diagonal arrays. Let's see...:
>>
>> 1) Memory conservation. The array is NOT element-sparse; it's just that you don't want to hold it all in memory at once.
>>
>> Examples:
>>
>> - "a_array *= np.random.normal(size=a_array.shape)". The right-hand side can be generated on the fly (making it return the same data for each block every time is non-trivial but can be done). If a_array takes up a good chunk of RAM, there might not even be enough memory for the right-hand side except block-by-block. (That saves bandwidth too.) A block-wise version is sketched right after these examples.
>>
>> - You want to stream from one file to another file, neither of which will fit in RAM. Here we really don't care about speed, just code reuse... it's irritating to have to manually block in such a situation.
>>
>> - You want to play with a new fancy array format (a blocked format that's faster for some linear algebra, say). But then you need to call another C library that takes data+shape+strides+dtype. Then you need to make a copy, which you'd rather avoid -- so an alternative is to base that C library on the block API so that it can interface with your fancy new format (and anything else).
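>>
>> Block-wise, the first example could look roughly like this (a sketch only: 2-D, doubles, reusing the provisional get_block/put_block vtable from earlier in the thread; gen_rhs is a made-up stand-in for the on-the-fly right-hand side):
>>
>> #define BLOCK 64
>> #define MIN(x, y) ((x) < (y) ? (x) : (y))
>>
>> /* "a_array *= rhs", one cache-sized block at a time. */
>> static void blocked_inplace_mul(block_getter_setter_array_vtable *vt,
>>                                 void *ctx, const ssize_t *shape,
>>                                 void (*gen_rhs)(const ssize_t *lo,
>>                                                 const ssize_t *hi,
>>                                                 double *out))
>> {
>>     double buf[BLOCK * BLOCK], rhs[BLOCK * BLOCK];
>>     ssize_t lo[2], hi[2], n, k;
>>
>>     for (lo[0] = 0; lo[0] < shape[0]; lo[0] += BLOCK)
>>         for (lo[1] = 0; lo[1] < shape[1]; lo[1] += BLOCK) {
>>             hi[0] = MIN(lo[0] + BLOCK, shape[0]);
>>             hi[1] = MIN(lo[1] + BLOCK, shape[1]);
>>             vt->get_block(ctx, lo, hi, buf);   /* fetch block */
>>             gen_rhs(lo, hi, rhs);              /* generate rhs block */
>>             n = (hi[0] - lo[0]) * (hi[1] - lo[1]);
>>             for (k = 0; k < n; k++)
>>                 buf[k] *= rhs[k];              /* combine in cache */
>>             vt->put_block(ctx, lo, hi, buf);   /* write back */
>>         }
>> }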
>> 2) Bandwidth conservation. Numexpr, Theano, Numba and Cython vectorized expressions all already deal with this problem on one level.
>>
>> However, I also believe there's a higher level in many programs where blocks come into play. The organization of many scientific codes essentially reads "do A on 8 GB of data, then do B on 8 GB of data"; and it's going to be a *long* time before you have the full-program analysis to block that up in every case; the tools above will be able to deal with some almost-trivial cases only.
>>
>> A block API could be used to write "pipeline programs". For instance, consider
>>
>> a_array[:, None] * f(x_array)
>>
>> where f is some rather complicated routine in Fortran that is NOT a ufunc -- it takes all of x_array as input and provides all the output "at once". But at some point it needs to write its output, and if it writes to a block API it could do the multiplication with a_array while the block is in cache anyway and save a memory-bus round-trip. (Provided the output isn't needed as scratch space.)
>>
>> Summing up:
>>
>> Vectorized expressions on contiguous (or strided) memory in Cython are pretty nice by themselves; they would bring us up to the current state of the art of static compilation (in Fortran compilers).
>>
>> But my hunch is your sparse-block proposal doesn't add enough flexibility to be worth the pain. Many of the cases above can't be covered with it.
>>
>> If it's possible to identify a little nugget API that is forward-compatible with the above use cases (it wouldn't solve them, perhaps, but would allow them to be solved with some extra supporting code), then it might be worth it to a) implement it in NumPy, b) support it for Cython vectorized expressions, and use that to support block transposition.
>>
>> But I'm starting to feel that we can't really know how that little nugget API should look until the above has been explored in a little more depth; and then we are easily talking a scope too large without tying it to a larger project (which can't really happen before the GSoC...). For instance, 2) suggests push/pull rather than put/get.
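>>
>> Roughly, the difference (illustration only, made-up names):
>>
>> /* Pull: the consumer asks for blocks whenever it is ready. */
>> int (*get_block)(void *ctx, const ssize_t *lo, const ssize_t *hi,
>>                  void *out);
>>
>> /* Push: the producer hands each block to a consumer callback as it
>>    is computed, so the next pipeline stage can run on the block
>>    while it is still in cache. */
>> typedef int (*block_sink)(void *sink_ctx, const ssize_t *lo,
>>                           const ssize_t *hi, const void *data);
>> int (*push_blocks)(void *ctx, block_sink sink, void *sink_ctx);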
>>
>> Dag
>
> I assumed as much; in any case I was going to start with dense arrays, simply because they are in common use and well-defined at this point. Maybe what we really want is just lazy evaluation that works for different array layouts and storage mechanisms, and a smart JIT that can evaluate our linear algebra and array expressions in an optimal fashion (memory/cache-wise, but also out-of-core-wise). This probably means writing everything in a high-level language, because even if your Fortran routines themselves used your lazy evaluation library, your linear algebra library surely wouldn't :)

My feeling is there should definitely be a very sweet spot somewhere that's not as good as everything-lazily-evaluated/full-program analysis, but still much better than where we are today. Of course, that doesn't mean that you should bother about it :-)

> Anyway, I'm going to be pragmatic and investigate how much of Theano we can reuse in Cython now.

Sounds good. At least now the larger world of blocking is in the back of our minds as things proceed.

Dag