[Numpy-discussion] Re: Arrays of variable itemsize

2024-03-13 Thread Dom Grigonis
Thanks for reiterating, this looks promising!

> On 13 Mar 2024, at 23:22, Jim Pivarski wrote:
>
> So that this doesn't get lost amid the discussion:
> https://www.blosc.org/python-blosc2/python-blosc2.html
>
> Blosc is on-the-fly comp…

[Numpy-discussion] Re: Arrays of variable itemsize

2024-03-13 Thread Jim Pivarski
So that this doesn't get lost amid the discussion: https://www.blosc.org/python-blosc2/python-blosc2.html Blosc is on-the-fly compression, which is a more extreme way of making variable-sized integers. The compression is in small chunks that fit into CPU cachelines, such that it's random access pe…
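A minimal sketch of the idea with python-blosc2 (hedged: `blosc2.asarray` and NDArray slicing are taken from the python-blosc2 docs linked above; the data and sizes here are illustrative):

    import numpy as np
    import blosc2

    # Compressible data: small integers stored in a wide dtype.
    x = np.arange(1_000_000, dtype=np.int64) % 100

    # Wrap it in a compressed, chunked container.
    ca = blosc2.asarray(x)

    # Slicing decompresses only the chunks that are touched.
    print(ca[123_456:123_460])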

[Numpy-discussion] Re: Arrays of variable itemsize

2024-03-13 Thread Dom Grigonis
My array is growing in a manner of:

    array[slice] += values

so for now I will just clip values:

    res = np.add(array[slice], values, dtype=np.int64)
    array[slice] = res
    mask = res > MAX_UINT16
    array[slice][mask] = MAX_UINT16

For this case, these large values do not have that much impact. And extra ope…
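A more compact equivalent of the clipping above (a sketch; `array`, `values`, and the slice are hypothetical stand-ins for the thread's data, and `MAX_UINT16` is assumed to be `np.iinfo(np.uint16).max`):

    import numpy as np

    MAX_UINT16 = np.iinfo(np.uint16).max        # 65535
    array = np.zeros(100, dtype=np.uint16)      # hypothetical
    values = np.full(10, 70_000, dtype=np.int64)
    s = slice(10, 20)                           # hypothetical slice

    # Accumulate in a wide dtype, clip, then cast back in one assignment.
    res = np.add(array[s], values, dtype=np.int64)
    array[s] = np.minimum(res, MAX_UINT16)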

[Numpy-discussion] Re: Arrays of variable itemsize

2024-03-13 Thread Homeier, Derek
> On 13 Mar 2024, at 6:01 PM, Dom Grigonis wrote:
>
> So my array sizes in this case are 3e8; thus, 32-bit ints would be needed, so it is not a solution for this case. Nevertheless, such a concept would still be worthwhile for cases where integers are, say, max 256 bits (or unlimited): then even if memory…

[Numpy-discussion] Re: Arrays of variable itemsize

2024-03-13 Thread Jim Pivarski
Awkward is more general: it has all the same data types as (and is zero-copy compatible with) Apache Arrow. ragged is only lists (of lists) of numbers, so that it's possible to describe it as a shape and dtype. ragged adheres to the Array API, like NumPy 2.0 (am I right in that)? So, ragged is a useful…
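A short sketch of the Arrow relationship described above (hedged: `ak.to_arrow` and `ak.from_arrow` are Awkward's documented converters; the data is made up):

    import awkward as ak

    arr = ak.Array([[1, 2, 3], [], [4, 5]])  # variable-length lists

    # Awkward's layout matches Arrow's list types, so conversion is cheap.
    pa_list = ak.to_arrow(arr)
    roundtrip = ak.from_arrow(pa_list)
    print(roundtrip.to_list())  # [[1, 2, 3], [], [4, 5]]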

[Numpy-discussion] Re: Arrays of variable itemsize

2024-03-13 Thread Dom Grigonis
Thanks for this. Random access is unfortunately a requirement. By the way, what is the difference between awkward and ragged?

> On 13 Mar 2024, at 18:59, Jim Pivarski wrote:
>
> After sending that email, I realize that I have to take it back: your motivation is to minimize memory use. The v…

[Numpy-discussion] Re: Arrays of variable itemsize

2024-03-13 Thread Dom Grigonis
Yup yup, good point. So my array sizes in this case are 3e8; thus, 32-bit ints would be needed, so it is not a solution for this case. Nevertheless, such a concept would still be worthwhile for cases where integers are, say, max 256 bits (or unlimited): then even if memory addresses or offsets are 6…

[Numpy-discussion] Re: Arrays of variable itemsize

2024-03-13 Thread Jim Pivarski
After sending that email, I realize that I have to take it back: your motivation is to minimize memory use. The variable-length lists in Awkward Array (and therefore in ragged as well) are implemented using offset arrays, and they're at minimum 32-bit. The scheme is more cache-coherent (less "point…
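A sketch of the offsets scheme being described (this is the Arrow-style list layout that Awkward uses: values packed contiguously, with an offsets array marking list boundaries):

    import numpy as np

    # [[1, 2, 3], [], [4, 5]] stored as packed content + offsets:
    content = np.array([1, 2, 3, 4, 5], dtype=np.uint16)  # the small values
    offsets = np.array([0, 3, 3, 5], dtype=np.int32)      # at minimum 32-bit

    # List i is content[offsets[i]:offsets[i+1]]: random access, no pointer chasing.
    i = 2
    print(content[offsets[i]:offsets[i + 1]])  # [4 5]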

[Numpy-discussion] Re: Arrays of variable itemsize

2024-03-13 Thread Jim Pivarski
This might be a good application of Awkward Array (https://awkward-array.org), which applies a NumPy-like interface to arbitrary tree-like data, or ragged (https://github.com/scikit-hep/ragged), a restriction of that to only variable-length lists, but satisfying the Array API standard. The variabl…
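A quick illustration of the NumPy-like interface on variable-length data (standard Awkward Array usage, with made-up data):

    import awkward as ak

    arr = ak.Array([[0, 1], [2, 3, 4, 5], [6]])  # rows of different lengths

    print(arr[1, 2])            # element access: 4
    print(ak.sum(arr, axis=1))  # per-row reduction: [1, 14, 6]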

[Numpy-discussion] Re: Arrays of variable itemsize

2024-03-13 Thread Matti Picus
I am not sure what kind of scheme would support various-sized native ints. Any scheme that puts pointers in the array is going to be worse: the pointers will be 64-bit. You could store offsets to the data, but then you would need to store both the offsets and the contiguous data, nearly doubling…
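A back-of-the-envelope illustration of that overhead, using the 3e8-element array from this thread (hypothetical numbers; one offset or pointer per element is the worst case for variable-sized ints):

    N = 300_000_000
    print(N * 2 / 1e9)  # 0.6 GB: packed 16-bit values
    print(N * 4 / 1e9)  # 1.2 GB: one 32-bit offset per element
    print(N * 8 / 1e9)  # 2.4 GB: one 64-bit pointer per element

So the per-element offsets alone can cost more than the small-int data they index.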

[Numpy-discussion] Re: Arrays of variable itemsize

2024-03-13 Thread Kevin Sheppard
Does the new DType system in NumPy 2 make something like this more possible? I would suspect that the user would have to write a lot of code to have reasonable performance if it was.

Kevin

> On Wed, Mar 13, 2024 at 3:55 PM Nathan wrote:
>
> Yes, an array of references still has a fixed width…

[Numpy-discussion] Re: Arrays of variable itemsize

2024-03-13 Thread Dom Grigonis
By the way, I think I am referring to integer arrays (or the integer part of floats). I don't think what I am saying sensibly applies to floats as they are, although a new float type could base its integer part on such a concept.

Where I am coming from is that I started to hit maximum bounds on in…

[Numpy-discussion] Re: Arrays of variable itemsize

2024-03-13 Thread Nathan
Yes, an array of references still has a fixed width in the array buffer. You can think of each entry in the array as a pointer to some other memory on the heap, which can be a dynamic memory allocation. There's no way in NumPy to support variable-sized array elements in the array buffer, sinc…
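The closest thing in stock NumPy is an object array, which stores a fixed-size reference per element while the Python ints themselves live on the heap (standard NumPy behavior, shown as a sketch):

    import numpy as np

    a = np.empty(3, dtype=object)  # each slot is a fixed-size PyObject* reference
    a[0] = 1
    a[1] = 10**100                 # arbitrary-precision Python int on the heap
    a[2] = -7

    print(a + a)  # works, but arithmetic dispatches to Python objects (slow)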

[Numpy-discussion] Re: Arrays of variable itemsize

2024-03-13 Thread Dom Grigonis
Thank you for this. I am just starting to think about these things, so I appreciate your patience. But isn't it still true that all elements of an array are of the same size in memory? I am thinking along the lines of per-element dynamic memory management, such that if I had array [0, 1e…

[Numpy-discussion] Re: Arrays of variable itemsize

2024-03-13 Thread Nathan
It is possible to do this using the new DType system. Sebastian wrote a sketch for a DType backed by the GNU multiprecision float library: https://github.com/numpy/numpy-user-dtypes/tree/main/mpfdtype. It adds a significant amount of complexity to store data outside the array buffer and introduces…
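Hypothetical usage of that sketch (hedged: `MPFDType` and its precision argument are assumed from the linked numpy-user-dtypes repo; check the repo for the exact API):

    import numpy as np
    from mpfdtype import MPFDType  # assumed import from the linked repo

    # A 200-bit-precision float array; the digits live outside the main buffer.
    a = np.array([1.0, 2.0, 3.0], dtype=MPFDType(200))
    print(a + a)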

[Numpy-discussion] Arrays of variable itemsize

2024-03-13 Thread Dom Grigonis
Hi all, Take Python's builtin `int` type: it can be as large as memory allows. np.ndarray, on the other hand, is optimized for vectorization via strides, memory structure, and many things that I probably don't know. The point is that it is convenient and efficient to use for many things in com…
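A two-line illustration of the contrast being drawn (standard Python/NumPy behavior):

    import numpy as np

    print(2**200)  # Python int: grows without bound

    a = np.array([2**62, 2**62], dtype=np.int64)
    print(a + a)   # fixed-width int64: silently wraps around on overflow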