Awkward is more general: it has all the same data types as (and is zero-copy compatible with) Apache Arrow. ragged is only lists (of lists) of numbers, so that it's possible to describe with a shape and dtype. ragged adheres to the Array API standard, like NumPy 2.0 (am I right about that?). So, ragged is a useful subset.
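A minimal sketch of that relationship, assuming Awkward Array 2.x and pyarrow are installed (the values are made up for illustration):

    import awkward as ak

    # a ragged array: variable-length lists of integers
    arr = ak.Array([[1, 2, 3], [], [4, 5]])

    # round-trip through Apache Arrow; the underlying buffers are
    # shared rather than copied
    arrow_arr = ak.to_arrow(arr)
    back = ak.from_arrow(arrow_arr)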
On Wed, Mar 13, 2024, 1:17 PM Dom Grigonis <[email protected]> wrote:

> Thanks for this.
>
> Random access is unfortunately a requirement.
>
> By the way, what is the difference between awkward and ragged?
>
> On 13 Mar 2024, at 18:59, Jim Pivarski <[email protected]> wrote:
>
>> After sending that email, I realize that I have to take it back: your
>> motivation is to minimize memory use. The variable-length lists in Awkward
>> Array (and therefore in ragged as well) are implemented using offset
>> arrays, and they're at minimum 32-bit. The scheme is more cache-coherent
>> (less "pointer chasing"), but it doesn't reduce the size.
>>
>> These offsets are 32-bit so that individual values can be selected from
>> the array in constant time. If you use a smaller integer size, like uint8,
>> then they have to be the numbers of elements in the lists, rather than
>> offsets (the cumulative sum of the numbers of elements). Then, to find a
>> single value, you have to add up counts from the beginning of the array.
>>
>> A standard way to store variable-length integers is to put an indicator
>> of whether you've seen the whole integer yet in the high bit (so each byte
>> effectively contributes 7 bits). That's also inherently non-random-access.
>>
>> But if random access is not a requirement, how about Blosc and bcolz?
>> Those libraries apply a very lightweight compression algorithm to the
>> arrays and decompress them on the fly (fast enough to be practical). That
>> sounds like it would fit your use case better...
>>
>> Jim
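To make the offsets-versus-counts trade-off concrete, a small NumPy sketch (the values are invented for illustration):

    import numpy as np

    # Flat content of the lists [[1, 2, 3], [], [4, 5]] plus 32-bit offsets
    # (one more offset than there are lists):
    content = np.array([1, 2, 3, 4, 5], dtype=np.uint8)
    offsets = np.array([0, 3, 3, 5], dtype=np.int32)

    # Constant-time random access to list i:
    i = 2
    lst = content[offsets[i]:offsets[i + 1]]   # -> [4, 5]

    # Storing only per-list counts (e.g. as uint8 to save memory) means the
    # offsets must be rebuilt as a cumulative sum before any single list
    # can be located:
    counts = np.array([3, 0, 2], dtype=np.uint8)
    rebuilt = np.concatenate([[0], np.cumsum(counts)])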
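And a sketch of the "high bit as a continuation flag" integer encoding mentioned above (7 payload bits per byte), which shows why it can only be decoded sequentially:

    def encode_uvarint(n):
        # least-significant 7 bits first; high bit set while more bytes follow
        out = bytearray()
        while True:
            byte = n & 0x7F
            n >>= 7
            if n:
                out.append(byte | 0x80)
            else:
                out.append(byte)
                return bytes(out)

    def decode_uvarint(buf, pos=0):
        # decode one integer starting at pos; returns (value, next position)
        value, shift = 0, 0
        while True:
            byte = buf[pos]
            pos += 1
            value |= (byte & 0x7F) << shift
            shift += 7
            if not byte & 0x80:
                return value, pos

    # Small values cost one byte and huge values cost as many bytes as they
    # need, but reaching the k-th integer means decoding everything before it.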
>> On Wed, Mar 13, 2024, 12:47 PM Jim Pivarski <[email protected]> wrote:
>>
>>> This might be a good application of Awkward Array
>>> (https://awkward-array.org), which applies a NumPy-like interface to
>>> arbitrary tree-like data, or ragged (https://github.com/scikit-hep/ragged),
>>> a restriction of that to only variable-length lists, but satisfying the
>>> Array API standard.
>>>
>>> The variable-length data in Awkward Array hasn't been used to represent
>>> arbitrary-precision integers, though. It might be a good application of
>>> "behaviors," which are documented here:
>>> https://awkward-array.org/doc/main/reference/ak.behavior.html
>>> In principle, it would be possible to define methods and overload NumPy
>>> ufuncs to interpret variable-length lists of int8 as integers with
>>> arbitrary precision. Numba might be helpful in accelerating that if
>>> normal NumPy-style vectorization is insufficient.
>>>
>>> If you're interested in following this route, I can help with first
>>> implementations of that arbitrary-precision integer behavior. (It's an
>>> interesting application!)
>>>
>>> Jim
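The per-element logic such a behavior would need can be sketched in plain Python/NumPy; the digit layout and helper names below are only illustrative, not an existing implementation, and in Awkward this logic would be attached through ak.behavior (and possibly compiled with Numba) as described above:

    import numpy as np

    # One "big integer" is a variable-length list of uint8 digits,
    # least-significant byte first (base 256).
    def to_digits(n):
        out = []
        while True:
            out.append(n & 0xFF)
            n >>= 8
            if not n:
                return np.array(out, dtype=np.uint8)

    def from_digits(digits):
        return sum(int(d) << (8 * i) for i, d in enumerate(digits))

    def add_digits(a, b):
        # schoolbook addition of two base-256 digit lists
        n = max(len(a), len(b)) + 1
        result = np.zeros(n, dtype=np.uint16)      # room for carries
        result[:len(a)] += a
        result[:len(b)] += b
        for i in range(n - 1):                     # propagate carries
            result[i + 1] += result[i] >> 8
            result[i] &= 0xFF
        if result[-1] == 0:
            result = result[:-1]                   # trim the unused top digit
        return result.astype(np.uint8)

    assert from_digits(add_digits(to_digits(2**70), to_digits(123))) == 2**70 + 123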
>>> On Wed, Mar 13, 2024, 12:28 PM Matti Picus <[email protected]> wrote:
>>>
>>>> I am not sure what kind of a scheme would support various-sized native
>>>> ints. Any scheme that puts pointers in the array is going to be worse:
>>>> the pointers will be 64-bit. You could store offsets to the data, but
>>>> then you would need to store both the offsets and the contiguous data,
>>>> nearly doubling your storage. What shape are your arrays? That would
>>>> determine the minimum size of the offsets.
>>>>
>>>> Matti
>>>>
>>>> On 13/3/24 18:15, Dom Grigonis wrote:
>>>>
>>>>> By the way, I think I am referring to integer arrays. (Or the integer
>>>>> part of floats.)
>>>>>
>>>>> I don't think what I am saying sensibly applies to floats as they are.
>>>>> Although, a new float type could base its integer part on such a concept.
>>>>>
>>>>> Where I am coming from is that I started to hit maximum bounds on
>>>>> integer arrays, where most of the values are very small and some become
>>>>> very large. And I am hitting memory limits. And I don't have many
>>>>> zeros, so sparse arrays aren't an option.
>>>>>
>>>>> Approximately:
>>>>> 90% of my arrays could fit into `np.uint8`,
>>>>> 1% requires `np.uint64`,
>>>>> and the remaining 9% are in between.
>>>>>
>>>>> And there is no predictable order of where is what, so splitting is not
>>>>> an option either.
>>>>>
>>>>> On 13 Mar 2024, at 17:53, Nathan <[email protected]> wrote:
>>>>>
>>>>>> Yes, an array of references still has a fixed width in the array
>>>>>> buffer. You can think of each entry in the array as a pointer to some
>>>>>> other memory on the heap, which can be a dynamic memory allocation.
>>>>>>
>>>>>> There's no way in NumPy to support variable-sized array elements in
>>>>>> the array buffer, since the fixed-element-size assumption is key to
>>>>>> how NumPy implements strided ufuncs and broadcasting.
>>>>>>
>>>>>> On Wed, Mar 13, 2024 at 9:34 AM Dom Grigonis <[email protected]> wrote:
>>>>>>
>>>>>>> Thank you for this.
>>>>>>>
>>>>>>> I am just starting to think about these things, so I appreciate your
>>>>>>> patience.
>>>>>>>
>>>>>>> But isn't it true that all elements of an array are still of the same
>>>>>>> size in memory?
>>>>>>>
>>>>>>> I am thinking along the lines of per-element dynamic memory
>>>>>>> management. Such that if I had the array [0, 1e10000], the first
>>>>>>> element would default to a reasonably small size in memory.
>>>>>>>
>>>>>>> On 13 Mar 2024, at 16:29, Nathan <[email protected]> wrote:
>>>>>>>
>>>>>>>> It is possible to do this using the new DType system.
>>>>>>>>
>>>>>>>> Sebastian wrote a sketch for a DType backed by the GNU
>>>>>>>> multiprecision float library:
>>>>>>>> https://github.com/numpy/numpy-user-dtypes/tree/main/mpfdtype
>>>>>>>>
>>>>>>>> Storing data outside the array buffer adds a significant amount of
>>>>>>>> complexity and introduces the possibility of use-after-free and
>>>>>>>> dangling-reference errors that are impossible if the array does not
>>>>>>>> use embedded references, so that's the main reason it hasn't been
>>>>>>>> done much.
>>>>>>>>
>>>>>>>> On Wed, Mar 13, 2024 at 8:17 AM Dom Grigonis <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> Take Python's built-in `int` type: it can be as large as memory
>>>>>>>>> allows.
>>>>>>>>>
>>>>>>>>> np.ndarray, on the other hand, is optimized for vectorization via
>>>>>>>>> strides, memory layout, and many things that I probably don't know
>>>>>>>>> about. The point is that it is convenient and efficient for many
>>>>>>>>> things in comparison to Python's built-in list of integers.
>>>>>>>>>
>>>>>>>>> So, I am wondering whether something in between exists. (And
>>>>>>>>> obviously something more clever than np.array(dtype=object).)
>>>>>>>>>
>>>>>>>>> Probably something similar to `StringDType`, but for integers and
>>>>>>>>> floats. (It's just my guess; I don't know anything about
>>>>>>>>> `StringDType`, but I'm guessing it must be better than
>>>>>>>>> np.array(dtype=object) in combination with np.vectorize.)
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> dgpb
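For reference on that last guess: NumPy 2.0 does ship StringDType, built on the new DType machinery Nathan mentions, with a fixed-width entry per element in the array buffer while the variable-length string data lives outside it. A minimal sketch, assuming NumPy >= 2.0:

    import numpy as np

    # variable-length strings in an ordinary NumPy array
    s = np.array(["a", "bb", "a much longer string"],
                 dtype=np.dtypes.StringDType())

    np.strings.str_len(s)   # -> array([ 1,  2, 20])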
