Awkward is more general: it has all the same data types as (and is zero-copy compatible with) Apache Arrow. ragged is only lists (of lists) of numbers, so that it's possible to describe with a shape and dtype. ragged adheres to the Array API standard, like NumPy 2.0 (am I right about that?). So, ragged is a useful subset.
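A minimal sketch of that relationship, assuming Awkward Array 2.x and pyarrow are installed (the values are made up for illustration):

    import awkward as ak

    # a ragged array: variable-length lists of integers
    arr = ak.Array([[1, 2, 3], [], [4, 5]])

    # round-trip through Apache Arrow; the underlying buffers are
    # shared rather than copied
    arrow_arr = ak.to_arrow(arr)
    back = ak.from_arrow(arrow_arr)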
On Wed, Mar 13, 2024, 1:17 PM Dom Grigonis <[email protected]> wrote:

> Thanks for this.
>
> Random access is unfortunately a requirement.
>
> By the way, what is the difference between awkward and ragged?
>
> On 13 Mar 2024, at 18:59, Jim Pivarski <[email protected]> wrote:
>
>> After sending that email, I realize that I have to take it back: your
>> motivation is to minimize memory use. The variable-length lists in Awkward
>> Array (and therefore in ragged as well) are implemented using offset
>> arrays, and they're at minimum 32-bit. The scheme is more cache-coherent
>> (less "pointer chasing"), but it doesn't reduce the size.
>>
>> These offsets are 32-bit so that individual values can be selected from
>> the array in constant time. If you use a smaller integer size, like uint8,
>> then they have to be the numbers of elements in the lists, rather than
>> offsets (the cumulative sum of the numbers of elements). Then, to find a
>> single value, you have to add up counts from the beginning of the array.
>>
>> A standard way to store variable-length integers is to put an indicator
>> of whether you've seen the whole integer yet in the high bit (so each byte
>> effectively contributes 7 bits). That's also inherently non-random-access.
>>
>> But if random access is not a requirement, how about Blosc and bcolz?
>> Those libraries apply a very lightweight compression algorithm to the
>> arrays and decompress them on the fly (fast enough to be practical). That
>> sounds like it would fit your use case better...
>>
>> Jim
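To make the offsets-versus-counts trade-off concrete, a small NumPy sketch (the values are invented for illustration):

    import numpy as np

    # Flat content of the lists [[1, 2, 3], [], [4, 5]] plus 32-bit offsets
    # (one more offset than there are lists):
    content = np.array([1, 2, 3, 4, 5], dtype=np.uint8)
    offsets = np.array([0, 3, 3, 5], dtype=np.int32)

    # Constant-time random access to list i:
    i = 2
    lst = content[offsets[i]:offsets[i + 1]]   # -> [4, 5]

    # Storing only per-list counts (e.g. as uint8 to save memory) means the
    # offsets must be rebuilt as a cumulative sum before any single list
    # can be located:
    counts = np.array([3, 0, 2], dtype=np.uint8)
    rebuilt = np.concatenate([[0], np.cumsum(counts)])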
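And a sketch of the "high bit as a continuation flag" integer encoding mentioned above (7 payload bits per byte), which shows why it can only be decoded sequentially:

    def encode_uvarint(n):
        # least-significant 7 bits first; high bit set while more bytes follow
        out = bytearray()
        while True:
            byte = n & 0x7F
            n >>= 7
            if n:
                out.append(byte | 0x80)
            else:
                out.append(byte)
                return bytes(out)

    def decode_uvarint(buf, pos=0):
        # decode one integer starting at pos; returns (value, next position)
        value, shift = 0, 0
        while True:
            byte = buf[pos]
            pos += 1
            value |= (byte & 0x7F) << shift
            shift += 7
            if not byte & 0x80:
                return value, pos

    # Small values cost one byte and huge values cost as many bytes as they
    # need, but reaching the k-th integer means decoding everything before it.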
>> On Wed, Mar 13, 2024, 12:47 PM Jim Pivarski <[email protected]> wrote:
>>
>>> This might be a good application of Awkward Array
>>> (https://awkward-array.org), which applies a NumPy-like interface to
>>> arbitrary tree-like data, or ragged (https://github.com/scikit-hep/ragged),
>>> a restriction of that to only variable-length lists, but satisfying the
>>> Array API standard.
>>>
>>> The variable-length data in Awkward Array hasn't been used to represent
>>> arbitrary-precision integers, though. It might be a good application of
>>> "behaviors," which are documented here:
>>> https://awkward-array.org/doc/main/reference/ak.behavior.html
>>> In principle, it would be possible to define methods and overload NumPy
>>> ufuncs to interpret variable-length lists of int8 as integers with
>>> arbitrary precision. Numba might be helpful in accelerating that if
>>> normal NumPy-style vectorization is insufficient.
>>>
>>> If you're interested in following this route, I can help with first
>>> implementations of that arbitrary-precision integer behavior. (It's an
>>> interesting application!)
>>>
>>> Jim
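The per-element logic such a behavior would need can be sketched in plain Python/NumPy; the digit layout and helper names below are only illustrative, not an existing implementation, and in Awkward this logic would be attached through ak.behavior (and possibly compiled with Numba) as described above:

    import numpy as np

    # One "big integer" is a variable-length list of uint8 digits,
    # least-significant byte first (base 256).
    def to_digits(n):
        out = []
        while True:
            out.append(n & 0xFF)
            n >>= 8
            if not n:
                return np.array(out, dtype=np.uint8)

    def from_digits(digits):
        return sum(int(d) << (8 * i) for i, d in enumerate(digits))

    def add_digits(a, b):
        # schoolbook addition of two base-256 digit lists
        n = max(len(a), len(b)) + 1
        result = np.zeros(n, dtype=np.uint16)      # room for carries
        result[:len(a)] += a
        result[:len(b)] += b
        for i in range(n - 1):                     # propagate carries
            result[i + 1] += result[i] >> 8
            result[i] &= 0xFF
        if result[-1] == 0:
            result = result[:-1]                   # trim the unused top digit
        return result.astype(np.uint8)

    assert from_digits(add_digits(to_digits(2**70), to_digits(123))) == 2**70 + 123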
>>> On Wed, Mar 13, 2024, 12:28 PM Matti Picus <[email protected]> wrote:
>>>
>>>> I am not sure what kind of a scheme would support various-sized native
>>>> ints. Any scheme that puts pointers in the array is going to be worse:
>>>> the pointers will be 64-bit. You could store offsets to the data, but
>>>> then you would need to store both the offsets and the contiguous data,
>>>> nearly doubling your storage. What shape are your arrays? That would
>>>> determine the minimum size of the offsets.
>>>>
>>>> Matti
>>>>
>>>> On 13/3/24 18:15, Dom Grigonis wrote:
>>>>
>>>>> By the way, I think I am referring to integer arrays. (Or the integer
>>>>> part of floats.)
>>>>>
>>>>> I don't think what I am saying sensibly applies to floats as they are.
>>>>> Although, a new float type could base its integer part on such a concept.
>>>>>
>>>>> Where I am coming from is that I started to hit maximum bounds on
>>>>> integer arrays, where most of the values are very small and some become
>>>>> very large. And I am hitting memory limits. And I don't have many
>>>>> zeros, so sparse arrays aren't an option.
>>>>>
>>>>> Approximately:
>>>>> 90% of my arrays could fit into `np.uint8`,
>>>>> 1% requires `np.uint64`,
>>>>> and the remaining 9% are in between.
>>>>>
>>>>> And there is no predictable order of where is what, so splitting is not
>>>>> an option either.
>>>>>
>>>>> On 13 Mar 2024, at 17:53, Nathan <[email protected]> wrote:
>>>>>
>>>>>> Yes, an array of references still has a fixed width in the array
>>>>>> buffer. You can think of each entry in the array as a pointer to some
>>>>>> other memory on the heap, which can be a dynamic memory allocation.
>>>>>>
>>>>>> There's no way in NumPy to support variable-sized array elements in
>>>>>> the array buffer, since the fixed-element-size assumption is key to
>>>>>> how NumPy implements strided ufuncs and broadcasting.
>>>>>>
>>>>>> On Wed, Mar 13, 2024 at 9:34 AM Dom Grigonis <[email protected]> wrote:
>>>>>>
>>>>>>> Thank you for this.
>>>>>>>
>>>>>>> I am just starting to think about these things, so I appreciate your
>>>>>>> patience.
>>>>>>>
>>>>>>> But isn't it true that all elements of an array are still of the same
>>>>>>> size in memory?
>>>>>>>
>>>>>>> I am thinking along the lines of per-element dynamic memory
>>>>>>> management. Such that if I had the array [0, 1e10000], the first
>>>>>>> element would default to a reasonably small size in memory.
>>>>>>>
>>>>>>> On 13 Mar 2024, at 16:29, Nathan <[email protected]> wrote:
>>>>>>>
>>>>>>>> It is possible to do this using the new DType system.
>>>>>>>>
>>>>>>>> Sebastian wrote a sketch for a DType backed by the GNU
>>>>>>>> multiprecision float library:
>>>>>>>> https://github.com/numpy/numpy-user-dtypes/tree/main/mpfdtype
>>>>>>>>
>>>>>>>> Storing data outside the array buffer adds a significant amount of
>>>>>>>> complexity and introduces the possibility of use-after-free and
>>>>>>>> dangling-reference errors that are impossible if the array does not
>>>>>>>> use embedded references, so that's the main reason it hasn't been
>>>>>>>> done much.
>>>>>>>>
>>>>>>>> On Wed, Mar 13, 2024 at 8:17 AM Dom Grigonis <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> Take Python's built-in `int` type: it can be as large as memory
>>>>>>>>> allows.
>>>>>>>>>
>>>>>>>>> np.ndarray, on the other hand, is optimized for vectorization via
>>>>>>>>> strides, memory layout, and many things that I probably don't know
>>>>>>>>> about. The point is that it is convenient and efficient for many
>>>>>>>>> things in comparison to Python's built-in list of integers.
>>>>>>>>>
>>>>>>>>> So, I am wondering whether something in between exists. (And
>>>>>>>>> obviously something more clever than np.array(dtype=object).)
>>>>>>>>>
>>>>>>>>> Probably something similar to `StringDType`, but for integers and
>>>>>>>>> floats. (It's just my guess; I don't know anything about
>>>>>>>>> `StringDType`, but I'm guessing it must be better than
>>>>>>>>> np.array(dtype=object) in combination with np.vectorize.)
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> dgpb
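For reference on that last guess: NumPy 2.0 does ship StringDType, built on the new DType machinery Nathan mentions, with a fixed-width entry per element in the array buffer while the variable-length string data lives outside it. A minimal sketch, assuming NumPy >= 2.0:

    import numpy as np

    # variable-length strings in an ordinary NumPy array
    s = np.array(["a", "bb", "a much longer string"],
                 dtype=np.dtypes.StringDType())

    np.strings.str_len(s)   # -> array([ 1,  2, 20])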
