On Mon, Nov 20, 2023 at 10:08 PM Sebastien Binet wrote:
> hi there,
>
> I have written a Go package[1] that can read/write simple arrays in the
> numpy file format [2].
> when I wrote it, it was for simple interoperability use cases, but now
> people would like to be able to read back ragged-arrays[3].
>
> unless I am mistaken, this means I need to interpret pieces of pickled
> data (`ndarray`, `multiarray` and `dtype`).
>
> so I am trying to understand how to unpickle `dtype` values that have been
> pickled:
>
> ```python
> import numpy as np
> import pickle
> import pickletools as pt
>
> pt.dis(pickle.dumps(np.dtype("int32"), protocol=4), annotate=True)
> ```
>
> gives:
> ```
> 0: \x80 PROTO 4 Protocol version indicator.
> 2: \x95 FRAME 55 Indicate the beginning of a new frame.
> 11: \x8c SHORT_BINUNICODE 'numpy' Push a Python Unicode string object.
> 18: \x94 MEMOIZE (as 0) Store the stack top into the memo.
> The stack is not popped.
> 19: \x8c SHORT_BINUNICODE 'dtype' Push a Python Unicode string object.
> 26: \x94 MEMOIZE (as 1) Store the stack top into the memo.
> The stack is not popped.
> 27: \x93 STACK_GLOBAL Push a global object (module.attr) on
> the stack.
> 28: \x94 MEMOIZE (as 2) Store the stack top into the memo.
> The stack is not popped.
> 29: \x8c SHORT_BINUNICODE 'i4' Push a Python Unicode string object.
> 33: \x94 MEMOIZE (as 3) Store the stack top into the memo.
> The stack is not popped.
> 34: \x89 NEWFALSE Push False onto the stack.
> 35: \x88 NEWTRUE Push True onto the stack.
> 36: \x87 TUPLE3 Build a three-tuple out of the top
> three items on the stack.
> 37: \x94 MEMOIZE (as 4) Store the stack top into the memo.
> The stack is not popped.
> 38: R REDUCE Push an object built from a callable
> and an argument tuple.
> 39: \x94 MEMOIZE (as 5) Store the stack top into the memo.
> The stack is not popped.
> 40: ( MARK Push markobject onto the stack.
> 41: K BININT1 3 Push a one-byte unsigned integer.
> 43: \x8c SHORT_BINUNICODE '<' Push a Python Unicode string object.
> 46: \x94 MEMOIZE (as 6) Store the stack top into the memo.
> The stack is not popped.
> 47: N NONE Push None on the stack.
> 48: N NONE Push None on the stack.
> 49: N NONE Push None on the stack.
> 50: J BININT -1 Push a four-byte signed integer.
> 55: J BININT -1 Push a four-byte signed integer.
> 60: K BININT1 0 Push a one-byte unsigned integer.
> 62: t TUPLE (MARK at 40) Build a tuple out of the topmost
> stack slice, after markobject.
> 63: \x94 MEMOIZE (as 7) Store the stack top into the
> memo. The stack is not popped.
> 64: b BUILD Finish building an object, via
> __setstate__ or dict update.
> 65: . STOP Stop the unpickling machine.
> highest protocol among opcodes = 4
> ```
>
> I have tried to find the usual `__reduce__` and `__setstate__` methods to
> understand what are the various arguments, to no avail.
>
First, be sure to read the generic `object.__reduce__` docs:
https://docs.python.org/3.11/library/pickle.html#object.__reduce__
Here is the C source for `np.dtype.__reduce__()`:
https://github.com/numpy/numpy/blob/main/numpy/_core/src/multiarray/descriptor.c#L2623-L2750
And `np.dtype.__setstate__()`:
https://github.com/numpy/numpy/blob/main/numpy/_core/src/multiarray/descriptor.c#L2787-L3151
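
For orientation, here is a rough pure-Python sketch of what the `REDUCE` and
`BUILD` opcodes in the disassembly above do with the tuple returned by
`__reduce__()`. The `replay_reduce` helper and the `Version` class are
hypothetical names made up for this illustration, and the sketch ignores the
optional 4th to 6th items of the reduce tuple as well as the slot-state case
that the real unpickler handles.

```python
def replay_reduce(reduced):
    """Simplified sketch of the REDUCE and BUILD steps of unpickling."""
    func, args, *rest = reduced
    obj = func(*args)  # REDUCE: call the callable with the argument tuple
    if rest and rest[0] is not None:
        state = rest[0]
        setstate = getattr(obj, "__setstate__", None)
        if setstate is not None:
            setstate(state)  # BUILD: prefer __setstate__ when it exists
        else:
            obj.__dict__.update(state)  # BUILD fallback: plain dict update
    return obj


class Version:
    """Toy class (not part of NumPy) whose __reduce__ returns a 3-tuple."""

    def __init__(self, name):
        self.name = name
        self.numbers = ()

    def __reduce__(self):
        # (callable, args for the callable, state for __setstate__)
        return (Version, (self.name,), (1, 2, 3))

    def __setstate__(self, state):
        self.numbers = tuple(state)


restored = replay_reduce(Version("demo").__reduce__())
print(restored.name, restored.numbers)
```

A reader written in another language would need the equivalent of these two
steps for every `(callable, args, state)` triple it wants to support.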
So, in:
> ```python
> >>> np.dtype("int32").__reduce__()[1]
> ('i4', False, True)
> ```
These are the arguments to the `np.dtype` constructor (`dtype`, `align`,
`copy`), which is documented in `np.dtype.__doc__`. The `False, True`
arguments are hardcoded: `__reduce__()` always emits those values.
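
As a quick check of the constructor step, the following sketch calls the
callable from `__reduce__()` with its argument tuple. It assumes NumPy is
installed and that the positional order of the constructor is
`dtype, align, copy`, as the docstring describes; the variable names are mine.

```python
import numpy as np

# __reduce__() returns (callable, constructor args, state for __setstate__).
func, args, state = np.dtype("int32").__reduce__()

# REDUCE step: call the callable with the argument tuple.
rebuilt = func(*args)

# The same call spelled with keywords (assumed order: dtype, align, copy).
explicit = np.dtype("i4", align=False, copy=True)

print(func, args, rebuilt, explicit)
```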
> >>> np.dtype("int32").__reduce__()[2]
> (3, '<', None, None, None, -1, -1, 0)
>
These are arguments to pass to `np.dtype.__setstate__()` after the object
has been created.
0. `3` is the version number of the state; `3` is typical for simple
dtypes; datetimes and others with metadata will bump this to `4` and use a
9-element tuple instead of this 8-element tuple.
1. `'<'` is the endianness flag.
2. If there are subarrays
<https://numpy.org/doc/stable/reference/arrays.dtypes.html#index-7> (e.g.
`np.dtype((np.int32, (2,2)))`), that information goes here, else `None`.
3. If there are fields, a tuple of the names of the fields, else `None`.
4. If there are fields, the field descriptor dict, else `None`.
5. If extended dtype (e.g. fields, strings, void, etc.), the element size,
else `-1`.
6. If extended dtype, the alignment, else `-1`.
7. The `flags` bit-flags; see `np.dtype.flags.__doc__`.
8. If datetime or with metadata, that metadata goes here, else absent.
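
The layout above can be turned into a small decoder that needs no NumPy,
which is roughly what a reader in another language would have to do. This is
only a sketch: `DTypeState`, `parse_dtype_state` and the field names are
hypothetical labels of mine, not NumPy's, and it accepts only the 8-element
and 9-element tuples described in the list.

```python
from typing import Any, NamedTuple, Optional


class DTypeState(NamedTuple):
    """Named view of the tuple passed to np.dtype.__setstate__()."""

    version: int             # 0. state version (3 or 4 in the cases above)
    byteorder: str           # 1. endianness flag, e.g. '<'
    subarray: Any            # 2. subarray info, or None
    names: Optional[tuple]   # 3. field names, or None
    fields: Optional[dict]   # 4. field descriptor dict, or None
    itemsize: int            # 5. element size, or -1
    alignment: int           # 6. alignment, or -1
    flags: int               # 7. flags bit-field
    metadata: Any = None     # 8. only present in the 9-element form


def parse_dtype_state(state: tuple) -> DTypeState:
    """Label the items of a dtype state tuple without importing NumPy."""
    if len(state) not in (8, 9):
        raise ValueError(f"unsupported dtype state length: {len(state)}")
    return DTypeState(*state)


# The state tuple shown earlier for np.dtype("int32").
simple = parse_dtype_state((3, "<", None, None, None, -1, -1, 0))
print(simple)
```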
--
Robert Kern