Re: [Numpy-discussion] proposal: smaller representation of string arrays

Robert Kern Mon, 24 Apr 2017 20:08:28 -0700

On Mon, Apr 24, 2017 at 7:41 PM, Nathaniel Smith <[email protected]> wrote:
>
> On Mon, Apr 24, 2017 at 7:23 PM, Robert Kern <[email protected]> wrote:
> > On Mon, Apr 24, 2017 at 7:07 PM, Nathaniel Smith <[email protected]> wrote:
> >
> >> That said, AFAICT what people actually want in most use cases is support
> >> for arrays that can hold variable-length strings, and the only place where
> >> the current approach is *optimal* is when we need mmap compatibility with
> >> legacy formats that use fixed-width-nul-padded fields (at which point it's
> >> super convenient). It's not even possible to *represent* all Python strings
> >> or bytestrings in current numpy unicode or string arrays (Python
> >> strings/bytestrings can have trailing nuls). So if we're talking about
> >> tweaks to the current system it probably makes sense to focus on this use
> >> case specifically.
> >>
> >> From context I'm assuming FITS files use fixed-width-nul-padding for
> >> strings? Is that right? I know HDF5 doesn't.
> >
> > Yes, HDF5 does. Or at least, it is supported in addition to the
> > variable-length ones.
> >
> > https://support.hdfgroup.org/HDF5/doc/Advanced/UsingUnicode/index.html
>
> Doh, I found that page but it was (and is) meaningless to me, so I
> went by http://docs.h5py.org/en/latest/strings.html, which says the
> options are fixed-width ascii, variable-length ascii, or
> variable-length utf-8 ... I guess it's just talking about what h5py
> currently supports.


It's okay, I made exactly the same mistake earlier in the thread. :-)

> But also, is it important whether strings we're loading/saving to an
> HDF5 file have the same in-memory representation in numpy as they
> would in the file? I *know* [1] no-one is reading HDF5 files using
> np.memmap :-). Is it important for some other reason?

The lack of such a dtype seems to be the reason why neither h5py nor
PyTables supports that kind of HDF5 Dataset. The variable-length Datasets
can take up a lot of disk space because they can't be compressed (even
accounting for the wasted padding space). I mean, they probably could have
implemented it with object arrays, like h5py does with the variable-length
string Datasets, but they didn't.

https://github.com/PyTables/PyTables/issues/499
https://github.com/h5py/h5py/issues/624
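
For concreteness, here is a minimal sketch of the two points above, using
only the Python standard library rather than numpy or HDF5. The field width,
the helper names `pack_fixed`/`unpack_fixed`, and the sample values are all
made up for illustration, and zlib merely stands in for whatever compression
filter a file format might apply. It shows that a fixed-width NUL-padded
field cannot round-trip a bytestring with trailing NULs, and that the padding
itself tends to compress well.

```python
import struct
import zlib

WIDTH = 16  # hypothetical field width, chosen only for this example


def pack_fixed(values, width=WIDTH):
    """Store each bytestring in a fixed-width field, padded with NULs."""
    for v in values:
        if len(v) > width:
            raise ValueError(f"{v!r} does not fit in {width} bytes")
    # The struct "Ns" format pads short values with NUL bytes.
    return b"".join(struct.pack(f"{width}s", v) for v in values)


def unpack_fixed(buf, width=WIDTH):
    """Read the fields back, stripping the trailing NUL padding."""
    return [buf[i:i + width].rstrip(b"\x00") for i in range(0, len(buf), width)]


values = [b"alpha", b"beta\x00", b"gamma-delta"]
buf = pack_fixed(values)
decoded = unpack_fixed(buf)

# The trailing NUL in b"beta\x00" is indistinguishable from padding,
# so it is lost on the way back out.
print(decoded[1] == values[1])  # False

# Padding costs raw space, but long runs of NULs are highly redundant,
# so a general-purpose compressor shrinks them considerably.
raw = pack_fixed([b"x%d" % i for i in range(1000)])
compressed = zlib.compress(raw)
print(len(raw), len(compressed))
```

An object array sidesteps the round-trip problem because each element is a
full Python object, at the cost of no longer having a flat buffer that maps
directly onto the bytes in the file.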

--
Robert Kern

_______________________________________________
NumPy-Discussion mailing list
[email protected]
https://mail.python.org/mailman/listinfo/numpy-discussion
