Re: [Numpy-discussion] proposal: smaller representation of string arrays

Stephan Hoyer Mon, 24 Apr 2017 20:02:24 -0700

On Mon, Apr 24, 2017 at 7:41 PM, Nathaniel Smith <[email protected]> wrote:


> But also, is it important whether strings we're loading/saving to an
> HDF5 file have the same in-memory representation in numpy as they
> would in the file? I *know* [1] no-one is reading HDF5 files using
> np.memmap :-).


Of course they do :)
https://github.com/jjhelmus/pyfive/blob/98d26aaddd6a7d83cfb189c113e172cc1b60d5f8/pyfive/low_level.py#L682


> Also, further searching suggests that HDF5 actually supports all of
> nul termination, nul padding, and space padding, and that nul
> termination is the default? How much does it help to have in-memory
> compatibility with just one of these options (and not even the default
> one)? Would we need to add the other options to be really useful for
> HDF5?


h5py actually ignores this option and only uses null termination. I have
not heard any complaints about this (though I have heard complaints about
the lack of fixed-length UTF-8).

But more generally, you're right. h5py doesn't need a corresponding NumPy
dtype for each HDF5 string dtype, though that would certainly be
*convenient*. In fact, it already (ab)uses NumPy's dtype metadata with
h5py.special_dtype to indicate a homogeneous string type for object arrays.

I would guess h5py users have the same needs for efficient string
representations (including surrogate-escape options) as other scientific
users.

_______________________________________________
NumPy-Discussion mailing list
[email protected]
https://mail.python.org/mailman/listinfo/numpy-discussion

Re: [Numpy-discussion] proposal: smaller representation of string arrays

Reply via email to