Hi Oscar,

> Is it fair to say that people should really be using vlen utf-8 strings for
> text? Is it problematic because of the need to interface with non-Python
> libraries using the same hdf5 file?

The general recommendation has been to use fixed-width strings, for
exactly that reason: FORTRAN programs can't handle vlens, and older
versions of IDL would refuse to deal with anything labelled UTF-8,
even when it was fixed-width.

>> > This may be a good case for a numpy utf-8 dtype, I suppose (or an arbitrary
>> > encoding dtype, anyway).
>
> That's what I was thinking. A ragged utf-8 array could map to an array of vlen
> strings. Or am I misunderstanding how hdf5 works?

No, you have it right: that's exactly how HDF5 works for this. At the
moment, we handle vlens with the NumPy object ("O") dtype, storing
regular Python strings. A native variable-length NumPy equivalent
would also be appreciated, although I suspect it's a lot of work.
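
As a rough illustration of the object-dtype approach, here is a
minimal NumPy-only sketch. The sample strings and variable names are
made up for the example, and no HDF5 file is involved.

```python
import numpy as np

# Hypothetical sample data: strings of different lengths ("ragged" text).
names = ["a", "na\u00efve", "a much longer string"]

# Object ("O") array: each element is a reference to an ordinary Python
# str, so no fixed width is imposed on the elements.
obj_arr = np.array(names, dtype=object)

# Fixed-width unicode array for comparison: every element is stored at
# the width of the longest string (20 characters here).
fixed_arr = np.array(names)

print(obj_arr.dtype.kind)    # kind code of the object array
print(fixed_arr.dtype.kind)  # kind code of the fixed-width array
print(fixed_arr.dtype.itemsize)  # bytes reserved per element
```

The object array keeps each string at its own length, which is why it
lines up naturally with variable-length storage.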

> Truncating utf-8 is never a good idea. Throwing an error message when it would
> truncate is okay though. Presumably you already do this when someone tries to
> assign an ASCII string that's too long right?

We advertise that HDF5 datasets behave like NumPy arrays (as closely
as practical); in this case, NumPy truncates silently, without a
warning, so we do the same.
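
The NumPy behaviour being mirrored can be seen with a small sketch.
The array name and the sample value are invented for the example; the
warning capture is only there to show whether anything is reported.

```python
import warnings

import numpy as np

# Two fixed-width slots of 5 bytes each.
arr = np.zeros(2, dtype="S5")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Assign an 11-byte value into a 5-byte slot.
    arr[0] = b"hello world"

print(arr[0])       # only the first 5 bytes are kept
print(len(caught))  # number of warnings raised by the assignment
```

Only the leading bytes that fit in the slot survive the assignment.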

The concern with "U" is more that someone would write a "U10" string
into a 10-byte HDF5 buffer and lose data, even though the advertised
widths look the same: "U10" counts 10 characters, whereas the HDF5
width counts bytes. As an observation, a pure-ASCII NumPy type like
the proposed "s" would avoid that completely. With a latin-1 type, it
could still happen, as certain characters would become 2 UTF-8 bytes.
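
The byte arithmetic can be checked in plain Python. This is only a
sketch: the helper utf8_len and the sample strings are hypothetical,
and the slice stands in for writing into a 10-byte buffer.

```python
def utf8_len(text: str) -> int:
    """Return the number of bytes needed to store text as UTF-8."""
    return len(text.encode("utf-8"))


ascii_text = "abcdefghij"                # 10 ASCII characters
latin1_text = "caf\u00e9 cr\u00e8me"     # 10 characters, two non-ASCII

print(utf8_len(ascii_text))                # 10: one byte per character
print(len(latin1_text.encode("latin-1")))  # 10: one byte per character
print(utf8_len(latin1_text))               # 12: two characters need 2 bytes

# Forcing the 12-byte UTF-8 form into a 10-byte buffer drops the tail.
truncated = latin1_text.encode("utf-8")[:10]
print(truncated.decode("utf-8"))           # the last two characters are gone
```

Both strings are 10 characters wide, but only the ASCII one fits in
10 bytes once it is encoded as UTF-8.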

Andrew
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
