Hi Oscar,

> Is it fair to say that people should really be using vlen utf-8 strings for
> text? Is it problematic because of the need to interface with non-Python
> libraries using the same hdf5 file?
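(A quick sketch of the ragged case in question, using a plain NumPy object array; as noted below, this is the in-memory representation h5py currently uses for HDF5 vlen strings:)

```python
import numpy as np

# A ragged collection of text: each element is an ordinary Python str,
# so the array dtype is object ("O") and lengths can differ freely.
names = np.array(['alpha', 'beta', 'a much longer string'], dtype=object)

print(names.dtype)              # object
print([len(s) for s in names])  # [5, 4, 20]
```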
The general recommendation has been to use fixed-width strings for exactly
that reason; FORTRAN programs can't handle vlens, and older versions of IDL
would refuse to deal with anything labelled utf-8, even fixed-width.

>> > This may be a good case for a numpy utf-8 dtype, I suppose (or an
>> > arbitrary encoding dtype, anyway).
>
> That's what I was thinking. A ragged utf-8 array could map to an array of
> vlen strings. Or am I misunderstanding how hdf5 works?

Yes, that's exactly how HDF5 works for this; at the moment, we handle vlens
with the NumPy object ("O") type storing regular Python strings. A native
variable-length NumPy equivalent would also be appreciated, although I
suspect it's a lot of work.

> Truncating utf-8 is never a good idea. Throwing an error message when it
> would truncate is okay though. Presumably you already do this when someone
> tries to assign an ASCII string that's too long, right?

We advertise that HDF5 datasets work as closely to NumPy arrays as is
practical; in this case, NumPy truncates without warning, so we do the same.

The concern with "U" is more that someone would write a "U10" string into a
10-byte HDF5 buffer and lose data, even though the advertised widths were
the same. As an observation, a pure-ASCII NumPy type like the proposed "s"
would avoid that completely. With a latin-1 type it could still happen, as
certain characters become 2 UTF-8 bytes.

Andrew

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
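P.S. A minimal sketch of the two behaviours discussed above, in plain NumPy/Python (no h5py needed):

```python
import numpy as np

# NumPy silently truncates on assignment into a fixed-width string array:
a = np.zeros(1, dtype='S5')
a[0] = b'hello world'
print(a[0])  # b'hello' -- no warning or error

# Why a latin-1 type doesn't fully avoid the width mismatch: some
# latin-1 characters occupy 2 bytes in UTF-8, so ten latin-1
# characters may overflow a 10-byte UTF-8 buffer on the HDF5 side.
s = '\xe9' * 10  # ten e-acute characters
print(len(s.encode('latin-1')))  # 10
print(len(s.encode('utf-8')))    # 20
```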