Hi Chris,

> it looks from here:
> http://www.hdfgroup.org/HDF5/doc/ADGuide/WhatsNew180.html
>
> that HDF uses utf-8 for unicode strings -- so you _could_ roundtrip with a
> lot of calls to encode/decode -- which could be pretty slow, compared to
> other ways to dump numpy arrays into HDF-5 -- that may be what you mean by
> "doesn't round trip".

HDF5 does have variable-length string support for UTF-8, so we map that
directly to the unicode type (str on Py3) exactly as you describe, by
encoding when we write to the file. But there's no way to round-trip with
*fixed-width* strings. You can go from e.g. a 10-byte ASCII string to
"U10", but going the other way fails if there are characters which take
more than 1 byte to represent. We don't always get to choose the
destination type, e.g. when writing into an existing dataset, so we can't
always write vlen strings.

> This may be a good case for a numpy utf-8 dtype, I suppose (or an arbitrary
> encoding dtype, anyway).
> But: How does hdf handle the fact that utf-8 is not a fixed length encoding?

With fixed-width strings it doesn't, really. If you use vlen strings it's
fine, but otherwise there's just a fixed-width buffer labelled "UTF-8".
Presumably you're supposed to be careful when writing not to chop the
string off in the middle of a multibyte character. We could truncate
strings on their way to the file, but the risk of data loss/corruption led
us to simply not support it at all.

> hmm -- ascii does have those advantages, but I'm not sure it's worth the
> restriction on what can be encoded. But you're quite right, you could dump
> ascii straight into something expecting utf-8, whereas you could not do
> that with latin-1, for instance. But you can't go the other way -- does it
> help much to avoid encoding in one direction?

It would help for h5py specifically because most HDF5 strings are labelled
"ASCII". But it's a question for the community which is more important:
the high-bit characters in latin-1, or write-compatibility with UTF-8.

Andrew
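
P.S. To make the fixed-width problem concrete, here is a rough sketch in
plain Python. It does not use h5py or NumPy at all; the helper names
(fits_fixed_width, truncate_utf8) are made up for illustration and are not
part of any library. It only shows why a 10-character string may not fit a
10-byte "UTF-8" buffer, and what a careful truncation would have to do.

```python
def fits_fixed_width(text, width):
    """Return True if the UTF-8 encoding of text fits in width bytes."""
    return len(text.encode("utf-8")) <= width


def truncate_utf8(text, width):
    """Cut text to at most width bytes without splitting a character.

    Hypothetical helper: slices the encoded bytes, then drops any
    partial multibyte character left dangling at the end.
    """
    raw = text.encode("utf-8")[:width]
    # The input was valid UTF-8, so the only possible damage is an
    # incomplete trailing sequence, which errors="ignore" discards.
    return raw.decode("utf-8", errors="ignore").encode("utf-8")


ascii_text = "abcdefghij"       # 10 characters, 10 bytes
accented = "abcdefghi\u00e9"    # 10 characters, 11 bytes in UTF-8

print(fits_fixed_width(ascii_text, 10))
print(fits_fixed_width(accented, 10))

# A plain byte slice cuts the last character in half, so the buffer
# no longer holds valid UTF-8.
naive = accented.encode("utf-8")[:10]
try:
    naive.decode("utf-8")
    naive_is_valid = True
except UnicodeDecodeError:
    naive_is_valid = False
print(naive_is_valid)

# The careful version stays valid but silently loses the final character.
print(truncate_utf8(accented, 10))
```

Even the careful version loses data, which is the trade-off described
above.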