> But HDF5 additionally has a fixed-storage-width UTF8 type, so we could
> map to a NumPy fixed-storage-width type trivially.

Sure -- this is why *nix uses utf-8 for filenames -- it can just be a char*. But that just punts the problem to client code.

I think a UTF-8 string type does not match the numpy model well, and I don't think we should support it just because it would be easier for the HDF5 wrappers. (To be fair, there are probably other similar systems numpy wants to interface with that could use this...)

It seems that if you want a 1:1 binary mapping between HDF and numpy for utf strings, then a bytes type in numpy makes more sense. Numpy could/should have encode and decode methods for converting byte arrays to/from Unicode arrays (does it already?).

> "Custom" in this context means a user-created HDF5 data-conversion
> filter, which is necessary since all data conversion is handled inside
> the HDF5 library.

> As far as generic Unicode goes, we currently don't support the NumPy
> "U" dtype in h5py for similar reasons; there's no destination type in
> HDF5 which (1) would preserve the dtype for round-trip write/read
> operations and (2) doesn't risk truncation.

It sounds to me like HDF5 simply doesn't support Unicode. Calling an array of bytes utf-8 simply pushes the problem on to client libs. As that's where the problem lies, PyHDF may be the place to address it.

If we put utf-8 in numpy, we have the truncation problem there instead -- which is exactly what I think we should avoid.

> A Latin-1 based 'a' type
> would have similar problems.

Maybe not -- latin1 is fixed width.

>> Does HDF enforce ascii-only? what does it do with the > 127 values?
>
> Unfortunately/fortunately the charset is not enforced for either ASCII

So you can dump Latin-1 into and out of the HDF 'ASCII' type -- it's essentially the old char* / py2 string. An ugly situation, but why not use it?

> or UTF-8,

So ASCII and utf-8 are really the same thing, with different meta-data...

> although the HDF Group has been thinking about it.
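
To make the truncation worry and the "latin1 is fixed width" point above concrete, here is a minimal sketch in plain Python. The helpers `store_fixed` and `load_fixed` are hypothetical stand-ins for a fixed-width byte field (encode, cut to the field width, pad with NULs); they are not the NumPy or HDF5 API, just an illustration of what a fixed number of bytes does to each encoding.

```python
def store_fixed(text, encoding, width):
    """Hypothetical fixed-width byte field: encode, cut to `width` bytes, NUL-pad."""
    raw = text.encode(encoding)
    return raw[:width].ljust(width, b"\x00")


def load_fixed(field, encoding):
    """Strip the NUL padding and decode; raises UnicodeDecodeError on a cut character."""
    return field.rstrip(b"\x00").decode(encoding)


text = "±5°"  # 3 characters, all representable in Latin-1

# Latin-1 is one byte per character, so a 3-byte field holds all 3 characters.
latin1_field = store_fixed(text, "latin-1", 3)
latin1_roundtrip = load_fixed(latin1_field, "latin-1")

# In UTF-8, "±" and "°" take two bytes each, so the same text needs 5 bytes.
utf8_length = len(text.encode("utf-8"))

# A 4-byte field cuts the final "°" in half, and the stored bytes no longer decode.
utf8_field = store_fixed(text, "utf-8", 4)
try:
    utf8_roundtrip = load_fixed(utf8_field, "utf-8")
except UnicodeDecodeError:
    utf8_roundtrip = None
```

With a one-byte encoding the field width is also the character count, so what fits is predictable. With UTF-8 the width is only a byte budget, and a cut can land in the middle of a character.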

I wonder if they would consider going Latin-1 instead of ASCII -- similarly to utf-8 it's backward compatible with ASCII, but gives you a little more. I don't know that there is another 1-byte encoding worth using -- it may be my English bias, but it seems Latin-1 gives us ASCII + some extra stuff handy for science (I use the degree symbol a lot, for instance) with nothing lost.

> Ideally, NumPy would support variable-length
> strings, in which case all these headaches would go away.

Would they? That would push the problem back to PyHDF -- which I'm arguing is where it belongs, but I didn't think you were ;-)

> But I
> imagine that's also somewhat complicated. :)

That's a whole other kettle of fish, yes.

-Chris

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion