Hi Chris,

> Actually, I agree about the truncation issue, but it's a question of where
> to put it -- I'm suggesting that I don't want it at the python<->numpy
> interface.

Yes, that's a good point. Of course, by using Latin-1 rather than UTF-8 we
can't support all Unicode code points (hence the "?" replacement possible
on read from HDF5).

> do vlen strings support full unicode? -- then, yes, that's good.

Yes, they do. It's somewhat unfortunate to immediately cast to vlen though,
since people usually have fixed-width datasets to start with for efficiency
reasons...

> what about reading from fixed-width UTF-8 to 'U' -- that seems like the
> natural way to go for unicode. Though a bit hard to know how long U needs to
> be -- but <= the length of the utf-8 array (in characters).

Space is the concern here ("U" has a 4x space penalty for ASCII-ish data).
Plus, for similar reasons to this discussion, creating "U" datasets is
unsupported in h5py at the moment.

> note that I'm also proposing a "bytes" dtype, which might make sense for
> grabbing utf-8 data from HDF-5. Then either h5py or the user could decode to
> a unicode type.

Sounds quite like the existing 'S' type.

>> In any case, I can say that the lack of a text 'S' type in NumPy has
>> been a significant pain point for h5py users on Python 3 over the
>> years.
>
> isn't the current 'S' a pretty good map to hdf ascii?

Yes; in fact, right now all fixed-width strings in h5py (ASCII and UTF-8)
are read/written as 'S'. The problem is that on Py3, 'S' is treated as
bytes, not text, so you can't freely mix it with str.

I am about to leave for the weekend... thanks for a great discussion!

To conclude, it strikes me that in choosing an encoding we get to pick at
most two of the following:

1. Support for all Unicode characters
2. Fixed number of characters
3. Fixed number of storage bytes

At this point, I would vote for UTF-8 in a fixed-width buffer (options 1
and 3), but I imagine as this progresses towards a NEP others will weigh in.

Andrew
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
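
A rough illustration of the three-way trade-off described in the message, using only the Python standard library. The helper names `fit_utf8` and `unfit_utf8` are hypothetical and are not part of NumPy or h5py; NUL padding is an assumption made for this sketch, not a statement about how either library stores fixed-width strings.

```python
def fit_utf8(text: str, nbytes: int) -> bytes:
    """Encode text as UTF-8 into exactly nbytes (hypothetical helper).

    The text is truncated if it does not fit, and NUL-padded if it is short.
    """
    raw = text.encode("utf-8")[:nbytes]
    # A byte-level cut may leave a partial code point at the end; drop it
    # so the buffer always holds valid UTF-8.
    raw = raw.decode("utf-8", errors="ignore").encode("utf-8")
    return raw.ljust(nbytes, b"\x00")


def unfit_utf8(buf: bytes) -> str:
    """Inverse of fit_utf8; assumes the text has no trailing NUL characters."""
    return buf.rstrip(b"\x00").decode("utf-8")


# UTF-8 in a fixed-width buffer (options 1 and 3): any character can be
# stored, but the number of characters that fit depends on the text.
ascii_buf = fit_utf8("abcdefgh", 8)   # 8 characters fit in 8 bytes
cjk_buf = fit_utf8("日本語", 8)        # only 2 characters fit (3 bytes each)

# Latin-1 (options 2 and 3): one byte per character, but characters
# outside Latin-1 are lost, here replaced by "?".
latin1 = "日本語".encode("latin-1", errors="replace")

# UTF-32, the layout behind NumPy's 'U': every character representable and
# a fixed character count, at four bytes per character.
utf32 = "abc".encode("utf-32-le")
```

With these definitions, `unfit_utf8(cjk_buf)` gives back only the first two characters of the input, which is the truncation behaviour a fixed-byte UTF-8 type would have to document.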