2017-04-27 3:34 GMT+02:00 Stephan Hoyer <sho...@gmail.com>:

> On Wed, Apr 26, 2017 at 4:49 PM, Nathaniel Smith <n...@pobox.com> wrote:
>
>> It's worthwhile enough that both major HDF5 bindings don't support
>> Unicode arrays, despite user requests for years. The sticking point seems
>> to be the difference between HDF5's view of a Unicode string array (defined
>> in size by the bytes of UTF-8 data) and numpy's current view of a Unicode
>> string array (because of UCS-4, defined by the number of
>> characters/codepoints/whatever). So there are HDF5 files out there that
>> none of our HDF5 bindings can read, and it is impossible to write certain
>> data efficiently.
>>
>>
>> I would really like to hear more from the authors of these libraries
>> about what exactly it is they feel they're missing. Is it that they want
>> numpy to enforce the length limit early, to catch errors when the array is
>> modified instead of when they go to write it to the file? Is it that they
>> really want an O(1) way to look at an array and know the maximum number of
>> bytes needed to represent it in utf-8? Is it that utf8<->utf-32 conversion
>> is really annoying and files that need it are rare so they haven't had the
>> motivation to implement it? My impression is similar to Julian's: you
>> *could* implement HDF5 fixed-length utf-8 <-> numpy U arrays with a few
>> dozen lines of code, which is nothing compared to all the other hoops these
>> libraries are already jumping through, so if this is really the roadblock
>> then I must be missing something.
>>
>
> I actually agree with you. I think it's mostly a matter of convenience
> that h5py matched up HDF5 dtypes with numpy dtypes:
> fixed width ASCII -> np.string_/bytes
> variable length ASCII -> object arrays of np.string_/bytes
> variable length UTF-8 -> object arrays of unicode
>
> This was tenable in a Python 2 world, but on Python 3 it's broken and
> there's not an easy fix.
>
> We absolutely could fix h5py by mapping everything to object arrays of
> Python unicode strings, as has been discussed (
> https://github.com/h5py/h5py/pull/871). For fixed width UTF-8, this would
> be a fine but non-ideal solution, since there is currently no fixed width
> UTF-8 support.
>
> For fixed width ASCII arrays, this would mean increased convenience for
> Python 3 users, at the price of decreased convenience for Python 2 users
> (arrays now contain boxed Python objects), unless we made the h5py behavior
> dependent on the version of Python. Hence, we're back here, waiting for
> better dtypes for encoded strings.
>
> So for HDF5, I see good use cases for ASCII-with-surrogateescape (for
> handling ASCII arrays as strings) and UTF-8 with length equal to the number
> of bytes.
>
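To make the size mismatch discussed in the quoted text concrete, here is a minimal sketch of a round trip between a NumPy `U` array (width counted in code points, 4 bytes each) and a fixed-width `S` array holding UTF-8 (width counted in bytes). It is only an illustration, not what h5py or PyTables do: the helper names `u_to_fixed_utf8` and `fixed_utf8_to_u` are made up for this example, and only `np.char.encode` and `np.char.decode` are real NumPy functions.

```python
import numpy as np


def u_to_fixed_utf8(arr):
    """Encode a 'U' array into a fixed-width 'S' array of UTF-8 bytes.

    Hypothetical helper: np.char.encode sizes the result to fit the
    longest encoded element, which requires a pass over all the data.
    """
    return np.char.encode(arr, "utf-8")


def fixed_utf8_to_u(raw):
    """Decode a fixed-width 'S' array of UTF-8 bytes back into a 'U' array."""
    return np.char.decode(raw, "utf-8")


names = np.array(["abc", "naïve", "日本語"])
encoded = u_to_fixed_utf8(names)
decoded = fixed_utf8_to_u(encoded)

# 'U' width is the longest string in code points, times 4 bytes.
print(names.dtype, names.dtype.itemsize)
# 'S' width is the longest string in UTF-8 bytes.
print(encoded.dtype, encoded.dtype.itemsize)
print(decoded.tolist() == names.tolist())
```

Note that the longest element differs between the two views: "naïve" has the most code points, while "日本語" needs the most UTF-8 bytes. That is why a byte-defined width cannot be derived from the `U` dtype alone without scanning the data.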
Well, I'll say upfront that I have not read this discussion in full, but
apparently some opinions from developers of HDF5 Python packages would be
welcome here, so here I go :)

As a long-time developer of one of the Python HDF5 packages (PyTables), I
have always been of the opinion that plain ASCII (for byte strings) and
UCS-4 (for Unicode) encoding would be the appropriate dtypes for storing
large amounts of data, especially for disk storage (but also when using
compressed in-memory containers). My rationale is that, although UCS-4 may
require way too much space, compression would reduce that to basically the
space that is required by compressed UTF-8 (I won't go into detail, but
basically this is possible by using the shuffle filter).

I remember advocating for UCS-4 adoption in the HDF5 library many years ago
(2007?), but I had no success and UTF-8 was decided to be the best
candidate. So, the boat with HDF5 using UTF-8 sailed many years ago, and I
don't think there is any going back (not even by adding UCS-4 support to
it, although I continue to think it would be a good idea). So, I suppose
that if HDF5 is found to be an important format for NumPy users (and I
think this is the case), a solution for representing Unicode characters by
using UTF-8 in NumPy would be desirable (at the risk of making the
implementation more complex).

Francesc

>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>
>

--
Francesc Alted
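The remark above about shuffle plus compression can be tried out with a rough sketch like the following. Everything in it is an assumption made for illustration: `byte_shuffle` and `byte_unshuffle` are simplified stand-ins written for this example (not the HDF5 or Blosc filter code), `zlib` stands in for whatever compressor is actually used, and the input is synthetic random ASCII text, so results on real or non-ASCII data will differ.

```python
import random
import string
import zlib


def byte_shuffle(raw, width=4):
    """Regroup bytes by position: byte 0 of every unit, then byte 1, etc."""
    return b"".join(raw[i::width] for i in range(width))


def byte_unshuffle(shuffled, width=4):
    """Invert byte_shuffle (input length must be a multiple of width)."""
    n = len(shuffled) // width
    out = bytearray(len(shuffled))
    for i in range(width):
        out[i::width] = shuffled[i * n:(i + 1) * n]
    return bytes(out)


# Synthetic data: 5000 random lowercase "words" joined by spaces.
rng = random.Random(0)
text = " ".join(
    "".join(rng.choice(string.ascii_lowercase) for _ in range(rng.randint(3, 12)))
    for _ in range(5000)
)

utf8 = text.encode("utf-8")
ucs4 = text.encode("utf-32-le")  # 4 bytes per code point, no BOM
shuffled = byte_shuffle(ucs4)

sizes = {
    "utf-8": len(zlib.compress(utf8)),
    "ucs-4": len(zlib.compress(ucs4)),
    "ucs-4 shuffled": len(zlib.compress(shuffled)),
}
for name, size in sizes.items():
    print(f"{name:>15}: {size} bytes compressed")
```

For ASCII-only input, the first shuffled plane is byte-for-byte the UTF-8 data and the other three planes are all zeros, which is why the shuffled UCS-4 stream should compress to roughly the UTF-8 size. Text with many non-ASCII code points spreads information into the higher planes, so the gap would be expected to grow.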