So while compression + UCS-4 might be OK for an out-of-core representation, what about in-core? Blosc + UCS-4? I don't think that works for mmap, does it?
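For what it's worth, the in-core half of that question is easy to sketch with the python-blosc bindings. This is only an illustration under the assumption that the 'U' array is compressed as a plain byte buffer, not a proposed design; the mmap concern is exactly where it breaks down, as the comments note:

    import numpy as np
    import blosc  # the python-blosc package

    arr = np.array(["numpy", "unicode", "strings"] * 10000, dtype="U8")
    raw = arr.tobytes()  # UCS-4: 32 bytes per element here

    # Shuffle groups the mostly-zero high bytes of each 4-byte code
    # point together, so mostly-ASCII UCS-4 text compresses roughly as
    # well as its UTF-8 form would.
    packed = blosc.compress(raw, typesize=4, clevel=5,
                            shuffle=blosc.SHUFFLE)
    print(len(raw), len(packed))

    # Reading back requires an explicit decompression copy; the
    # compressed buffer cannot simply be mmap'ed and viewed as a 'U'
    # array, which is the limitation raised above.
    restored = np.frombuffer(blosc.decompress(packed), dtype="U8")
    assert (restored == arr).all()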
On Thu, Apr 27, 2017 at 7:11 AM Francesc Alted <fal...@gmail.com> wrote:

> 2017-04-27 3:34 GMT+02:00 Stephan Hoyer <sho...@gmail.com>:
>
>> On Wed, Apr 26, 2017 at 4:49 PM, Nathaniel Smith <n...@pobox.com> wrote:
>>
>>> It's worthwhile enough that both major HDF5 bindings don't support
>>> Unicode arrays, despite user requests for years. The sticking point
>>> seems to be the difference between HDF5's view of a Unicode string
>>> array (defined in size by the bytes of UTF-8 data) and numpy's
>>> current view of a Unicode string array (because of UCS-4, defined by
>>> the number of characters/codepoints/whatever). So there are HDF5
>>> files out there that none of our HDF5 bindings can read, and it is
>>> impossible to write certain data efficiently.
>>>
>>> I would really like to hear more from the authors of these libraries
>>> about what exactly it is they feel they're missing. Is it that they
>>> want numpy to enforce the length limit early, to catch errors when
>>> the array is modified instead of when they go to write it to the
>>> file? Is it that they really want an O(1) way to look at an array and
>>> know the maximum number of bytes needed to represent it in UTF-8? Is
>>> it that UTF-8 <-> UTF-32 conversion is really annoying, and files
>>> that need it are rare enough that they haven't had the motivation to
>>> implement it? My impression is similar to Julian's: you *could*
>>> implement HDF5 fixed-length UTF-8 <-> numpy U arrays with a few dozen
>>> lines of code, which is nothing compared to all the other hoops these
>>> libraries are already jumping through, so if this is really the
>>> roadblock then I must be missing something.
>>
>> I actually agree with you. I think it's mostly a matter of convenience
>> that h5py matched up HDF5 dtypes with numpy dtypes:
>>
>>   fixed width ASCII     -> np.string_/bytes
>>   variable length ASCII -> object arrays of np.string_/bytes
>>   variable length UTF-8 -> object arrays of unicode
>>
>> This was tenable in a Python 2 world, but on Python 3 it's broken and
>> there's no easy fix.
>>
>> We absolutely could fix h5py by mapping everything to object arrays of
>> Python unicode strings, as has been discussed
>> (https://github.com/h5py/h5py/pull/871). For fixed width UTF-8, this
>> would be a workable but non-ideal solution, since numpy currently has
>> no fixed width UTF-8 support.
>>
>> For fixed width ASCII arrays, this would mean increased convenience
>> for Python 3 users, at the price of decreased convenience for Python 2
>> users (arrays would now contain boxed Python objects), unless we made
>> the h5py behavior dependent on the version of Python. Hence, we're
>> back here, waiting for better dtypes for encoded strings.
>>
>> So for HDF5, I see good use cases for ASCII-with-surrogateescape (for
>> handling ASCII arrays as strings) and UTF-8 with length equal to the
>> number of bytes.
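As a side note, the "few dozen lines" fixed-length UTF-8 <-> numpy 'U' conversion Nathaniel estimates above might look roughly like the sketch below. The helper names are hypothetical and the scalar loop is deliberately naive; this is not code from h5py or PyTables:

    import numpy as np

    def utf8_fixed_to_unicode(buf, nbytes):
        """Decode a buffer of fixed-width, NUL-padded UTF-8 fields into a
        'U' array. A field of nbytes bytes decodes to at most nbytes code
        points (each code point takes at least one UTF-8 byte), so
        'U<nbytes>' is always wide enough."""
        n = len(buf) // nbytes
        out = np.empty(n, dtype="U%d" % nbytes)
        for i in range(n):
            field = buf[i * nbytes:(i + 1) * nbytes]
            out[i] = field.rstrip(b"\x00").decode("utf-8")
        return out

    def unicode_to_utf8_fixed(arr, nbytes):
        """Encode a 'U' array into fixed-width, NUL-padded UTF-8 fields."""
        out = bytearray(len(arr) * nbytes)
        for i, s in enumerate(arr):
            enc = s.encode("utf-8")
            if len(enc) > nbytes:
                # This check can only happen at write time as long as
                # numpy's 'U' dtype counts code points, not bytes.
                raise ValueError("%r needs %d UTF-8 bytes; field holds %d"
                                 % (s, len(enc), nbytes))
            out[i * nbytes:i * nbytes + len(enc)] = enc
        return bytes(out)

The length check on the encoding side is the part a binding must do at write time today; Nathaniel's first question above is essentially whether numpy should enforce it earlier.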
> Well, I'll say upfront that I have not read this discussion in full,
> but apparently some opinions from developers of HDF5 Python packages
> would be welcome here, so here I go :)
>
> As a long-time developer of one of the Python HDF5 packages (PyTables),
> I have always been of the opinion that plain ASCII (for byte strings)
> and UCS-4 (for Unicode) would be the appropriate dtypes for storing
> large amounts of data, especially for disk storage (but also for
> compressed in-memory containers). My rationale is that, although UCS-4
> may require way too much space, compression would reduce that to
> basically the space required by compressed UTF-8 (I won't go into
> detail, but basically this is possible by using the shuffle filter).
>
> I remember advocating for UCS-4 adoption in the HDF5 library many years
> ago (2007?), but I had no success and UTF-8 was chosen as the best
> candidate. So the boat with HDF5 using UTF-8 sailed many years ago, and
> I don't think there is any going back (not even by adding UCS-4 support
> to it, although I continue to think that would be a good idea). So I
> suppose that if HDF5 is an important format for NumPy users (and I
> think it is), a solution for representing Unicode characters by using
> UTF-8 in NumPy would be desirable, even at the risk of making the
> implementation more complex.
>
> Francesc
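A small self-contained demonstration of the shuffle effect Francesc alludes to, using only numpy and zlib. The byte transpose below is, in spirit, what Blosc's shuffle filter does per 4-byte item; the exact sizes are illustrative:

    import zlib
    import numpy as np

    # Mostly-ASCII text stored as UCS-4: three of every four bytes are
    # zero.
    arr = np.array(["some mostly ascii text"] * 5000, dtype="U24")
    raw = arr.tobytes()

    # Plain compression of the UCS-4 buffer.
    plain = zlib.compress(raw, 6)

    # "Shuffle": byte-transpose so all first bytes of the 4-byte code
    # points are stored together, then all second bytes, etc. Three of
    # the four byte planes are almost entirely zeros and compress to
    # nearly nothing.
    shuffled = np.frombuffer(raw, dtype="u1").reshape(-1, 4).T.copy()
    packed = zlib.compress(shuffled.tobytes(), 6)

    # Compare against compressing the UTF-8 form of the same text.
    utf8 = zlib.compress("".join(arr.tolist()).encode("utf-8"), 6)
    print(len(plain), len(packed), len(utf8))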