2017-04-27 13:27 GMT+02:00 Neal Becker <ndbeck...@gmail.com>:

> So while compression+ucs-4 might be OK for out-of-core representation,
> what about in-core?  blosc+ucs-4?  I don't think that works for mmap,
> does it?
>

Correct, the real problem is mmap for an out-of-core, HDF5
representation, I presume.  For in-memory use, there are several
compressed data containers, like:

https://github.com/alimanfoo/zarr (meant mainly for multidimensional
data containers)
https://github.com/Blosc/bcolz (meant mainly for tabular data
containers)

(there might be others).
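For example, a minimal zarr sketch (untested; it assumes zarr accepts
numpy's fixed-width 'U' dtype directly and relies on its default
Blosc-based compressor):

    import numpy as np
    import zarr

    a = np.array(['numpy', 'unicode', 'strings'] * 100000, dtype='U7')
    z = zarr.array(a, chunks=10000)   # chunked and compressed, in memory
    print(z.nbytes, z.nbytes_stored)  # raw UCS-4 footprint vs. stored bytes

Slicing, e.g. z[:10], decompresses back to an ordinary numpy array, so
the compressed container can stand in for the plain one in many
workflows.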
> On Thu, Apr 27, 2017 at 7:11 AM Francesc Alted <fal...@gmail.com> wrote:
>
>> 2017-04-27 3:34 GMT+02:00 Stephan Hoyer <sho...@gmail.com>:
>>
>>> On Wed, Apr 26, 2017 at 4:49 PM, Nathaniel Smith <n...@pobox.com> wrote:
>>>
>>>> It's worthwhile enough that both major HDF5 bindings don't support
>>>> Unicode arrays, despite user requests for years.  The sticking point
>>>> seems to be the difference between HDF5's view of a Unicode string
>>>> array (defined in size by the bytes of UTF-8 data) and numpy's
>>>> current view of a Unicode string array (because of UCS-4, defined by
>>>> the number of characters/codepoints/whatever).  So there are HDF5
>>>> files out there that none of our HDF5 bindings can read, and it is
>>>> impossible to write certain data efficiently.
>>>>
>>>> I would really like to hear more from the authors of these libraries
>>>> about what exactly it is they feel they're missing.  Is it that they
>>>> want numpy to enforce the length limit early, to catch errors when
>>>> the array is modified instead of when they go to write it to the
>>>> file?  Is it that they really want an O(1) way to look at an array
>>>> and know the maximum number of bytes needed to represent it in
>>>> UTF-8?  Is it that UTF-8 <-> UTF-32 conversion is really annoying,
>>>> and files that need it are rare, so they haven't had the motivation
>>>> to implement it?  My impression is similar to Julian's: you *could*
>>>> implement HDF5 fixed-length UTF-8 <-> numpy U arrays with a few dozen
>>>> lines of code, which is nothing compared to all the other hoops these
>>>> libraries are already jumping through, so if this is really the
>>>> roadblock then I must be missing something.
>>>
>>> I actually agree with you.  I think it's mostly a matter of
>>> convenience that h5py matched up HDF5 dtypes with numpy dtypes:
>>>
>>> fixed width ASCII -> np.string_/bytes
>>> variable length ASCII -> object arrays of np.string_/bytes
>>> variable length UTF-8 -> object arrays of unicode
>>>
>>> This was tenable in a Python 2 world, but on Python 3 it's broken and
>>> there's not an easy fix.
>>>
>>> We absolutely could fix h5py by mapping everything to object arrays
>>> of Python unicode strings, as has been discussed
>>> (https://github.com/h5py/h5py/pull/871).  For fixed width UTF-8, this
>>> would be a fine but non-ideal solution, since there is currently no
>>> fixed width UTF-8 support.
>>>
>>> For fixed width ASCII arrays, this would mean increased convenience
>>> for Python 3 users, at the price of decreased convenience for Python
>>> 2 users (arrays now contain boxed Python objects), unless we made the
>>> h5py behavior dependent on the version of Python.  Hence, we're back
>>> here, waiting for better dtypes for encoded strings.
>>>
>>> So for HDF5, I see good use cases for ASCII-with-surrogateescape (for
>>> handling ASCII arrays as strings) and UTF-8 with length equal to the
>>> number of bytes.
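Both of Stephan's candidates are cheap to pin down in plain Python.  A
rough, untested sketch (the helper names are invented here for
illustration): arbitrary bytes survive an ASCII-with-surrogateescape
round trip losslessly, and the fixed-length UTF-8 <-> numpy 'U'
conversion Nathaniel mentions really is only a handful of lines:

    import numpy as np

    # ASCII-with-surrogateescape: arbitrary bytes round-trip losslessly.
    raw = b'caf\xe9'
    s = raw.decode('ascii', 'surrogateescape')
    assert s.encode('ascii', 'surrogateescape') == raw

    def u_to_fixed_utf8(arr):
        # Encode each UCS-4 element as UTF-8 and pad to the widest
        # result, mirroring HDF5's fixed-length UTF-8 strings (sized in
        # bytes rather than codepoints).
        encoded = [u.encode('utf-8') for u in arr.ravel()]
        width = max(map(len, encoded), default=1) or 1  # 'S0' is invalid
        return np.array(encoded, dtype='S%d' % width).reshape(arr.shape)

    def fixed_utf8_to_u(arr):
        # Decode back; numpy sizes the 'U' result by codepoints, not
        # bytes, so the itemsize generally changes across the round trip.
        decoded = [b.decode('utf-8') for b in arr.ravel()]
        return np.array(decoded).reshape(arr.shape)

This ignores the error handling (over-long input, invalid UTF-8) that a
real binding would need, which is presumably where the actual work lies.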
>>
>> Well, I'll say upfront that I have not read this discussion in full,
>> but apparently some opinions from developers of HDF5 Python packages
>> would be welcome here, so here I go :)
>>
>> As a long-time developer of one of the Python HDF5 packages
>> (PyTables), I have always been of the opinion that plain ASCII (for
>> byte strings) and UCS-4 (for Unicode) would be the appropriate
>> encodings for storing large amounts of data, most especially for disk
>> storage (but also for compressed in-memory containers).  My rationale
>> is that, although UCS-4 may require way too much space, compression
>> would reduce that to basically the space required by compressed UTF-8
>> (I won't go into detail, but basically this is possible by using the
>> shuffle filter).
>>
>> I remember advocating for UCS-4 adoption in the HDF5 library many
>> years ago (2007?), but I had no success and UTF-8 was chosen as the
>> best candidate.  So the boat with HDF5 using UTF-8 sailed many years
>> ago, and I don't think there is any going back (not even by adding
>> UCS-4 support to it, although I continue to think that would be a
>> good idea).  So I suppose that if HDF5 is an important format for
>> NumPy users (and I think it is), a solution for representing Unicode
>> characters as UTF-8 in NumPy would be desirable, even at the risk of
>> making the implementation more complex.
>>
>> Francesc
>>
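The shuffle argument above is easy to try with python-blosc; a rough,
untested sketch (the corpus and parameters are arbitrary):

    import blosc
    import numpy as np

    words = np.array(['compression', 'shuffle', 'unicode'] * 100000,
                     dtype='U11')
    raw = words.tobytes()  # UCS-4: a fixed 4 bytes per codepoint
    # Shuffling 4-byte items groups the mostly-zero high bytes together,
    # where the codec can squeeze them out almost entirely.
    packed = blosc.compress(raw, typesize=4, shuffle=blosc.SHUFFLE)
    utf8 = ''.join(words).encode('utf-8')  # rough UTF-8 size baseline
    print(len(raw), len(packed), len(utf8))

For mostly-ASCII text, the compressed UCS-4 buffer should land in the
same ballpark as the raw UTF-8 encoding, which is the crux of the
argument.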
--
Francesc Alted

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@python.org
https://mail.python.org/mailman/listinfo/numpy-discussion