Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-27 Thread Francesc Alted
2017-04-27 3:34 GMT+02:00 Stephan Hoyer:
> On Wed, Apr 26, 2017 at 4:49 PM, Nathaniel Smith wrote:
>
>> It's worthwhile enough that both major HDF5 bindings don't support
>> Unicode arrays, despite user requests for years. The sticking point seems
>> to be the difference between HDF5's view of a

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-27 Thread Neal Becker
So while compression+ucs-4 might be OK for out-of-core representation, what about in-core? blosc+ucs-4? I don't think that works for mmap, does it?

On Thu, Apr 27, 2017 at 7:11 AM Francesc Alted wrote:
> 2017-04-27 3:34 GMT+02:00 Stephan Hoyer:
>
>> On Wed, Apr 26, 2017 at 4:49 PM, Nathaniel
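
For readers following the size trade-off raised in this message, here is a minimal sketch of how the in-memory footprint of a UCS-4 string array compares with a UTF-8 byte-string copy and with a compressed buffer. The sample data is made up, and zlib from the standard library stands in for blosc purely for illustration; it is not the compressor discussed in the thread.

```python
import zlib

import numpy as np

# Hypothetical sample data: 1000 short ASCII strings.
# NumPy stores these as fixed-width UCS-4, i.e. dtype '<U7' here.
words = np.array(["numpy", "hdf5", "blosc", "unicode"] * 250)

# In-core size of the UCS-4 array: 4 bytes per character,
# every item padded to the longest string (7 characters).
ucs4_bytes = words.nbytes

# The same data as fixed-width UTF-8 byte strings (dtype 'S7').
utf8 = np.char.encode(words, "utf-8")
utf8_bytes = utf8.nbytes

# Compressing the raw UCS-4 buffer (zlib as a stand-in for blosc).
compressed_bytes = len(zlib.compress(words.tobytes()))

print(ucs4_bytes, utf8_bytes, compressed_bytes)
```

The compressed buffer is small, but it has to be decompressed before any element can be read, which is presumably why it does not combine well with mmap-style random access.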

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-27 Thread Francesc Alted
2017-04-27 13:27 GMT+02:00 Neal Becker:
> So while compression+ucs-4 might be OK for out-of-core representation,
> what about in-core? blosc+ucs-4? I don't think that works for mmap, does
> it?
>
Correct, the real problem is mmap for an out-of-core, HDF5 representation, I presume. For in-mem

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-27 Thread Chris Barker
On Thu, Apr 27, 2017 at 4:10 AM, Francesc Alted wrote:
> I remember advocating for UCS-4 adoption in the HDF5 library many years
> ago (2007?), but I had no success and UTF-8 was decided to be the best
> candidate. So, the boat with HDF5 using UTF-8 sailed many years ago, and I
> don't think the

Re: [Numpy-discussion] proposal: smaller representation of string arrays

2017-04-27 Thread Francesc Alted
2017-04-27 18:18 GMT+02:00 Chris Barker:
> On Thu, Apr 27, 2017 at 4:10 AM, Francesc Alted wrote:
>
>> I remember advocating for UCS-4 adoption in the HDF5 library many years
>> ago (2007?), but I had no success and UTF-8 was decided to be the best
>> candidate. So, the boat with HDF5 using UTF