On Mon, Apr 24, 2017 at 4:06 PM, Robert Kern <robert.k...@gmail.com> wrote:
> I am not unfamiliar with this problem. I still work with files that have
> fields that are supposed to be in EBCDIC but actually contain text in
> ASCII, UTF-8 (if I'm lucky) or any of a variety of East European 8-bit
> encodings. In that experience, I have found that just treating the data as
> latin-1 unconditionally is not a pragmatic solution. It's really easy to
> implement, and you do get a program that runs without raising an exception
> (at the I/O boundary at least), but you don't often get a program that
> really runs correctly or treats the data properly.
>
> Can you walk us through the problems that you are having with working with
> these columns as arrays of `bytes`?

This is very simple and obvious but I will state it for the record. Reading
an HDF5 file with character data currently gives arrays of `bytes` [1]. In
Py3 these cannot be compared to a string literal, and comparing to (or
assigning from) explicit byte strings everywhere in the code quickly spins
out of control. This generally forces one to convert the data to `U` type
and incur the 4x memory bloat.

In [22]: dat = np.array(['yes', 'no'], dtype='S3')

In [23]: dat == 'yes'    # FAIL (but works just fine in Py2)
Out[23]: False

In [24]: dat == b'yes'   # Right answer but not practical
Out[24]: array([ True, False], dtype=bool)

- Tom

[1]: Using h5py or pytables. Same with FITS, although astropy.io.fits does
some tricks under the hood to auto-convert to `U` type as needed.

>> So I would beg to actually move forward with a pragmatic solution that
>> addresses very real and consequential problems that we face instead of
>> waiting/praying for a perfect solution.
>
> Well, I outlined a solution: work with `bytes` arrays with utilities to
> convert to/from the Unicode-aware string dtypes (or `object`).
>
> A UTF-8-specific dtype and maybe a string-specialized `object` dtype
> address the very real and consequential problems that I face (namely and
> respectively, working with HDF5 and in-memory manipulation of string
> datasets).
>
> I'm happy to consider a latin-1-specific dtype as a second,
> workaround-for-specific-applications-only-you-have-been-warned-you're-gonna-get-mojibake
> option. It should not be *the* Unicode string dtype (i.e. named
> np.realstring or np.unicode as in the original proposal).
>
> --
> Robert Kern
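
P.S. To put a number on the "4x memory bloat": NumPy's `U` dtype stores
UCS-4, i.e. 4 bytes per character instead of 1. Continuing the session
above (a minimal illustration with plain NumPy; the sizes here are just
2 elements x 3 characters):

In [25]: dat.nbytes                # 2 elements * 3 bytes
Out[25]: 6

In [26]: dat.astype('U3').nbytes   # 2 elements * 3 chars * 4 bytes
Out[26]: 24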
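And for reference on the "utilities to convert to/from" route: element-wise
converters already exist as `np.char.decode` / `np.char.encode`. A sketch of
decoding once at the I/O boundary and then comparing naturally (assuming
here that the data really is ASCII; substitute whatever encoding matches
your files):

In [27]: dat_u = np.char.decode(dat, 'ascii')   # bytes -> str, once

In [28]: dat_u == 'yes'
Out[28]: array([ True, False], dtype=bool)

In [29]: np.char.encode(dat_u, 'ascii')         # str -> bytes for writing back
Out[29]: array([b'yes', b'no'], dtype='|S3')

This doesn't avoid the 4x cost of the `U` array, but it does confine the
bytes/str handling to the I/O boundary instead of spreading b'' literals
through the code.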