On Mon, Apr 24, 2017 at 4:06 PM, Aldcroft, Thomas
<aldcr...@head.cfa.harvard.edu> wrote:
>
> On Mon, Apr 24, 2017 at 4:06 PM, Robert Kern <robert.k...@gmail.com> wrote:
>>
>> I am not unfamiliar with this problem. I still work with files that have
>> fields that are supposed to be in EBCDIC but actually contain text in
>> ASCII, UTF-8 (if I'm lucky) or any of a variety of East European 8-bit
>> encodings. In that experience, I have found that just treating the data
>> as latin-1 unconditionally is not a pragmatic solution. It's really easy
>> to implement, and you do get a program that runs without raising an
>> exception (at the I/O boundary at least), but you don't often get a
>> program that really runs correctly or treats the data properly.
>>
>> Can you walk us through the problems that you are having with working
>> with these columns as arrays of `bytes`?
>
> This is very simple and obvious but I will state for the record.
I appreciate it. What is obvious to you is not obvious to me.

> Reading an HDF5 file with character data currently gives arrays of
> `bytes` [1]. In Py3 this cannot be compared to a string literal, and
> comparing to (or assigning from) explicit byte strings everywhere in the
> code quickly spins out of control. This generally forces one to convert
> the data to `U` type and incur the 4x memory bloat.
>
> In [22]: dat = np.array(['yes', 'no'], dtype='S3')
>
> In [23]: dat == 'yes'  # FAIL (but works just fine in Py2)
> Out[23]: False
>
> In [24]: dat == b'yes'  # Right answer but not practical
> Out[24]: array([ True, False], dtype=bool)

I'm curious why you think this is not practical. It seems like a very
practical solution to me.

--
Robert Kern
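A minimal sketch of the convert-to-`U` workaround under discussion, showing
where the 4x figure comes from: NumPy stores `U` data as UCS-4, i.e. 4 bytes
per character, versus 1 byte per character for `S`. The sketch uses
`np.char.decode` for illustration and assumes the bytes really are ASCII;
`dat.astype('U3')` behaves similarly:

In [25]: import numpy as np

In [26]: dat = np.array(['yes', 'no'], dtype='S3')  # 1 byte per character

In [27]: udat = np.char.decode(dat, 'ascii')  # assumes the data is really ASCII

In [28]: udat == 'yes'  # the natural comparison now works
Out[28]: array([ True, False], dtype=bool)

In [29]: (dat.itemsize, udat.itemsize)  # 3 vs 12 bytes per element: the 4x bloat
Out[29]: (3, 12)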