On Fri, Jan 17, 2014 at 4:43 PM, <josef.p...@gmail.com> wrote:

> On Fri, Jan 17, 2014 at 4:20 PM, Chris Barker <chris.bar...@noaa.gov> wrote:
> > On Fri, Jan 17, 2014 at 12:36 PM, <josef.p...@gmail.com> wrote:
> >>
> >> > ('S' ?) -- which is probably not what you want particularly if you specify
> >> > an encoding. Though I can't figure out at the moment why the previous one
> >> > failed -- where did the bytes object come from when the encoding was
> >> > specified?
> >>
> >> Yes, it's a utf-8 file with nonascii.
> >>
> >> I don't know what I **should** want.
> >
> >
> > well, you **should** want:
> >
> > The numbers parsed out for you (Other wise, why use recfromtxt), and the
> > text as properly decoded unicode strings.
> >
> > Python does very well with unicode -- and you are MUCH happier if you do the
> > encoding/decoding as close to I/O as possible. recfromtxt is, in a way,
> > decoding already, converting ascii representation of numbers to an internal
> > binary representation -- why not handle the text at the same time.
> >
> > There certainly are use cases for keeping the text as encoded bytes, but I'd
> > say those fall into the categories of:
> >
> > 1) Special case
> > 2) You should know what you are doing.
> >
> > So having recfromtxt auto-determine that for you makes little sense.
> >
> > Note that if you don't know the file encoding, this is tricky. My thoughts:
> >
> > 1) don't use the system default encoding!!! (see my other note on that!)
> >
> > 2) Either:
> >   a) open as a binary file and use bytes for anything that doesn't parse
> > as text -- this means that the user will need to do the conversion to text
> > themselves
> >
> >   b) decode as latin-1: this would work well for ascii and _some_ non-ascii
> > text, and would be recoverable for ALL text.
> >
> > I prefer (b). The point here is that if the user gets bytes, then they will
> > either have to assume ascii, or need to hand-decode it, and if they just
> > want assume ascii, they have a bytes object with limited text functionality
> > so will probably need to decode it anyway (unless they are just passing it
> > through)
> >
> > If the user gets unicode objects that are may not properly decoded, they can
> > either assume it was ascii, and if they only do ascii-compatible things with
> > it, it will work, or they can encode/decode it and get the proper stuff
> > back, but only if they know the encoding, and if that's the case, why did
> > they not specify that in the first place?
> >
> >>
> >> For now I do want bytes, because that's how I changed statsmodels in
> >> the py3 conversion.
> >>
> >> This was just based on the fact that recfromtxt doesn't work with
> >> strings on python 3, so I switched to using bytes following the lead
> >> of numpy.
> >
> >
> > Well, that's really too bad -- it doesn't sound like you wanted bytes, it
> > sounds like you wanted something that didn't crash -- fair enough. But the
> > "proper" solution is for recfromtext to support text....
>
> But also solution 2a) is fine for most of the code
> Often it doesn't really matter
>
> >>> dta_4
> array([(1, 2, 3, b'hello', 'hello'),
>        (5, 6, 7, b'\xc3\x95scarscar', 'Õscarscar'),
>        (15, 2, 3, b'hello', 'hello'), (20, 2, 3, b'\xc3\x95scar', 'Õscar')],
>       dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('f3', 'S10'), ('f4', '<U9')])
>
> >>> (dta_4['f3'][:, None] == np.unique(dta_4['f3'])).astype(int)
> array([[1, 0, 0],
>        [0, 0, 1],
>        [1, 0, 0],
>        [0, 1, 0]])
> >>> (dta_4['f4'][:, None] == np.unique(dta_4['f4'])).astype(int)
> array([[1, 0, 0],
>        [0, 0, 1],
>        [1, 0, 0],
>        [0, 1, 0]])
>
> similar doing a for loop comparing to the uniques.
> bytes are fine and nobody has to tell me what encoding they are using.
>
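
For reference, here is a minimal sketch of why option (b) in the quoted text is recoverable. It uses plain Python only, no numpy, and the byte string is just the value from the 'f3' field; the assumption that the file was really utf-8 is mine for the sake of the example.

```python
# Bytes as they might come out of a utf-8 file (same value as in the 'f3' field).
raw = b'\xc3\x95scar'

# latin-1 maps each of the 256 byte values to exactly one code point,
# so decoding with it never raises, whatever the real encoding was.
as_text = raw.decode('latin-1')

# For non-ascii input the text looks wrong, but nothing is lost:
# encoding back with latin-1 returns the original bytes.
assert as_text.encode('latin-1') == raw

# Once the real encoding is known (assumed utf-8 here), the proper
# text can be recovered from the latin-1 string.
recovered = as_text.encode('latin-1').decode('utf-8')
print(recovered)

# The round trip holds for every possible byte value, not only this sample.
all_bytes = bytes(range(256))
assert all_bytes.decode('latin-1').encode('latin-1') == all_bytes
```

The catch is the one already raised in the thread: the recovery step only works if somebody knows the real encoding.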
From my perspective bytes are not fine, at least if you want to use
normal string literals in Python 3:

In [64]: dat
Out[64]:
array([(1, 2, 3, b'hello', 'hello'),
       (5, 6, 7, b'\xc3\x95scarscar', 'Õscarscar'),
       (15, 2, 3, b'hello', 'hello'), (20, 2, 3, b'\xc3\x95scar', 'Õscar')],
      dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('f3', 'S10'), ('f4', '<U9')])

In [65]: dat['f3'] == 'hello'  # this is how I would find "hello" in my array, FAIL
Out[65]: False

In [66]: dat['f3'] == b'hello'  # OK, I have to use a bytestring literal
Out[66]: array([ True, False,  True, False], dtype=bool)

In [67]: dat['f4'] == 'hello'  # Works as expected for unicode field
Out[67]: array([ True, False,  True, False], dtype=bool)

And then when you want to look at your data it continues to be
difficult:

In [80]: 'The 3rd element of f3 is "%s"' % dat['f3'][2]
Out[80]: 'The 3rd element of f3 is "b\'hello\'"'

In [81]: 'The 3rd element of f3 is "%s"' % dat['f3'][2].decode('ascii')  # SIGH
Out[81]: 'The 3rd element of f3 is "hello"'

+1 for something like the latin-1 or ascii unicode dtype that can make
it a lot easier for things to just work.

- Tom

p.s. I usually use format(), not %. Alas I ran into what I think is an
old bug:

In [82]: 'The 3rd element of f3 is "{}"'.format(dat['f3'][3])
ERROR: RuntimeError: maximum recursion depth exceeded while calling a Python object [IPython.core.interactiveshell]
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-82-a7f1f486497d> in <module>()
----> 1 'The 3rd element of f3 is "{}"'.format(dat['f3'][3])

RuntimeError: maximum recursion depth exceeded while calling a Python object

>
> It doesn't work so well for pretty printing results, so using there
> latin-1 as you describe above might be a good solution if users don't
> decode to text/string
>
> Josef
>
> >
> >> I'm mainly worried about backwards compatibility, since we have been
> >> using this for 2 or 3 years. It would be easy to change in statsmodels
> >> when gen/recfromtxt is fixed, but I assume there is lots of other code
> >> using similar interpretation of S/bytes in numpy.
> >
> >
> > well, it does sound like enough folks are using 'S' to mean bytes -- too
> > bad, but what can we do now about that?
> >
> > I'd like a 's' for ascii-stings though.
> >
> > -Chris
> >
> > --
> >
> > Christopher Barker, Ph.D.
> > Oceanographer
> >
> > Emergency Response Division
> > NOAA/NOS/OR&R            (206) 526-6959   voice
> > 7600 Sand Point Way NE   (206) 526-6329   fax
> > Seattle, WA  98115       (206) 526-6317   main reception
> >
> > chris.bar...@noaa.gov
> >
> > _______________________________________________
> > NumPy-Discussion mailing list
> > NumPy-Discussion@scipy.org
> > http://mail.scipy.org/mailman/listinfo/numpy-discussion
> >
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
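
One more note on the 'S' field: if the encoding is known, the whole column can be decoded in one step instead of element by element. This is only a sketch of a workaround, not a fix for recfromtxt. The array below is a hypothetical stand-in for the 'f3' column, and utf-8 is assumed as the encoding.

```python
import numpy as np

# Hypothetical stand-in for the 'f3' column: utf-8 encoded bytes in an 'S' dtype.
f3 = np.array([b'hello', b'\xc3\x95scarscar', b'hello', b'\xc3\x95scar'],
              dtype='S10')

# np.char.decode calls bytes.decode on every element and returns a unicode
# ('U') array. The encoding has to be supplied; utf-8 is an assumption here.
f3_text = np.char.decode(f3, 'utf-8')

# Normal string literals now work for comparison ...
mask = f3_text == 'hello'
print(mask)

# ... and for formatting, without a per-element .decode() call.
print('The 3rd element of f3 is "%s"' % f3_text[2])
```

This still leaves the user responsible for knowing the encoding, which is the reason for preferring that the text be decoded at read time.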