Re: Bytes indexing returns an int

2014-01-09 Thread Piet van Oostrum
Ned Batchelder n...@nedbatchelder.com writes:

 On 1/8/14 11:08 AM, wxjmfa...@gmail.com wrote:
 Byte strings (encoded code points) or native unicode is one
 thing.

 But on the other side, the problem is elsewhere. These very
 talented ascii narrow minded, unicode illiterate devs only
 succeeded to produce this (I, really, do not wish to be rude).

 If you don't want to be rude, you are failing.  You've been told a
 number of times that your obscure micro-benchmarks are meaningless.  Now
 you've taken to calling the core devs narrow-minded and Unicode
 illiterate.  They are neither of these things.

 Continuing to post these comments with no interest in learning is rude.
 Other recent threads have contained detailed rebuttals of your views,
 which you have ignored.  This is rude. Please stop.

Please ignore jmf's repeated nonsense.
-- 
Piet van Oostrum p...@vanoostrum.org
WWW: http://pietvanoostrum.com/
PGP key: [8DAE142BE17999C4]
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Bytes indexing returns an int

2014-01-09 Thread Ethan Furman

On 01/09/2014 09:05 AM, Piet van Oostrum wrote:

Ned Batchelder n...@nedbatchelder.com writes:


On 1/8/14 11:08 AM, wxjmfa...@gmail.com wrote:

Byte strings (encoded code points) or native unicode is one
thing.

But on the other side, the problem is elsewhere. These very
talented ascii narrow minded, unicode illiterate devs only
succeeded to produce this (I, really, do not wish to be rude).


If you don't want to be rude, you are failing.  You've been told a
number of times that your obscure micro-benchmarks are meaningless.  Now
you've taken to calling the core devs narrow-minded and Unicode
illiterate.  They are neither of these things.

Continuing to post these comments with no interest in learning is rude.
Other recent threads have contained detailed rebuttals of your views,
which you have ignored.  This is rude. Please stop.


Please ignore jmf's repeated nonsense.


Or ban him.  His one, minor, contribution has been completely swamped by the rest of his belligerent, unfounded, refuted 
posts.


--
~Ethan~


Re: Bytes indexing returns an int

2014-01-09 Thread Serhiy Storchaka

09.01.14 19:28, Ethan Furman wrote:

On 01/09/2014 09:05 AM, Piet van Oostrum wrote:

Please ignore jmf's repeated nonsense.


Or ban him.  His one, minor, contribution has been completely swamped by
the rest of his belligerent, unfounded, refuted posts.


Please don't. I get some fun out of his every appearance.



Re: Bytes indexing returns an int

2014-01-08 Thread Robin Becker

On 07/01/2014 19:48, Serhiy Storchaka wrote:


data[0] == b'\xE1'[0] works as expected in both Python 2.7 and 3.x.


I have been porting a lot of python 2 only code to a python2.7 + 3.3 version for 
a few months now. Bytes indexing was a particular problem. PDF uses quite a lot 
of single byte indicators so code like


if text[k] == 'R':
   .

or

dispatch_dict.get(text[k],error)()

is much harder to make compatible because of this issue. I think this change was 
a mistake.
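One way to keep both snippets above working on Python 2.7 and 3.x is to slice instead of index, since a one-byte slice is a bytes object on both versions. The following is only a sketch: the handler functions, the `error` fallback, and the sample `text` are invented for illustration and are not taken from the PDF code being ported.

```python
# Hypothetical handlers; the names are assumptions, not real library code.
def handle_ref():
    return "indirect reference"

def handle_name():
    return "name object"

def error():
    return "unknown marker"

# Keys are one-byte bytes objects, matching what a one-byte slice returns.
dispatch_dict = {b"R": handle_ref, b"/": handle_name}

text = b"12 0 R"   # invented sample data
k = 5

# text[k:k + 1] is b"R" on both 2.7 and 3.x, whereas text[k] is 82 on 3.x.
marker = text[k:k + 1]
is_ref = marker == b"R"
result = dispatch_dict.get(marker, error)()
```

The cost is that every index expression has to be rewritten as a slice, which is exactly the porting friction described above.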


To get round this I have tried the following class to resurrect the old style 
behaviour


if isPy3:
    class RLBytes(bytes):
        '''simply ensures that B[x] returns a bytes type object and not an int'''
        def __getitem__(self,x):
            if isinstance(x,int):
                return RLBytes([bytes.__getitem__(self,x)])
            else:
                return RLBytes(bytes.__getitem__(self,x))

I'm not sure if that covers all possible cases, but it works for my dispatching 
cases. Unfortunately you can't do simple class assignment to change the 
behaviour so you have to copy the text.
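A self-contained Python 3 sketch of how such a wrapper behaves, including one case it does not cover. The class mirrors the one posted above; the sample data is invented for illustration.

```python
class RLBytes(bytes):
    """bytes subclass whose indexing returns a one-byte RLBytes, not an int."""

    def __getitem__(self, x):
        item = bytes.__getitem__(self, x)
        if isinstance(x, int):
            # item is an int here; wrap it in a list to build a 1-byte object
            return RLBytes([item])
        # item is a plain bytes slice here; rewrap it to keep the subclass
        return RLBytes(item)

text = RLBytes(b"12 0 R")      # copying the data is unavoidable
indexed = text[5]              # a one-byte RLBytes, equal to b"R"
sliced = text[0:2]             # still an RLBytes, equal to b"12"

# Iteration does not go through __getitem__, so it still yields ints.
iterated = list(RLBytes(b"ab"))
```

So indexing and slicing are covered, but code that loops over the bytes directly would still see ints under Python 3.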


I find a lot of the "so glad we got rid of byte strings" fervour a bit silly. 
Bytes, chars,  words etc etc were around long before unicode. Byte strings could 
already represent unicode in efficient ways that happened to be useful for 
western languages. Having two string types is inconvenient and error prone, 
swapping their labels and making subtle changes is a real pain.

--
Robin Becker



Re: Bytes indexing returns an int

2014-01-08 Thread wxjmfauth
On Wednesday, 8 January 2014 12:05:49 UTC+1, Robin Becker wrote:
 On 07/01/2014 19:48, Serhiy Storchaka wrote:

  data[0] == b'\xE1'[0] works as expected in both Python 2.7 and 3.x.

 I have been porting a lot of python 2 only code to a python2.7 + 3.3 version
 for a few months now. Bytes indexing was a particular problem. PDF uses quite
 a lot of single byte indicators so code like

 if text[k] == 'R':
    .

 or

 dispatch_dict.get(text[k],error)()

 is much harder to make compatible because of this issue. I think this change
 was a mistake.

 To get round this I have tried the following class to resurrect the old style
 behaviour

 if isPy3:
     class RLBytes(bytes):
         '''simply ensures that B[x] returns a bytes type object and not an int'''
         def __getitem__(self,x):
             if isinstance(x,int):
                 return RLBytes([bytes.__getitem__(self,x)])
             else:
                 return RLBytes(bytes.__getitem__(self,x))

 I'm not sure if that covers all possible cases, but it works for my
 dispatching cases. Unfortunately you can't do simple class assignment to
 change the behaviour so you have to copy the text.

 I find a lot of the "so glad we got rid of byte strings" fervour a bit silly.
 Bytes, chars, words etc etc were around long before unicode. Byte strings
 could already represent unicode in efficient ways that happened to be useful
 for western languages. Having two string types is inconvenient and error
 prone, swapping their labels and making subtle changes is a real pain.

Byte strings (encoded code points) or native unicode is one
thing.

But on the other side, the problem is elsewhere. These very 
talented ascii narrow minded, unicode illiterate devs only
succeeded to produce this (I, really, do not wish to be rude).

>>> import sys, timeit, unicodedata
>>> unicodedata.name('ǟ')
'LATIN SMALL LETTER A WITH DIAERESIS AND MACRON'
>>> sys.getsizeof('a')
26
>>> sys.getsizeof('ǟ')
40
>>> timeit.timeit("unicodedata.normalize('NFKD', 'ǟ')", "import unicodedata")
0.804001575129
>>> timeit.timeit("unicodedata.normalize('NFKD', 'zzz')", "import unicodedata")
0.3073749330963995
>>> timeit.timeit("unicodedata.normalize('NFKD', 'z')", "import unicodedata")
0.2874013282653962
>>>
>>> timeit.timeit("len(unicodedata.normalize('NFKD', 'zzz'))", "import unicodedata")
0.3803570633857589
>>> timeit.timeit("len(unicodedata.normalize('NFKD', 'ǟ'))", "import unicodedata")
0.9359970320201683

pdf, typography, linguistic, scripts, ... in mind, in other word the real
*unicode* world.

jmf



Re: Bytes indexing returns an int

2014-01-08 Thread Ned Batchelder

On 1/8/14 11:08 AM, wxjmfa...@gmail.com wrote:

Byte strings (encoded code points) or native unicode is one
thing.

But on the other side, the problem is elsewhere. These very
talented ascii narrow minded, unicode illiterate devs only
succeeded to produce this (I, really, do not wish to be rude).


If you don't want to be rude, you are failing.  You've been told a 
number of times that your obscure micro-benchmarks are meaningless.  Now 
you've taken to calling the core devs narrow-minded and Unicode 
illiterate.  They are neither of these things.


Continuing to post these comments with no interest in learning is rude. 
Other recent threads have contained detailed rebuttals of your views, 
which you have ignored.  This is rude. Please stop.


--Ned.








--
Ned Batchelder, http://nedbatchelder.com



Re: Bytes indexing returns an int

2014-01-08 Thread Michael Torrie
On 01/08/2014 09:08 AM, wxjmfa...@gmail.com wrote:
 Byte strings (encoded code points) or native unicode is one
 thing.

Byte strings are not necessarily encoded code points.  Most byte
streams I work with are definitely not unicode! They are in fact things
such as BER-encoded ASN.1 data structures.  Or PDF data streams.  Or
Gzip data streams.  This issue in this thread has nothing to do with
unicode.



Re: Bytes indexing returns an int

2014-01-07 Thread Ervin Hegedüs
hi,

On Tue, Jan 07, 2014 at 10:13:29PM +1100, Steven D'Aprano wrote:
 Does anyone know what the rationale behind making byte-string indexing
 return an int rather than a byte-string of length one?
 
 That is, given b = b'xyz', b[1] returns 121 rather than b'y'.
 
 This is especially surprising when one considers that it's easy to extract
 the ordinal value of a byte:
 
 ord(b'y') => 121

Which Python version?

http://docs.python.org/2/reference/lexical_analysis.html#strings
A prefix of 'b' or 'B' is ignored in Python 2;

If you want to store the string literal as a byte array, you have
to use the bytearray() function:

>>> a = bytearray('xyz')
>>> a
bytearray(b'xyz')
>>> a[0]
120
>>> a[1]
121


http://docs.python.org/2/library/stdtypes.html
5.6. Sequence Types


hth,


a.



Re: Bytes indexing returns an int

2014-01-07 Thread Steven D'Aprano
Ervin Hegedüs wrote:

 hi,
 
 On Tue, Jan 07, 2014 at 10:13:29PM +1100, Steven D'Aprano wrote:
 Does anyone know what the rationale behind making byte-string indexing
 return an int rather than a byte-string of length one?
 
 That is, given b = b'xyz', b[1] returns 121 rather than b'y'.
 
 This is especially surprising when one considers that it's easy to
 extract the ordinal value of a byte:
 
 ord(b'y') => 121
 
 Which Python version?

My apologies... I've been so taken up with various threads on this list
discussing Python 3, I forgot to mention that I'm talking about Python 3.

I understand the behaviour of bytes and bytearray, I'm asking *why* that
specific behaviour was chosen.



-- 
Steven



Re: Bytes indexing returns an int

2014-01-07 Thread Terry Reedy

On 1/7/2014 6:13 AM, Steven D'Aprano wrote:

Does anyone know what the rationale behind making byte-string indexing
return an int rather than a byte-string of length one?

That is, given b = b'xyz', b[1] returns 121 rather than b'y'.


The former is the normal behavior of sequences; the latter is peculiar 
to strings, because there is no separate character class. A byte is a 
count n, 0 <= n < 256, and bytes and bytearrays are sequences of bytes. 
It was ultimately Guido's decision after some discussion and debate on, 
I believe, the py3k list. I do not remember enough to be any more specific.
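The "sequence of small integers" view can be checked directly in Python 3. This is a short illustrative sketch; the variable names are arbitrary and not from the thread.

```python
b = b"xyz"

# Each element is an int in range(256); iterating or indexing exposes them.
values = list(b)                  # [120, 121, 122]
second = b[1]                     # 121, an int

# A bytes object can be built back from those ints.
rebuilt = bytes([120, 121, 122])

# Values outside 0 <= n < 256 are rejected at construction time.
out_of_range_rejected = False
try:
    bytes([256])
except ValueError:
    out_of_range_rejected = True
```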


--
Terry Jan Reedy



Re: Bytes indexing returns an int

2014-01-07 Thread David Robinow
treating bytes as chars considered harmful?
 I don't know the answer to your question but the behavior seems right to me.
Python 3 grudgingly allows the abomination of byte strings (is that
what they're called? I haven't fully embraced Python3 yet). If you
want a substring you use a slice.
   b = b'xyz'
   b[1:2] => b'y'

also, chr(121) => 'y'   which is really what the Python 3 gods prefer.

On Tue, Jan 7, 2014 at 6:13 AM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 Does anyone know what the rationale behind making byte-string indexing
 return an int rather than a byte-string of length one?

 That is, given b = b'xyz', b[1] returns 121 rather than b'y'.

 This is especially surprising when one considers that it's easy to extract
 the ordinal value of a byte:

 ord(b'y') => 121



 --
 Steven



Re: Bytes indexing returns an int

2014-01-07 Thread David Robinow
Sorry for top-posting. I thought I'd mastered gmail.


Re: Bytes indexing returns an int

2014-01-07 Thread Steven D'Aprano
David Robinow wrote:

 treating bytes as chars considered harmful?

Who is talking about treating bytes as chars? You're making assumptions that
aren't justified by my question.


  I don't know the answer to your question but the behavior seems right to
  me.

This issue was raised in an earlier discussion about *binary data* in Python
3. (The earlier discussion also involved some ASCII-encoded text, but
that's actually irrelevant to the issue.) In Python 2.7, if you have a
chunk of binary data, you can easily do this:

data = b'\xE1\xE2\xE3\xE4'
data[0] == b'\xE1'

and it returns True just as expected. It even works if that binary data
happens to look like ASCII text:

data = b'\xE1a\xE2\xE3\xE4'
data[1] == b'a'

But in Python 3, the same code silently returns False in both cases, because
indexing a bytes object gives an int. So you have to write something like
these, all of which are ugly or inelegant:

data = b'\xE1a\xE2\xE3\xE4'
data[1] == 0x61
data[1] == ord(b'a')
chr(data[1]) == 'a'
data[1:2] == b'a'


I believe that only the last one, the one with the slice, works in both
Python 2.7 and Python 3.x.
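A small sketch that puts the portable and non-portable comparisons side by side. The variable names are only for illustration; the indexing-both-sides form is the one suggested elsewhere in this thread.

```python
data = b'\xE1a\xE2\xE3\xE4'

# Slicing gives a one-byte bytes object on both 2.7 and 3.x.
portable_slice = data[1:2] == b'a'

# Indexing both sides compares str with str on 2.7 and int with int on 3.x.
portable_index = data[1] == b'a'[0]

# Indexing only one side is the trap: True on 2.7, silently False on 3.x.
naive = data[1] == b'a'
```

Under Python 3 the first two results are True and the last is False; under Python 2.7 all three are True.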


 Python 3 grudgingly allows the abomination of byte strings (is that
 what they're called? I haven't fully embraced Python3 yet).

They're not abominations. They exist for processing bytes (hence the name)
and other binary data. They are necessary for low-level protocols, for
dealing with email, web, files, and similar. Application code may not need
to deal with bytes, but that is only because the libraries you call do the
hard work for you.

People trying to port these libraries from 2.7 to 3 run into this problem,
and it causes them grief. This little difference between bytes in 2.7 and
bytes in 3.x is a point of friction which makes porting harder, and I'm
trying to understand the reason for it.


-- 
Steven



Re: Bytes indexing returns an int

2014-01-07 Thread Ethan Furman

On 01/07/2014 07:19 AM, David Robinow wrote:


Python 3 grudgingly allows the abomination of byte strings (is that
what they're called?)


No, that is *not* what they're called.  If you find any place in the Python3 docs that does call them bytestrings please 
submit a bug report.



On 01/07/2014 08:12 AM, Steven D'Aprano wrote:

People trying to port these libraries from 2.7 to 3 run into this problem,
and it causes them grief. This little difference between bytes in 2.7 and
bytes in 3.x is a point of friction which makes porting harder, and I'm
trying to understand the reason for it.


If I recall correctly the way it was explained to me:

bytes (lists, arrays, etc.) is a container, and when a container is indexed you get whatever the container held.  If you 
slice the container you get a smaller container with the appropriate items.


bytes (and bytearrays) are containers of ints, so indexing returns an int.  One big problem with this whole scenario is 
that bytes then lies about what it contains.  (And I hate lies! [1])
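The container argument can be illustrated by comparing what indexing and slicing return for several built-in sequence types in Python 3. This is just a sketch; `rows` and the sample values are made up for the comparison.

```python
samples = [[1, 2, 3], (1, 2, 3), b"xyz", bytearray(b"xyz"), "xyz"]

# For each sequence record: its type, the type of one item, the type of a slice.
rows = [
    (type(seq).__name__, type(seq[1]).__name__, type(seq[1:2]).__name__)
    for seq in samples
]

for row in rows:
    print("%-10s item: %-5s slice: %s" % row)
```

Only str returns its own type from indexing; list, tuple, bytes and bytearray all hand back the contained item, while a slice always has the container's type.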


Anyway, I believe that's the rationale behind the change.

--
~Ethan~

[1] http://www.quickmeme.com/meme/3ts325


Re: Bytes indexing returns an int

2014-01-07 Thread Serhiy Storchaka

07.01.14 18:12, Steven D'Aprano wrote:

In Python 2.7, if you have a
chunk of binary data, you can easily do this:

data = b'\xE1\xE2\xE3\xE4'
data[0] == b'\xE1'

and it returns True just as expected.


data[0] == b'\xE1'[0] works as expected in both Python 2.7 and 3.x.




Re: Bytes indexing returns an int

2014-01-07 Thread Steven D'Aprano
Ethan Furman wrote:

 On 01/07/2014 07:19 AM, David Robinow wrote:

 Python 3 grudgingly allows the abomination of byte strings (is that
 what they're called?)
 
 No, that is *not* what they're called.  If you find any place in the
 Python3 docs that does call them bytestrings please submit a bug report.

The name of the class is bytes, but what they represent *is* a string of
bytes, hence byte-string. It's a standard computer science term for
distinguishing strings of text from strings of bytes.


 On 01/07/2014 08:12 AM, Steven D'Aprano wrote:
 People trying to port these libraries from 2.7 to 3 run into this
 problem, and it causes them grief. This little difference between bytes
 in 2.7 and bytes in 3.x is a point of friction which makes porting
 harder, and I'm trying to understand the reason for it.
 
 If I recall correctly the way it was explained to me:
 
 bytes (lists, arrays, etc.) is a container, and when a container is
 indexed you get whatever the container held.  If you slice the container
 you get a smaller container with the appropriate items.

(There's also a bytearray type, which is best considered as an array. Hence
the name.) Why decide that the bytes type is best considered as a list of
bytes rather than a string of bytes? It doesn't have any list methods, it
looks like a string and people use it as a string. As you have discovered,
it is an inconvenient annoyance that indexing returns an int instead of a
one-byte byte-string.

I think that, in hindsight, this was a major screw-up in Python 3.



 bytes (and bytearrays) are containers of ints, so indexing returns an int.
 One big problem with this whole scenario is
 that bytes then lies about what it contains.  (And I hate lies! [1])
 
 Anyway, I believe that's the rationale behind the change.
 
 --
 ~Ethan~
 
 [1] http://www.quickmeme.com/meme/3ts325

-- 
Steven



Re: Bytes indexing returns an int

2014-01-07 Thread Chris Angelico
On Wed, Jan 8, 2014 at 11:15 AM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 Why decide that the bytes type is best considered as a list of
 bytes rather than a string of bytes? It doesn't have any list methods, it
 looks like a string and people use it as a string. As you have discovered,
 it is an inconvenient annoyance that indexing returns an int instead of a
 one-byte byte-string.

 I think that, in hindsight, this was a major screw-up in Python 3.

Which part was? The fact that it can be represented with a (prefixed)
quoted string?

bytes_value = (0x41, 0x42, 0x43, 0x44)
string = bytes_value.decode()  # ABCD

I think it's more convenient to let people use a notation similar to
what was used in Py2, but perhaps this is an attractive nuisance, if
it gives rise to issues like this. If a bytes were more like a tuple
of ints (not a list - immutability is closer) than it is like a
string, would that be clearer?

Perhaps the solution isn't even a code one, but a mental one. "A bytes
is like a tuple of ints" might be a useful mantra.
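The mantra can be made concrete in real Python 3, where the tuple has to be passed to bytes() before it can be decoded. A minimal sketch with arbitrary variable names:

```python
as_tuple = (0x41, 0x42, 0x43, 0x44)

# A tuple of small ints converts to bytes and back without loss.
as_bytes = bytes(as_tuple)
round_trip = tuple(as_bytes)

# Decoding is what turns the ints into text.
text = as_bytes.decode("ascii")

# Like a tuple, a bytes object is immutable.
is_immutable = False
try:
    as_bytes[0] = 0x61
except TypeError:
    is_immutable = True
```

Here `as_bytes` equals b'ABCD', `text` is 'ABCD', and `as_bytes[0]` is the same int as `as_tuple[0]`.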

ChrisA


Re: Bytes indexing returns an int

2014-01-07 Thread Ethan Furman

On 01/07/2014 04:15 PM, Steven D'Aprano wrote:

Ethan Furman wrote:

On 01/07/2014 07:19 AM, David Robinow wrote:


Python 3 grudgingly allows the abomination of byte strings (is that
what they're called?)


No, that is *not* what they're called.  If you find any place in the
Python3 docs that does call them bytestrings please submit a bug report.


The name of the class is bytes, but what they represent *is* a string of
bytes, hence byte-string. It's a standard computer science term for
distinguishing strings of text from strings of bytes.


I do not disagree with your statements, yet calling the bytes type a bytestring suggests things which are not true, such 
as indexing returning another bytestring.  The core devs have thus agreed not to call it that in the documentation, in 
the hope of lessening any confusion.




On 01/07/2014 08:12 AM, Steven D'Aprano wrote:

People trying to port these libraries from 2.7 to 3 run into this
problem, and it causes them grief. This little difference between bytes
in 2.7 and bytes in 3.x is a point of friction which makes porting
harder, and I'm trying to understand the reason for it.


If I recall correctly the way it was explained to me:

bytes (lists, arrays, etc.) is a container, and when a container is
indexed you get whatever the container held.  If you slice the container
you get a smaller container with the appropriate items.


(There's also a bytearray type, which is best considered as an array. Hence
the name.) Why decide that the bytes type is best considered as a list of
bytes rather than a string of bytes? It doesn't have any list methods, it
looks like a string and people use it as a string. As you have discovered,
it is an inconvenient annoyance that indexing returns an int instead of a
one-byte byte-string.

I think that, in hindsight, this was a major screw-up in Python 3.


The general consensus seems to be agreement (more or less) with that feeling, 
but alas it is too late to change it now.

--
~Ethan~


Re: Bytes indexing returns an int

2014-01-07 Thread Grant Edwards
On 2014-01-08, Chris Angelico ros...@gmail.com wrote:
 On Wed, Jan 8, 2014 at 11:15 AM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 Why decide that the bytes type is best considered as a list of
 bytes rather than a string of bytes? It doesn't have any list methods, it
 looks like a string and people use it as a string. As you have discovered,
 it is an inconvenient annoyance that indexing returns an int instead of a
 one-byte byte-string.

 I think that, in hindsight, this was a major screw-up in Python 3.

 Which part was?

The fact that b'ASDF'[0] in Python2 yields something different than it
does in Python3 -- one yields b'A' and the other yields 0x41.  It
makes portable code a lot harder to write.  I don't really have any
preference for one over the other, but changing it for no apparent
reason was a horrible idea.

-- 
Grant


Re: Bytes indexing returns an int

2014-01-07 Thread Chris Angelico
On Wed, Jan 8, 2014 at 1:34 PM, Grant Edwards invalid@invalid.invalid wrote:
 On 2014-01-08, Chris Angelico ros...@gmail.com wrote:
 I think that, in hindsight, this was a major screw-up in Python 3.

 Which part was?

 The fact that b'ASDF'[0] in Python2 yields something different than it
 does in Python3 -- one yields b'A' and the other yields 0x41.

Fair enough. Either can be justified, changing is awkward.

ChrisA