On 1/13/2014 4:54 AM, wxjmfa...@gmail.com wrote:

I'm afraid I'm understanding Python (on this
aspect very well).

Really?

Do you belong to this group of people who are naively
writing wrong Python code (usually not properly working)
during more than a decade?

To me, the important question is whether this and previous similar posts are intentional trolls designed to stir up the flurry of responses they get or 'innocently' misleading or even erroneous. If your claim of understanding Python and Unicode is true, then this must be a troll post. Either way, please desist, or your access to python-list from google-groups may be removed.

'ß' is the fourth character in that text "Straße"
(base index 0).

As others have said, in the *unicode* text "Straße", 'ß' is the fifth character, at character index 4, ...

These assertions are correct (byte string and unicode).

whereas, when the text is encoded into bytes, the byte index depends on the encoding and the assertion that it is always 4 is incorrect. Did you know this or were you truly ignorant?

sys.version
'2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]'
assert 'Straße'[4] == 'ß'

Sometimes true, sometimes not.
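
To illustrate why the byte-string assertion only sometimes holds, here is a minimal Python 3 sketch that mimics the Py2 check by encoding the text explicitly. The helper name byte_4_is_eszett and the particular encodings tried are my own choices for illustration, not anything from the original post.

```python
# Python 3 sketch: mimic the Py2 byte-string check for several encodings.
text = "Straße"

def byte_4_is_eszett(encoding):
    """Return True if the single byte at index 4 is the whole encoded 'ß'."""
    data = text.encode(encoding)
    return data[4:5] == "ß".encode(encoding)

# The encodings listed here are illustrative choices (an assumption).
results = {enc: byte_4_is_eszett(enc)
           for enc in ("latin-1", "cp1252", "utf-8", "utf-16-le")}

for enc, ok in results.items():
    print(enc, ok)
```

With single-byte encodings such as latin-1 the check passes; with utf-8, 'ß' occupies two bytes, so the one byte at index 4 is only its first half and the comparison fails.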

assert u'Straße'[4] == u'ß'

PS Nothing to do with Py2/Py3.

This issue has everything to do with Py2, where 'Straße' is encoded bytes, versus Py3, where 'Straße' is unicode text in which each character of that word takes one code unit, whether that code unit is 2 bytes or 4 bytes.

If you replace 'ß' with any astral (non-BMP) character, this issue appears even for unicode text in 3.2 and earlier, where an astral character requires 2, not 1, code units on narrow builds, thereby screwing up indexing, just as can happen for encoded bytes. In 3.3+, all characters use 1 code unit and indexing (and slicing) always works properly. This is another unicode issue that you appear not to understand, but might just be trolling.
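
A rough sketch of the astral case, assuming Python 3.3+ and using UTF-16 code units to stand in for what a narrow build would have stored. The example character U+1F600 and the variable names are assumptions for illustration only.

```python
# Python 3.3+ sketch: character indexing versus UTF-16 code units.
astral = "\U0001F600"            # an arbitrary non-BMP character (assumption)
text = "Stra" + astral + "e"

# On 3.3+, every character is one code unit, so indexing is by character.
char_count = len(text)
fifth_char = text[4]

# UTF-16 code units approximate what a narrow build stored internally.
utf16 = text.encode("utf-16-le")
unit_count = len(utf16) // 2
# The code unit at index 4 is only the high surrogate, not the character.
unit_at_4 = int.from_bytes(utf16[8:10], "little")

print(char_count, unit_count, hex(unit_at_4))
```

Here the text has 6 characters but 7 UTF-16 code units, and the unit at index 4 falls in the surrogate range, which is the kind of indexing breakage described above for narrow builds.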

--
Terry Jan Reedy



--
https://mail.python.org/mailman/listinfo/python-list
