Re: How do I display unicode value stored in a string variable using ord()

2012-08-22 Thread Hans Mulder
On 19/08/12 19:48:06, Paul Rubin wrote:
> Terry Reedy tjre...@udel.edu writes:
>> py> s = chr(0xFFFF + 1)
>> py> a, b = s
> That looks like a 3.2- narrow build. Such builds treat unicode strings
> as sequences of code units rather than sequences of codepoints. Not an
> implementation bug, but a compromise design that goes back about a
> decade to when unicode was added to Python.

Actually, this compromise design was new in 3.0.

In 2.x, unicode strings were sequences of code points.
Narrow builds rejected any code points > 0xFFFF:

Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> s = unichr(0xFFFF + 1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: unichr() arg not in range(0x10000) (narrow Python build)


-- HansM
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-21 Thread Neil Hodgson

Steven D'Aprano:


Using variable-sized strings like UTF-8 and UTF-16 for in-memory
representations is a terrible idea because you can't assume that people
will only ever want to index the first or last character. On average,
you need to scan half the string, one character at a time. In Big-Oh, we
can ignore the factor of 1/2 and just say we scan the string, O(N).


   In the majority of cases you can remove excessive scanning by 
caching the most recent index-offset result. If the next index request 
is nearer to the cached index than to the beginning, then iterate from 
that offset. This converts many operations from quadratic to linear. 
Locality of reference is common and can often be reasonably exploited, 
as in the sketch below.
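
A rough sketch of the caching idea (illustrative only; it assumes the 
underlying storage is a UTF-8 byte buffer and omits bounds checking):

class Utf8Buffer:
    """Map code-point index -> byte offset, caching the last result."""

    def __init__(self, data: bytes):
        self.data = data
        self.last_index = 0    # code-point index of the cached position
        self.last_offset = 0   # byte offset of the cached position

    def offset(self, index: int) -> int:
        # Start from the cached position when it is nearer than the start.
        if abs(index - self.last_index) < index:
            i, ofs = self.last_index, self.last_offset
        else:
            i, ofs = 0, 0
        step = 1 if index >= i else -1
        while i != index:
            ofs += step
            # Continuation bytes look like 10xxxxxx; any other byte
            # begins a code point.
            if self.data[ofs] & 0xC0 != 0x80:
                i += step
        self.last_index, self.last_offset = index, ofs
        return ofs

buf = Utf8Buffer("abcœ…def".encode("utf-8"))
print(buf.offset(4))  # 5: 'œ' takes two bytes, so '…' starts at byte 5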


   However, exposing the variable length nature of UTF-8 allows the 
application to choose efficient techniques for more cases.


   Neil
--
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-20 Thread Steven D'Aprano
On Mon, 20 Aug 2012 00:44:22 -0400, Roy Smith wrote:

 In article 5031bb2f$0$29972$c3e8da3$54964...@news.astraweb.com,
  Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote:
 
  So it may be with utf-8 someday.
 
 Only if you believe that people's ability to generate data will remain
 lower than people's ability to install more storage.
 
 We're not talking *data*, we're talking *text*.  Most of those
 whatever-bytes people are generating are images, video, and music.  Text
 is a pittance compared to those.

Paul Rubin already told you about his experience using OCR to generate 
multiple terabytes of text, and how he would not be happy if that was 
stored in UCS-4.

HTML is text. XML is text. SVG is text. Source code is text. Email is 
text. (Well, it's actually bytes, but it looks like ASCII text.) Log 
files are text, and they can fill a hard drive pretty quickly. Lots of 
data is text.

Pittance or not, I do not believe that people will widely abandon compact 
storage formats like UTF-8 and Latin-1 for UCS-4 any time soon. Given 
that we're still trying to convince people to use UTF-8 over ASCII, I 
reckon it will be at least 40 years before there's even a slim chance of 
migrating from UTF-8 to UCS-4 in a widespread manner. In the IT world, 
that's close enough to never -- we might not even be using Unicode in 
2052.

In any case, time will tell who is right.



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-20 Thread rusi
On Aug 19, 11:11 pm, wxjmfa...@gmail.com wrote:
On Sunday 19 August 2012 19:48:06 UTC+2, Paul Rubin wrote:

  But they are not ascii pages, they are (as stated) MOSTLY ascii.
  E.g. the characters are 99% ascii but 1% non-ascii, so 393 chooses
  a much more memory-expensive encoding than UTF-8.

 Well, it seems some software producers know what they
 are doing.

 >>> '€'.encode('cp1252')
 b'\x80'
 >>> '€'.encode('mac-roman')
 b'\xdb'
 >>> '€'.encode('iso-8859-1')
 Traceback (most recent call last):
   File "<eta last command>", line 1, in <module>
 UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac'
 in position 0: ordinal not in range(256)

<facetious>
You want the Euro sign in iso-8859-1??
I object. I want the rupee sign ( ₹ ):
http://en.wikipedia.org/wiki/Indian_rupee_sign

And while we are at it, why not move it (both?) into ASCII?
</facetious>

The problem(s) are:
1. We don't really understand what you are objecting to.
2. UTF-8, like Huffman coding, is a prefix code (see the snippet below):
http://en.wikipedia.org/wiki/Prefix_code#Prefix_codes_in_use_today
Like Huffman coding, it compresses based on a statistical argument.
3. Unlike Huffman coding, the statistics are very political: is the
Euro more important than Chinese ideograms? It depends on whom you ask.
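
A quick illustration of point 2 (my snippet, not part of the original
exchange): code points that the UTF-8 design treats as common (ASCII)
get one byte, everything else two to four.

# Each character's UTF-8 encoding; no encoding is a prefix of another.
for ch in ("a", "é", "€", "₹"):
    enc = ch.encode("utf-8")
    print(ch, len(enc), enc)
# a 1 b'a'
# é 2 b'\xc3\xa9'
# € 3 b'\xe2\x82\xac'
# ₹ 3 b'\xe2\x82\xb9'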
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-20 Thread Paul Rubin
Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes:
 Paul Rubin already told you about his experience using OCR to generate 
 multiple terabytes of text, and how he would not be happy if that was 
 stored in UCS-4.

That particular text was stored on disk as compressed XML that had UTF-8
in the data fields, but I think Roy is right that it would have
compressed to around the same size in UCS-4.  Converting it to UCS-4 on
input would have bloated up the memory footprint and that was the issue
of concern to me.

 Pittance or not, I do not believe that people will widely abandon compact 
 storage formats like UTF-8 and Latin-1 for UCS-4 any time soon.

Looking at http://www.icu-project.org/ the C++ classes seem to use
UTF-16, sort of like Python 3.2 :(.  I'm not certain of this though.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-20 Thread Piet van Oostrum
Blind Anagram non...@nowhere.com writes:

 This is an average slowdown by a factor of close to 2.3 on 3.3 when
 compared with 3.2.

 I am not posting this to perpetuate this thread but simply to ask
 whether, as you suggest, I should report this as a possible problem with
 the beta?

Being a beta release, is it certain that this release has been compiled
with the same optimization level as 3.2?
-- 
Piet van Oostrum p...@vanoostrum.org
WWW: http://pietvanoostrum.com/
PGP key: [8DAE142BE17999C4]
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Paul Rubin
Chris Angelico ros...@gmail.com writes:
 Generally, I'm working with pure ASCII, but port those same algorithms
 to Python and you'll easily be able to read in a file in some known
 encoding and manipulate it as Unicode.

If it's pure ASCII, you can use the bytes or bytearray type.  

 It's not so much 'random access to the nth character' as an efficient
 way of jumping forward. For instance, if I know that the next thing is
 a literal string of n characters (that I don't care about), I want to
 skip over that and keep parsing.

I don't understand how this is supposed to work.  You're going to read a
large unicode text file (let's say it's UTF-8) into a single big string?
So the runtime library has to scan the encoded contents to find the
highest numbered codepoint (let's say it's mostly ascii but has a few
characters outside the BMP), expand it all (in this case) to UCS-4
giving 4x memory bloat and requiring decoding all the UTF-8 regardless,
and now we should worry about the efficiency of skipping n characters?

Since you have to decode the n characters regardless, I'd think this
skipping part should only be an issue if you have to do it a lot of
times.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Steven D'Aprano
This is a long post. If you don't feel like reading an essay, skip to the 
very bottom and read my last few paragraphs, starting with "To recap".


On Sat, 18 Aug 2012 11:26:21 -0700, Paul Rubin wrote:

 Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes:
 (There is an extension to UCS-2, UTF-16, which encodes non-BMP
 characters using two code points. This is fragile and doesn't work very
 well, because string-handling methods can break the surrogate pairs
 apart, leaving you with invalid unicode string. Not good.)
 ...
 With PEP 393, each Python string will be stored in the most efficient
 format possible:
 
 Can you explain the issue of breaking surrogate pairs apart a little
 more?  Switching between encodings based on the string contents seems
 silly at first glance.  

Forget encodings! We're not talking about encodings. Encodings are used 
for converting text as bytes for transmission over the wire or storage on 
disk. PEP 393 talks about the internal representation of text within 
Python, the C-level data structure.

In 3.2, that data structure depends on a compile-time switch. In a 
narrow build, text is stored using two bytes per character, so the 
string "len" (as in the name of the built-in function) will be stored as 

006c 0065 006e

(or possibly 6c00 6500 6e00, depending on whether your system is 
little-endian or big-endian), plus object-overhead, which I shall ignore.
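
You can see both byte layouts from Python itself by encoding to the two
UTF-16 byte orders (my illustration; the internal buffer is the same
idea, minus any byte order mark):

import binascii
print(binascii.hexlify("len".encode("utf-16-le")))  # b'6c0065006e00'
print(binascii.hexlify("len".encode("utf-16-be")))  # b'006c0065006e'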

Since most identifiers are ASCII, that's already using twice as much 
memory as needed. This standard data structure is called UCS-2, and it 
only handles characters in the Basic Multilingual Plane, the BMP (roughly 
the first 64000 Unicode code points). I'll come back to that.

In a wide build, text is stored as four bytes per character, so "len" 
is stored as either:

0000006c 00000065 0000006e
6c000000 65000000 6e000000

Now memory is cheap, but it's not *that* cheap, and no matter how much 
memory you have, you can always use more.

This system is called UCS-4, and it can handle the entire Unicode 
character set, for now and forever. (If we ever need more than four 
bytes' worth of characters, it won't be called Unicode.)

Remember I said that UCS-2 can only handle the 64K characters 
[technically: code points] in the Basic Multilingual Plane? There's an 
extension to UCS-2 called UTF-16 which extends it to the entire Unicode 
range. Yes, that's the same name as the UTF-16 encoding, because it's 
more or less the same system.

UTF-16 says "let's represent characters in the BMP by two bytes, but 
characters outside the BMP by four bytes". There's a neat trick to this: 
the BMP doesn't use the entire two-byte range, so there are some byte 
pairs which are illegal in UCS-2 -- they don't correspond to *any* 
character. UTF-16 uses those byte pairs to signal "this is half a 
character, you need to look at the next pair for the rest of the 
character".

Nifty hey? These pairs-of-pseudocharacters are called surrogate pairs.
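
The arithmetic behind the trick fits in a few lines (a sketch of the
standard UTF-16 formula, not CPython internals):

def surrogate_pair(cp):
    # Standard UTF-16 mapping for code points 0x10000..0x10FFFF.
    cp -= 0x10000                   # 20 bits remain to encode
    high = 0xD800 + (cp >> 10)      # top 10 bits -> lead surrogate
    low = 0xDC00 + (cp & 0x3FF)     # bottom 10 bits -> trail surrogate
    return high, low

print([hex(u) for u in surrogate_pair(0x10000)])  # ['0xd800', '0xdc00']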

Except this comes at a big cost: you can no longer tell how long a string 
is by counting the number of bytes, which is fast, because sometimes four 
bytes is two characters and sometimes it's one and you can't tell which 
it will be until you actually inspect all four bytes.

Copying sub-strings now becomes either slow, or buggy. Say you want to 
grab the 10th character in a string. The fast way using UCS-2 is to 
simply grab bytes 18 and 19 (remember characters are pairs of bytes and 
we start counting at zero) and you're done. Fast and safe if you're 
willing to give up the non-BMP characters.

It's also fast and safe if you use UCS-4, but then everything takes twice 
as much space, so you end up spending so much time copying null 
bytes that you're probably slower anyway. Especially when your OS starts 
paging memory like mad.

But in UTF-16, indexing can be fast or safe but not both. Maybe bytes 8 
and 9 are half of a surrogate pair, and you've now split the pair and 
ended up with an invalid string. That's what Python 3.2 does, it fails to 
handle surrogate pairs properly:

py> s = chr(0xFFFF + 1)
py> a, b = s
py> a
'\ud800'
py> b
'\udc00'


I've just split a single valid Unicode character into two invalid 
characters. Python 3.2 will (probably) mindlessly process those two non-
characters, and the only sign I have that I did something wrong is that 
my data is now junk.

Since any position can hold half of a surrogate pair, you have to scan 
every pair of bytes in order to index a string, or work out its length, 
or copy a substring. It's not enough to just check if the last pair is a 
surrogate. The counting loop below shows why.
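
In code, counting characters in a buffer of UTF-16 code units looks
something like this (a sketch, not CPython's actual implementation):

def count_code_points(units):
    # Every 16-bit unit starts a character except low (trail) surrogates,
    # which are the second half of a pair -- so every unit must be checked.
    return sum(1 for u in units if not 0xDC00 <= u <= 0xDFFF)

# 'a' plus U+10000 encoded as the surrogate pair D800 DC00:
print(count_code_points([0x0061, 0xD800, 0xDC00]))  # 2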

When you don't, you have bugs like this from Python 3.2:

py> s = "01234" + chr(0xFFFF + 1) + "6789"
py> s[9] == '9'
False
py> s[9], len(s)
('8', 11)

Which is now fixed in Python 3.3.

So variable-width data structures like UTF-8 or UTF-16 are crap for the 
internal representation of strings -- they are either fast or correct but 
cannot be both.

But UCS-2 is sub-optimal, because it can only handle the BMP, and UCS-4 
is 

Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Steven D'Aprano
On Sat, 18 Aug 2012 11:30:19 -0700, wxjmfauth wrote:

  I'm aware of this (and all the blah blah blah you are explaining).
  This is always the same song. Memory.
 
 
 
 Exactly. The reason it is always the same song is because it is an
 important song.
 
 
 No offense here. But this is an *american* answer.

I am not American.

I am not aware that computers outside of the USA, and Australia, have 
unlimited amounts of memory. You must be very lucky.


 The same story as the coding of text files, where utf-8 == ascii and
 the rest of the world doesn't count.

UTF-8 is not ASCII.



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Steven D'Aprano
On Sat, 18 Aug 2012 09:51:37 -0600, Ian Kelly wrote about PEP 393:

 The change does not just benefit ASCII users.  It primarily benefits
 anybody using a wide unicode build with strings mostly containing only
 BMP characters.

Just to be clear:

If you have many strings which are *mostly* BMP, but have one or two non-
BMP characters in *each* string, you will see no benefit.

But if you have many strings which are all BMP, and only a few strings 
containing non-BMP characters, then you will see a big benefit.
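
A quick way to see the difference under 3.3 (my snippet; exact sizes 
vary by platform, it is the ratio that matters):

import sys

pure_bmp = "a" * 1000                    # all one-byte chars under PEP 393
one_astral = "a" * 999 + "\U00010001"    # one non-BMP char: 4 bytes/char

print(sys.getsizeof(pure_bmp))      # roughly 1000 bytes plus fixed overhead
print(sys.getsizeof(one_astral))    # roughly 4000 bytes plus fixed overhead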


 Even for narrow build users, there is the benefit that
 with approximately the same amount of memory usage in most cases, they
 no longer have to worry about non-BMP characters sneaking in and
 breaking their code.

Yes! +1000 on that.


 There is some additional benefit for Latin-1 users, but this has nothing
 to do with Python.  If Python is going to have the option of a 1-byte
 representation (and as long as we have the flexible representation, I
 can see no reason not to), 

The PEP explicitly states that it only uses a 1-byte format for ASCII 
strings, not Latin-1:

ASCII-only Unicode strings will again use only one byte per character

and later:

If the maximum character is less than 128, they use the PyASCIIObject 
structure

and:

The data and utf8 pointers point to the same memory if the string uses 
only ASCII characters (using only Latin-1 is not sufficient).


 then it is going to be Latin-1 by definition,

Certainly not, either in fact or in principle. There are a large number 
of 1-byte encodings, Latin-1 is hardly the only one.


 because that's what 1-byte Unicode (UCS-1, if you will) is.  If you have
 an issue with that, take it up with the designers of Unicode.

The designers of Unicode have never created a standard "1-byte Unicode" 
or "UCS-1", as far as I can determine.

The Unicode standard refers to some multiple million code points, far too 
many to fit in a single byte. There is some historical justification for 
using "Unicode" to mean "UCS-2", but with the standard being extended 
beyond the BMP, that is no longer valid.

See http://www.cl.cam.ac.uk/~mgk25/unicode.html for more details.


I think what you are trying to say is that the Unicode designers 
deliberately matched the Latin-1 standard for Unicode's first 256 code 
points. That's not the same thing though: there is no Unicode standard 
mapping to a single byte format.


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Steven D'Aprano
On Sat, 18 Aug 2012 11:05:07 -0700, wxjmfauth wrote:

 As I understand (I think) the underlying mechanism, I can only say, it is
 not a surprise that it happens.
 
 Imagine an editor. I type an "a", internally the text is saved as ascii,
 then I type an "é", the text can only be saved in at least latin-1. Then
 I enter a "€", the text becomes an internal ucs-4 string. Then remove
 the "€", and so on.

Firstly, that is not what Python does. For starters, € is in the BMP, and 
so is nearly every character you're ever going to use unless you are 
Asian or a historian using some obscure ancient script. NONE of the 
examples you have shown in your emails have included 4-byte characters, 
they have all been ASCII or UCS-2.

You are suffering from a misunderstanding about what is going on and 
misinterpreting what you have seen.


In *both* Python 3.2 and 3.3, both é and € are represented by two bytes. 
That will not change. There is a tiny amount of fixed overhead for 
strings, and that overhead is slightly different between the versions, 
but you'll never notice the difference.

Secondly, how a text editor or word processor chooses to store the text 
that you type is not the same as how Python does it. A text editor is not 
going to be creating a new immutable string after every key press. That 
will be slow slow SLOW. The usual way is to keep a buffer for each 
paragraph, and add and subtract characters from the buffer.


 Intuitively I expect there is some kind of slowdown between all these
 string conversions.

Your intuition is wrong. Strings are not converted from ASCII to UCS-2 to 
UCS-4 on the fly; they are converted once, when the string is created.

The tests we ran earlier, e.g.:

('ab…' * 1000).replace('…', 'œ…')

show the *worst possible case* for the new string handling, because all 
we do is create new strings. First we create a string 'ab…', then we 
create another string 'ab…'*1000, then we create two new strings '…' and 
'œ…', and finally we call replace and create yet another new string.

But in real applications, once you have created a string, you don't just 
immediately create a new one and throw the old one away. You likely do 
work with that string:

steve@runes:~$ python3.2 -m timeit "s = 'abcœ…'*1000; n = len(s); flag = s.startswith(('*', 'a'))"
100000 loops, best of 3: 2.41 usec per loop

steve@runes:~$ python3.3 -m timeit "s = 'abcœ…'*1000; n = len(s); flag = s.startswith(('*', 'a'))"
100000 loops, best of 3: 2.29 usec per loop

Once you start doing *real work* with the strings, the overhead of 
deciding whether they should be stored using 1, 2 or 4 bytes begins to 
fade into the noise.


 When I tested this flexible representation, a few months ago, at the
 first alpha release, this is precisely what I tested: string
 manipulations which are forcing this internal change, and I concluded
 the result is not brilliant. Really, a factor 0.n up to 10.

Like I said, if you really think that there is a significant, repeatable 
slow-down on Windows, report it as a bug.


 Does any body know a way to get the size of the internal string in
 bytes? 

sys.getsizeof(some_string)

steve@runes:~$ python3.2 -c "from sys import getsizeof as size; print(size('abcœ…'*1000))"
10030
steve@runes:~$ python3.3 -c "from sys import getsizeof as size; print(size('abcœ…'*1000))"
10038


As I said, there is a *tiny* overhead difference. But identifiers will 
generally be smaller:

steve@runes:~$ python3.2 -c "from sys import getsizeof as size; print(size(size.__name__))"
48
steve@runes:~$ python3.3 -c "from sys import getsizeof as size; print(size(size.__name__))"
34

You can check the object overhead by looking at the size of the empty 
string.
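
For example (the expected figures here come from the numbers posted 
elsewhere in this thread):

import sys
# 3.3: 49 on a 64-bit build, 25 on a 32-bit build.
print(sys.getsizeof(""))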



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Steven D'Aprano
On Sat, 18 Aug 2012 19:34:50 +0100, MRAB wrote:

 a will be stored as 1 byte/codepoint.
 
 Adding é, it will still be stored as 1 byte/codepoint.

Wrong. It will be 2 bytes, just like it already is in Python 3.2.

I don't know where people are getting this myth that PEP 393 uses Latin-1 
internally, it does not. Read the PEP, it explicitly states that 1-byte 
formats are only used for ASCII strings.


 Adding €, it will still be stored as 2 bytes/codepoint.

That is correct.



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Steven D'Aprano
On Sat, 18 Aug 2012 19:59:32 +0100, MRAB wrote:

 The problem with strings containing surrogate pairs is that you could
 inadvertently slice the string in the middle of the surrogate pair.

That's the *least* of the problems with surrogate pairs. That would be 
easy to fix: check the point of the slice, and back up or forward if 
you're on a surrogate pair. But that's not good enough, because the 
surrogates could be anywhere in the string. You have to touch every 
single character in order to know how many there are.

The problem with surrogate pairs is that they make basic string 
operations O(N) instead of O(1).



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


New internal string format in 3.3, was Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Peter Otten
Steven D'Aprano wrote:

 On Sat, 18 Aug 2012 19:34:50 +0100, MRAB wrote:
 
 a will be stored as 1 byte/codepoint.
 
 Adding é, it will still be stored as 1 byte/codepoint.
 
 Wrong. It will be 2 bytes, just like it already is in Python 3.2.
 
 I don't know where people are getting this myth that PEP 393 uses Latin-1
 internally, it does not. Read the PEP, it explicitly states that 1-byte
 formats are only used for ASCII strings.

From

Python 3.3.0a4+ (default:10a8ad665749, Jun  9 2012, 08:57:51) 
[GCC 4.6.1] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> [sys.getsizeof("é"*i) for i in range(10)]
[49, 74, 75, 76, 77, 78, 79, 80, 81, 82]
>>> [sys.getsizeof("e"*i) for i in range(10)]
[49, 50, 51, 52, 53, 54, 55, 56, 57, 58]
>>> sys.getsizeof("é"*101) - sys.getsizeof("é")
100
>>> sys.getsizeof("e"*101) - sys.getsizeof("e")
100
>>> sys.getsizeof("€"*101) - sys.getsizeof("€")
200

I infer that 

(1) both ASCII and Latin-1 strings require one byte per character.
(2) Latin-1 strings have a constant overhead of 24 bytes (on a 64-bit 
system) over ASCII-only strings.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Steven D'Aprano
On Sat, 18 Aug 2012 19:35:44 -0700, Paul Rubin wrote:

 Scanning 4 characters (or a few dozen, say) to peel off a token in
 parsing a UTF-8 string is no big deal.  It gets more expensive if you
 want to index far more deeply into the string.  I'm asking how often
 that is done in real code.

It happens all the time.

Let's say you've got a bunch of text, and you use a regex to scan through 
it looking for a match. Let's ignore the regular expression engine, since 
it has to look at every character anyway. But you've done your search and 
found your matching text and now want everything *after* it. That's not 
exactly an unusual use-case.

mo = re.search(pattern, text)
if mo:
start, end = mo.span()
result = text[end:]


Easy-peasy, right? But behind the scenes, you have a problem: how does 
Python know where text[end:] starts? With fixed-size characters, that's 
O(1): Python just moves forward end*width bytes into the string. Nice and 
fast.

With variable-sized characters, Python has to start from the beginning 
again, and inspect each byte or pair of bytes. This turns the slice 
operation into O(N) and the combined op (search + slice) into O(N**2), 
and that starts getting *horrible*.

As always, everything is fast for small enough N, but you *really* 
don't want O(N**2) operations when dealing with large amounts of data.

Insisting that the regex functions only ever return offsets to valid 
character boundaries doesn't help you, because the string slice method 
cannot know where the indexes came from.

I suppose you could have a fast slice and a slow slice method, but 
really, that sucks, and besides all that does is pass responsibility for 
tracking character boundaries to the developer instead of the language, 
and you know damn well that they will get it wrong and their code will 
silently do the wrong thing and they'll say that Python sucks and we 
never used to have this problem back in the good old days with ASCII. Boo 
sucks to that.

UCS-4 is an option, since that's fixed-width. But it's also bulky. For 
typical users, you end up wasting memory. That is the complaint driving 
PEP 393 -- memory is cheap, but it's not so cheap that you can afford to 
multiply your string memory by four just in case somebody someday gives 
you a character in one of the supplementary planes.

If you have oodles of memory and small data sets, then UCS-4 is probably 
all you'll ever need. I hear that the club for people who have all the 
memory they'll ever need is holding their annual general meeting in a 
phone-booth this year.

You could say "Screw the full Unicode standard, who needs more than 64K 
different characters anyway?" Well, apart from Asians, and historians, 
and a bunch of other people. If you can control your data and make sure 
no non-BMP characters are used, UCS-2 is fine -- except Python doesn't 
actually use that.

You could do what Python 3.2 narrow builds do: use UTF-16 and leave it up 
to the individual programmer to track character boundaries, and we know 
how well that works. Luckily the supplementary planes are only rarely 
used, and people who need them tend to buy more memory and use wide 
builds. People who only need a few non-BMP characters in a narrow build 
generally just cross their fingers and hope for the best.

You could add a whole lot more heavyweight infrastructure to strings, 
turn them into souped-up ropes-on-steroids. All those extra indexes mean 
that you don't save any memory. Because the objects are so much bigger 
and more complex, your CPU cache goes to the dogs and your code still 
runs slow.

Which leaves us right back where we started, PEP 393.


 Obviously one can concoct hypothetical examples that would suffer.

If you think slicing at arbitrary indexes is a hypothetical example, I 
don't know what to say.



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Paul Rubin
Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes:
 This is a long post. If you don't feel like reading an essay, skip to the 
 very bottom and read my last few paragraphs, starting with To recap.

I'm very flattered that you took the trouble to write that excellent
exposition of different Unicode encodings in response to my post.  I can
only hope some readers will benefit from it.  I regret that I wasn't
more clear about the perspective I posted from, i.e. that I'm already
familiar with how those encodings work.

After reading all of it, I still have the same skepticism on the main
point as before, but I think I see what the issue in contention is, and
some differences in perspective.  First of all, you wrote:

 This standard data structure is called UCS-2 ... There's an extension
 to UCS-2 called UTF-16

My own understanding is UCS-2 simply shouldn't be used any more.
Unicode was historically supposed to be a 16-bit character set, but that
turned out to not be enough, so the supplementary planes were added.
UCS-2 thus became obsolete and UTF-16 superseded it in 1996.  UTF-16 in
turn is rather clumsy and the later UTF-8 is better in a lot of ways,
but both of these are at least capable of encoding all the character
codes.

On to the main issue:

 * Variable-byte formats like UTF-8 and UTF-16 mean that basic string 
 operations are not O(1) but are O(N). That means they are slow, or buggy, 
 pick one.

This I don't see.  What are the basic string operations?

* Examine the first character, or first few characters (few = usually
  bounded by a small constant) such as to parse a token from an input
  stream.  This is O(1) with either encoding.

* Slice off the first N characters.  This is O(N) with either encoding
  if it involves copying the chars.  I guess you could share references
  into the same string, but if the slice reference persists while the
  big reference is released, you end up not freeing the memory until
  later than you really should.

* Concatenate two strings.  O(N) either way.

* Find length of string.  O(1) either way since you'd store it in
  the string header when you build the string in the first place.
  Building the string has to have been an O(N) operation in either
  representation.

And finally:

* Access the nth char in the string for some large random n, or maybe
  get a small slice from some random place in a big string.  This is
  where fixed-width representation is O(1) while variable-width is O(N).

What I'm not convinced of, is that the last thing happens all that
often.

Meanwhile, an example of the 393 approach failing: I was involved in a
project that dealt with terabytes of OCR data of mostly English text.
So the chars were mostly ascii, but there would be occasional non-ascii
chars including supplementary plane characters, either because of
special symbols that were really in the text, or the typical OCR
confusion emitting those symbols due to printing imprecision.  That's a
natural for UTF-8 but the PEP-393 approach would bloat up the memory
requirements by a factor of 4.

py> s = chr(0xFFFF + 1)
py> a, b = s

That looks like Python 3.2 is buggy and that sample should just throw an
error.  s is a one-character string and should not be unpackable.

I realize the folks who designed and implemented PEP 393 are very smart
cookies and considered stuff carefully, while I'm just an internet user
posting an immediate impression of something I hadn't seen before (I
still use Python 2.6), but I still have to ask: if the 393 approach
makes sense, why don't other languages do it?

Ropes of UTF-8 segments seem like the most obvious approach and I
wonder if it was considered.  By that I mean pick some implementation
constant k (say k=128) and represent the string as a UTF-8 encoded byte
array, accompanied by a vector of n//k pointers into the byte array, where
n is the number of codepoints in the string.  Then you can reach any
offset analogously to reading a random byte on a disk, by seeking to the
appropriate block, and then reading the block and getting the char you
want within it.  Random access is then O(1) though the constant is
higher than it would be with fixed width encoding.
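
A rough sketch of what I mean (illustrative only, with k fixed at 128
and no bounds checking):

K = 128  # one index entry per K code points

class Utf8String:
    def __init__(self, s):
        self.data = s.encode("utf-8")
        self.index = []              # byte offset of every K-th code point
        cp = 0
        for ofs in range(len(self.data)):
            if self.data[ofs] & 0xC0 != 0x80:   # not a continuation byte
                if cp % K == 0:
                    self.index.append(ofs)
                cp += 1
        self.length = cp

    def __getitem__(self, n):
        # Seek to the right block, then walk at most K-1 code points.
        ofs = self.index[n // K]
        todo = n % K
        while todo:
            ofs += 1
            if self.data[ofs] & 0xC0 != 0x80:
                todo -= 1
        end = ofs + 1                # find the end of this code point
        while end < len(self.data) and self.data[end] & 0xC0 == 0x80:
            end += 1
        return self.data[ofs:end].decode("utf-8")

print(Utf8String("ab€" * 100)[5])  # '€'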
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Paul Rubin
Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes:
 result = text[end:]

If end is not near the end of the original string, then this is O(N)
even with fixed-width representation, because of the char copying.

If it is near the end, then by knowing where the string data area
ends, I think it should be possible to scan backwards from
the end, recognizing what bytes can be the beginning of code points and
counting off the appropriate number.  This is O(1) if "near the end"
means within a constant.
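
Something like this, say (a sketch of the backwards scan over a UTF-8
buffer; not code from any existing library):

def offset_from_end(data, k):
    # Byte offset of the k-th code point from the end of UTF-8 bytes
    # (k=1 means the last code point).  O(k): constant if k is bounded.
    ofs = len(data)
    seen = 0
    while ofs > 0:
        ofs -= 1
        if data[ofs] & 0xC0 != 0x80:   # a byte that begins a code point
            seen += 1
            if seen == k:
                return ofs
    raise IndexError("fewer than k code points")

data = "héllo".encode("utf-8")
print(data[offset_from_end(data, 2):].decode("utf-8"))  # 'lo'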

 You could say "Screw the full Unicode standard, who needs more than 64K ..."

No; if you're claiming the language supports unicode, it should be
the whole standard.

 You could do what Python 3.2 narrow builds do: use UTF-16 and leave it
 up to the individual programmer to track character boundaries,

I'm surprised the Python 3 implementers even considered that approach
much less went ahead with it.  It's obviously wrong.

 You could add a whole lot more heavyweight infrastructure to strings,
 turn them into souped-up ropes-on-steroids.

I'm not persuaded that PEP 393 isn't even worse.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Chris Angelico
On Sun, Aug 19, 2012 at 6:11 PM, Paul Rubin no.email@nospam.invalid wrote:
 Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes:
 result = text[end:]

 If end is not near the end of the original string, then this is O(N)
 even with fixed-width representation, because of the char copying.

 If it is near the end, by knowing where the string data area
 ends, I think it should be possible to scan backwards from
 the end, recognizing what bytes can be the beginning of code points and
 counting off the appropriate number.  This is O(1) if "near the end"
 means within a constant.

Only if you know exactly where the end is (which requires storing and
maintaining a character length - this may already be happening, I
don't know). But that approach means you need to have code for both
ways (forward search or reverse), and of course it relies on your
encoding being reverse-scannable in this way (as UTF-8 is, but not
all).

And of course, taking the *entire* rest of the string isn't the only
thing you do. What if you want to take the next six characters after
that index? That would be constant time with a fixed-width storage
format.

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Paul Rubin
Chris Angelico ros...@gmail.com writes:
 And of course, taking the *entire* rest of the string isn't the only
 thing you do. What if you want to take the next six characters after
 that index? That would be constant time with a fixed-width storage
 format.

How often is this an issue in practice?

I wonder how other languages deal with this.  The examples I can think
of are poor role models:

1. C/C++ - unicode impaired, other than a wchar type

2. Java - bogus UCS-2-like(?) representation for historical reasons
   Also has some modified UTF-8 for reasons that made no sense and
   that I don't remember

3. Haskell - basic string type is a linked list of code points.
   "hello" is five list nodes.  New Data.Text library (much more
   efficient) uses something like ropes, I think, with UTF-16 underneath.

4. Erlang - I think like Haskell.  Efficiently handles byte blocks.

5. Perl 6 -- ???

6. Ruby - ??? (but probably quite slow like the rest of Ruby)

7. Objective C -- ???

8, 9 ...  (any other important ones?)
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread wxjmfauth
About the examples contested by Steven:

eg: timeit.timeit("('ab…' * 10).replace('…', 'œ…')")


And it is good enough to show the problem. Period. The
rest (you have to do this, you should not do this, why
are you using these characters - amazing and stupid
question -) does not count.

The real problem is elsewhere. *Americans* do not wish
a character occupies 4 bytes in *their* memory. The rest
of the world does not count.

The same thing happens with the utf-8 coding scheme.
Technically, it is fine. But after n years of usage,
one should recognize it just became an ascii2. Especially
for those who understand nothing in that field and are 
not even aware characters are coded. I'm the first 
to think this is legitimate.

Memory or ability to treat all text in the same and equal
way?

End note. This kind of discussion is not specific to
Python; it always happens when there is some kind of
conflict between ascii and non-ascii users.

Have a nice day.

jmf

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: New internal string format in 3.3, was Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Steven D'Aprano
On Sun, 19 Aug 2012 09:43:13 +0200, Peter Otten wrote:

 Steven D'Aprano wrote:

 I don't know where people are getting this myth that PEP 393 uses
 Latin-1 internally, it does not. Read the PEP, it explicitly states
 that 1-byte formats are only used for ASCII strings.
 
 From
 
 Python 3.3.0a4+ (default:10a8ad665749, Jun  9 2012, 08:57:51) [GCC
 4.6.1] on linux
 Type "help", "copyright", "credits" or "license" for more information.
 >>> import sys
 >>> [sys.getsizeof("é"*i) for i in range(10)]
 [49, 74, 75, 76, 77, 78, 79, 80, 81, 82]

Interesting. Say, I don't suppose you're using a 64-bit build? Because 
that would explain why your sizes are so much larger than mine:

py> [sys.getsizeof("é"*i) for i in range(10)]
[25, 38, 39, 40, 41, 42, 43, 44, 45, 46]


py> [sys.getsizeof("€"*i) for i in range(10)]
[25, 40, 42, 44, 46, 48, 50, 52, 54, 56]

py> c = chr(0xFFFF + 1)
py> [sys.getsizeof(c*i) for i in range(10)]
[25, 44, 48, 52, 56, 60, 64, 68, 72, 76]


On re-reading the PEP more closely, it looks like I did misunderstand the 
internal implementation, and strings which fit exactly in Latin-1 will 
also use 1 byte per character. There are three structures used:

PyASCIIObject
PyCompactUnicodeObject
PyUnicodeObject

and the third one comes in three variant forms, for 1-byte, 2-byte and 4-
byte data. So I stand corrected.


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: New internal string format in 3.3, was Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread wxjmfauth
On Sunday 19 August 2012 10:56:36 UTC+2, Steven D'Aprano wrote:
 
 internal implementation, and strings which fit exactly in Latin-1 will 
 

And this is the crucial point: latin-1 is an obsolete and non-usable
coding scheme (esp. for european languages).

We fall back on the point I mentioned above. Microsoft knows this, ditto
for Apple, ditto for TeX, ditto for the foundries.
Even ISO has recognized its error and produced iso-8859-15.

The question? Why is it still used?

jmf



-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread lipska the kat

On 19/08/12 07:09, Steven D'Aprano wrote:

This is a long post. If you don't feel like reading an essay, skip to the
very bottom and read my last few paragraphs, starting with To recap.


Thank you for this excellent post,
it has certainly cleared up a few things for me

[snip]

incidentally

 But in UTF-16, ...

[snip]

 py> s = chr(0xFFFF + 1)
 py> a, b = s
 py> a
 '\ud800'
 py> b
 '\udc00'

in IDLE

Python 3.2.3 (default, May  3 2012, 15:51:42)
[GCC 4.6.3] on linux2
Type "copyright", "credits" or "license()" for more information.
==== No Subprocess ====
>>> s = chr(0xFFFF + 1)
>>> a, b = s
Traceback (most recent call last):
  File "<pyshell#1>", line 1, in <module>
    a, b = s
ValueError: need more than 1 value to unpack

At a terminal prompt

[lipska@ubuntu ~]$ python3.2
Python 3.2.3 (default, Jul 17 2012, 14:23:10)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> s = chr(0xFFFF + 1)
>>> a, b = s
>>> a
'\ud800'
>>> b
'\udc00'


The date stamp is different but the Python version is the same

No idea why this is happening, I just thought it was interesting

lipska

--
Lipska the Kat©: Troll hunter, sandbox destroyer
and farscape dreamer of Aeryn Sun
--
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Chris Angelico
On Sun, Aug 19, 2012 at 8:13 PM, lipska the kat
lipskathe...@yahoo.co.uk wrote:
 The date stamp is different but the Python version is the same

Check out what 'sys.maxunicode' is in each of those Pythons. It's
possible that one is a wide build and the other narrow.

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Mark Lawrence

On 19/08/2012 09:54, wxjmfa...@gmail.com wrote:

About the examples contested by Steven:

eg: timeit.timeit("('ab…' * 10).replace('…', 'œ…')")


And it is good enough to show the problem. Period. The
rest (you have to do this, you should not do this, why
are you using these characters - amazing and stupid
question -) does not count.

The real problem is elsewhere. *Americans* do not wish
a character occupies 4 bytes in *their* memory. The rest
of the world does not count.

The same thing happens with the utf-8 coding scheme.
Technically, it is fine. But after n years of usage,
one should recognize it just became an ascii2. Especially
for those who understand nothing in that field and are
not even aware characters are coded. I'm the first
to think this is legitimate.

Memory or ability to treat all text in the same and equal
way?

End note. This kind of discussion is not specific to
Python; it always happens when there is some kind of
conflict between ascii and non-ascii users.

Have a nice day.

jmf



Roughly translated: "I've been shot to pieces and having seen Monty 
Python and the Holy Grail I know what to do.  Run away, run away".


--
Cheers.

Mark Lawrence.

--
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread lipska the kat

On 19/08/12 11:19, Chris Angelico wrote:

On Sun, Aug 19, 2012 at 8:13 PM, lipska the kat
lipskathe...@yahoo.co.uk  wrote:

The date stamp is different but the Python version is the same


Check out what 'sys.maxunicode' is in each of those Pythons. It's
possible that one is a wide build and the other narrow.


Ah ...

I built my local version from source
and no, I didn't read the makefile so I didn't configure for a wide 
build :-( not that I would have known the difference at that time.


[lipska@ubuntu ~]$ python3.2
Python 3.2.3 (default, Jul 17 2012, 14:23:10)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.maxunicode
65535


Later, I did an apt-get install idle3 which pulled
down a precompiled IDLE from the Ubuntu repos
This was obviously compiled 'wide'

Python 3.2.3 (default, May  3 2012, 15:51:42)
[GCC 4.6.3] on linux2
Type "copyright", "credits" or "license()" for more information.
==== No Subprocess ====
>>> import sys
>>> sys.maxunicode
1114111


All very interesting and enlightening

Thanks

lipska

--
Lipska the Kat©: Troll hunter, sandbox destroyer
and farscape dreamer of Aeryn Sun
--
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Steven D'Aprano
On Sun, 19 Aug 2012 01:11:56 -0700, Paul Rubin wrote:

 Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes:
 result = text[end:]
 
 If end is not near the end of the original string, then this is O(N) even
 with fixed-width representation, because of the char copying.

Technically, yes. But it's a straight copy of a chunk of memory, which 
means it's fast: your OS and hardware tries to make straight memory 
copies as fast as possible. Big-Oh analysis frequently glosses over 
implementation details like that.

Of course, that assumption gets shaky when you start talking about extra 
large blocks, and it falls apart completely when your OS starts paging 
memory to disk.

But if it helps to avoid irrelevant technical details, change it to 
text[end:end+10] or something.


 if it is near the end, by knowing where the string data area ends, I
 think it should be possible to scan backwards from the end, recognizing
 what bytes can be the beginning of code points and counting off the
 appropriate number.  This is O(1) if near the end means within a
 constant.

You know, I think you are misusing Big-Oh analysis here. It really 
wouldn't be helpful for me to say "Bubble Sort is O(1) if you only sort 
lists with a single item". Well, yes, that is absolutely true, but that's 
a special case that doesn't give you any insight into why using Bubble 
Sort as your general purpose sort routine is a terrible idea.

Using variable-sized strings like UTF-8 and UTF-16 for in-memory 
representations is a terrible idea because you can't assume that people 
will only ever want to index the first or last character. On average, 
you need to scan half the string, one character at a time. In Big-Oh, we 
can ignore the factor of 1/2 and just say we scan the string, O(N).

That's why languages tend to use fixed character arrays for strings. 
Haskell is an exception, using linked lists which require traversing the 
string to jump to an index. The manual even warns:

[quote]
If you think of a Text value as an array of Char values (which it is 
not), you run the risk of writing inefficient code.

An idiom that is common in some languages is to find the numeric offset 
of a character or substring, then use that number to split or trim the 
searched string. With a Text value, this approach would require two O(n) 
operations: one to perform the search, and one to operate from wherever 
the search ended. 
[end quote]

http://hackage.haskell.org/packages/archive/text/0.11.2.2/doc/html/Data-Text.html



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Steven D'Aprano
On Sun, 19 Aug 2012 01:04:25 -0700, Paul Rubin wrote:

 Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes:

 This standard data structure is called UCS-2 ... There's an extension
 to UCS-2 called UTF-16
 
 My own understanding is UCS-2 simply shouldn't be used any more. 

Pretty much. But UTF-16 with lax support for surrogates (that is, 
surrogates are included but treated as two characters) is essentially 
UCS-2 with the restriction against surrogates lifted. That's what Python 
currently does, and Javascript.

http://mathiasbynens.be/notes/javascript-encoding

The reality is that support for the Unicode supplementary planes is 
pretty poor. Even when applications support it, most fonts don't have 
glyphs for the characters. Anything which makes handling of Unicode 
supplementary characters better is a step forward.


 * Variable-byte formats like UTF-8 and UTF-16 mean that basic string
 operations are not O(1) but are O(N). That means they are slow, or
 buggy, pick one.
 
 This I don't see.  What are the basic string operations?

The ones I'm specifically referring to are indexing and copying 
substrings. There may be others.


 * Examine the first character, or first few characters (few = usually
   bounded by a small constant) such as to parse a token from an input
   stream.  This is O(1) with either encoding.

That's actually O(K), for K = "a few", whatever "a few" means. But we 
know that anything is fast for small enough N (or K in this case).


 * Slice off the first N characters.  This is O(N) with either encoding
   if it involves copying the chars.  I guess you could share references
   into the same string, but if the slice reference persists while the
   big reference is released, you end up not freeing the memory until
   later than you really should.

As a first approximation, memory copying is assumed to be free, or at 
least constant time. That's not strictly true, but Big Oh analysis is 
looking at algorithmic complexity. It's not a substitute for actual 
benchmarks.


 Meanwhile, an example of the 393 approach failing: I was involved in a
 project that dealt with terabytes of OCR data of mostly English text.

I assume that this wasn't one giant multi-terabyte string.

 So
 the chars were mostly ascii, but there would be occasional non-ascii
 chars including supplementary plane characters, either because of
 special symbols that were really in the text, or the typical OCR
 confusion emitting those symbols due to printing imprecision.  That's a
 natural for UTF-8 but the PEP-393 approach would bloat up the memory
 requirements by a factor of 4.

Not necessarily. Presumably you're scanning each page into a single 
string. Then only the pages containing a supplementary plane char will be 
bloated, which is likely to be rare. Especially since I don't expect your 
OCR application would recognise many non-BMP characters -- what does 
U+110F3, SORA SOMPENG DIGIT THREE, look like? If the OCR software 
doesn't recognise it, you can't get it in your output. (If you do, the 
OCR software has a nasty bug.)

Anyway, in my ignorant opinion the proper fix here is to tell the OCR 
software not to bother trying to recognise Imperial Aramaic, Domino 
Tiles, Phaistos Disc symbols, or Egyptian Hieroglyphs if you aren't 
expecting them in your source material. Not only will the scanning go 
faster, but you'll get fewer wrong characters.


[...]
 I realize the folks who designed and implemented PEP 393 are very smart
 cookies and considered stuff carefully, while I'm just an internet user
 posting an immediate impression of something I hadn't seen before (I
 still use Python 2.6), but I still have to ask: if the 393 approach
 makes sense, why don't other languages do it?

There has to be a first time for everything.


 Ropes of UTF-8 segments seems like the most obvious approach and I
 wonder if it was considered.

Ropes have been considered and rejected because while they are 
asymptotically fast, in common cases the added complexity actually makes 
them slower. Especially for immutable strings where you aren't inserting 
into the middle of a string.

http://mail.python.org/pipermail/python-dev/2000-February/002321.html

PyPy has revisited ropes and uses, or at least used, ropes as their 
native string data structure. But that's ropes of *bytes*, not UTF-8.
 
http://morepypy.blogspot.com.au/2007/11/ropes-branch-merged.html


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread DJC

On 19/08/12 15:25, Steven D'Aprano wrote:


Not necessarily. Presumably you're scanning each page into a single
string. Then only the pages containing a supplementary plane char will be
bloated, which is likely to be rare. Especially since I don't expect your
OCR application would recognise many non-BMP characters -- what does
U+110F3, SORA SOMPENG DIGIT THREE, look like? If the OCR software
doesn't recognise it, you can't get it in your output. (If you do, the
OCR software has a nasty bug.)

Anyway, in my ignorant opinion the proper fix here is to tell the OCR
software not to bother trying to recognise Imperial Aramaic, Domino
Tiles, Phaistos Disc symbols, or Egyptian Hieroglyphs if you aren't
expecting them in your source material. Not only will the scanning go
faster, but you'll get fewer wrong characters.


Consider the automated recognition of a CAPTCHA. As the chars have to be 
entered by the user on a keyboard, only the most basic charset can be 
used, so the problem of which chars are possible is quite limited.

--
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Terry Reedy

On 8/19/2012 4:54 AM, wxjmfa...@gmail.com wrote:

About the examples contested by Steven:
eg: timeit.timeit("('ab…' * 10).replace('…', 'œ…')")
And it is good enough to show the problem. Period.


Repeating a false claim over and over does not make it true. Two people 
on pydev claim that 3.3 is *faster* on their systems (one unspecified, 
one OSX10.8).


--
Terry Jan Reedy


--
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Blind Anagram
Steven D'Aprano  wrote in message 
news:502f8a2a$0$29978$c3e8da3$54964...@news.astraweb.com...


On Sat, 18 Aug 2012 01:09:26 -0700, wxjmfauth wrote:

[...]
If you can consistently replicate a 100% to 1000% slowdown in string
handling, please report it as a performance bug:

http://bugs.python.org/

Don't forget to report your operating system.


For interest, I ran your code snippets on my laptop (Intel core-i7 1.8GHz) 
running Windows 7 x64.


Running Python from a Windows command prompt,  I got the following on Python 
3.2.3 and 3.3 beta 2:


python33\python -m timeit "('abc' * 1000).replace('c', 'de')"
10000 loops, best of 3: 39.3 usec per loop
python33\python -m timeit "('ab…' * 1000).replace('…', '……')"
10000 loops, best of 3: 51.8 usec per loop
python33\python -m timeit "('ab…' * 1000).replace('…', 'x…')"
10000 loops, best of 3: 52 usec per loop
python33\python -m timeit "('ab…' * 1000).replace('…', 'œ…')"
10000 loops, best of 3: 50.3 usec per loop
python33\python -m timeit "('ab…' * 1000).replace('…', '€…')"
10000 loops, best of 3: 51.6 usec per loop
python33\python -m timeit "('XYZ' * 1000).replace('X', 'éç')"
10000 loops, best of 3: 38.3 usec per loop
python33\python -m timeit "('XYZ' * 1000).replace('Y', 'p?')"
10000 loops, best of 3: 50.3 usec per loop

python32\python -m timeit "('abc' * 1000).replace('c', 'de')"
10000 loops, best of 3: 24.5 usec per loop
python32\python -m timeit "('ab…' * 1000).replace('…', '……')"
10000 loops, best of 3: 24.7 usec per loop
python32\python -m timeit "('ab…' * 1000).replace('…', 'x…')"
10000 loops, best of 3: 24.8 usec per loop
python32\python -m timeit "('ab…' * 1000).replace('…', 'œ…')"
10000 loops, best of 3: 24 usec per loop
python32\python -m timeit "('ab…' * 1000).replace('…', '€…')"
10000 loops, best of 3: 24.1 usec per loop
python32\python -m timeit "('XYZ' * 1000).replace('X', 'éç')"
10000 loops, best of 3: 24.4 usec per loop
python32\python -m timeit "('XYZ' * 1000).replace('Y', 'p?')"
10000 loops, best of 3: 24.3 usec per loop

This is an average slowdown by a factor of close to 2.3 on 3.3 when compared 
with 3.2.


I am not posting this to perpetuate this thread but simply to ask whether, 
as you suggest, I should report this as a possible problem with the beta?


--
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Terry Reedy

On 8/19/2012 4:04 AM, Paul Rubin wrote:



Meanwhile, an example of the 393 approach failing:


I am completely baffled by this, as this example is one where the 393 
approach potentially wins.



I was involved in a
project that dealt with terabytes of OCR data of mostly English text.
So the chars were mostly ascii,


3.3 stores ascii pages 1 byte/char rather than 2 or 4.

 but there would be occasional non-ascii

chars including supplementary plane characters, either because of
special symbols that were really in the text, or the typical OCR
confusion emitting those symbols due to printing imprecision.


I doubt that there are really any non-bmp chars. As Steven said, reject 
such false identifications.


 That's a  natural for UTF-8

3.3 would convert to utf-8 for storage on disk.


but the PEP-393 approach would bloat up the memory
requirements by a factor of 4.


3.2- wide builds would *always* use 4 bytes/char. Is not 'occasionally' 
better than 'always'?



 py> s = chr(0xFFFF + 1)
 py> a, b = s

That looks like Python 3.2 is buggy and that sample should just throw an
error.  s is a one-character string and should not be unpackable.


That looks like a 3.2- narrow build. Such builds treat unicode strings as 
sequences of code units rather than sequences of codepoints. Not an 
implementation bug, but a compromise design that goes back about a decade 
to when unicode was added to Python. At that time, there were only a few 
defined non-BMP chars and their usage was extremely rare. There are now 
more extended chars than BMP chars and usage will become more common 
even in English text.


Pre 3.3, there are really 2 sub-versions of every Python version: a 
narrow build and a wide build version, with not very well documented 
different behaviors for any string with extended chars. That is and 
would have become an increasing problem as extended chars are 
increasingly used. If you want to say that what was once a practical 
compromise has become a design bug, I would not argue. In any case, 3.3 
fixes that split and returns Python to being one cross-platform language.



I realize the folks who designed and implemented PEP 393 are very smart
cookies and considered stuff carefully, while I'm just an internet user
posting an immediate impression of something I hadn't seen before (I
still use Python 2.6), but I still have to ask: if the 393 approach
makes sense, why don't other languages do it?


Python has often copied or borrowed, with adjustments. This time it is 
the first. We will see how it goes, but it has been tested for nearly a 
year already.



Ropes of UTF-8 segments seems like the most obvious approach and I
wonder if it was considered.  By that I mean pick some implementation
constant k (say k=128) and represent the string as a UTF-8 encoded byte
array, accompanied by a vector n//k pointers into the byte array, where
n is the number of codepoints in the string.  Then you can reach any
offset analogously to reading a random byte on a disk, by seeking to the
appropriate block, and then reading the block and getting the char you
want within it.  Random access is then O(1) though the constant is
higher than it would be with fixed width encoding.


I would call it O(k), where k is a selectable constant. Slowing access 
by a factor of 100 is hardly acceptable to me. For strings less than k, 
access is O(len). I believe slicing would require re-indexing.
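
For concreteness, here is a rough sketch of that block-index idea 
(hypothetical code, not from any real implementation; K is the selectable 
block size in codepoints):

    K = 128

    class Utf8Blocked:
        def __init__(self, s):
            self.data = s.encode('utf-8')
            self.index = []          # byte offset of every K-th codepoint
            off = 0
            for i, ch in enumerate(s):
                if i % K == 0:
                    self.index.append(off)
                off += len(ch.encode('utf-8'))

        def _char_len(self, off):
            b = self.data[off]       # the UTF-8 lead byte gives the length
            return 1 if b < 0x80 else 2 if b < 0xE0 else 3 if b < 0xF0 else 4

        def __getitem__(self, i):
            off = self.index[i // K]     # O(1) seek to the right block
            for _ in range(i % K):       # then walk at most K-1 chars: O(K)
                off += self._char_len(off)
            return self.data[off:off + self._char_len(off)].decode('utf-8')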


As 393 was near adoption, I proposed a scheme using utf-16 (narrow 
builds) with a supplementary index of extended chars when there are any. 
That makes access O(1) if there are none and O(log(k)), where k is the 
number of extended chars in the string, if there are some.
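
A minimal sketch of that scheme too (equally hypothetical; it assumes 
strings iterate by codepoint, as they do in 3.3):

    from bisect import bisect_left

    class Utf16Indexed:
        def __init__(self, s):
            self.units = s.encode('utf-16-le')   # 2 bytes per code unit
            # sorted codepoint positions of the extended (non-BMP) chars
            self.astral = [i for i, c in enumerate(s) if ord(c) > 0xFFFF]

        def __getitem__(self, i):
            extra = bisect_left(self.astral, i)  # extended chars before i;
            off = 2 * (i + extra)                # each adds one extra unit
            wide = extra < len(self.astral) and self.astral[extra] == i
            return self.units[off:off + (4 if wide else 2)].decode('utf-16-le')

With an empty index the bisect is effectively free, giving the O(1) 
common case; otherwise it costs O(log(k)).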


--
Terry Jan Reedy

--
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread wxjmfauth
Le dimanche 19 août 2012 19:03:34 UTC+2, Blind Anagram a écrit :
 Steven D'Aprano  wrote in message 
 
 news:502f8a2a$0$29978$c3e8da3$54964...@news.astraweb.com...
 
 
 
 On Sat, 18 Aug 2012 01:09:26 -0700, wxjmfauth wrote:
 
 
 
 [...]
 
 If you can consistently replicate a 100% to 1000% slowdown in string
 
 handling, please report it as a performance bug:
 
 
 
 http://bugs.python.org/
 
 
 
 Don't forget to report your operating system.
 
 
 
 
 
 For interest, I ran your code snippets on my laptop (Intel core-i7 1.8GHz) 
 
 running Windows 7 x64.
 
 
 
 Running Python from a Windows command prompt,  I got the following on Python 
 
 3.2.3 and 3.3 beta 2:
 
 
 
 python33\python -m timeit "('abc' * 1000).replace('c', 'de')"
 10000 loops, best of 3: 39.3 usec per loop
 python33\python -m timeit "('ab…' * 1000).replace('…', '……')"
 10000 loops, best of 3: 51.8 usec per loop
 python33\python -m timeit "('ab…' * 1000).replace('…', 'x…')"
 10000 loops, best of 3: 52 usec per loop
 python33\python -m timeit "('ab…' * 1000).replace('…', 'œ…')"
 10000 loops, best of 3: 50.3 usec per loop
 python33\python -m timeit "('ab…' * 1000).replace('…', '€…')"
 10000 loops, best of 3: 51.6 usec per loop
 python33\python -m timeit "('XYZ' * 1000).replace('X', 'éç')"
 10000 loops, best of 3: 38.3 usec per loop
 python33\python -m timeit "('XYZ' * 1000).replace('Y', 'p?')"
 10000 loops, best of 3: 50.3 usec per loop
 
 python32\python -m timeit "('abc' * 1000).replace('c', 'de')"
 10000 loops, best of 3: 24.5 usec per loop
 python32\python -m timeit "('ab…' * 1000).replace('…', '……')"
 10000 loops, best of 3: 24.7 usec per loop
 python32\python -m timeit "('ab…' * 1000).replace('…', 'x…')"
 10000 loops, best of 3: 24.8 usec per loop
 python32\python -m timeit "('ab…' * 1000).replace('…', 'œ…')"
 10000 loops, best of 3: 24 usec per loop
 python32\python -m timeit "('ab…' * 1000).replace('…', '€…')"
 10000 loops, best of 3: 24.1 usec per loop
 python32\python -m timeit "('XYZ' * 1000).replace('X', 'éç')"
 10000 loops, best of 3: 24.4 usec per loop
 python32\python -m timeit "('XYZ' * 1000).replace('Y', 'p?')"
 10000 loops, best of 3: 24.3 usec per loop
 
 
 
 This is an average slowdown by a factor of close to 2.3 on 3.3 when compared 
 
 with 3.2.
 
 
 
 I am not posting this to perpetuate this thread but simply to ask whether, 
 
 as you suggest, I should report this as a possible problem with the beta?

I use Win7 Pro 32-bit on Intel.

Thanks for reporting these numbers.
To be clear: I'm not complaining, but the fact that
there is a slowdown is a clear indication (in my mind)
that there is a point somewhere.

jmf
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Paul Rubin
Terry Reedy tjre...@udel.edu writes:
 Meanwhile, an example of the 393 approach failing:
 I am completely baffled by this, as this example is one where the 393
 approach potentially wins.

What?  The 393 approach is supposed to avoid memory bloat and that
does the opposite.

 I was involved in a project that dealt with terabytes of OCR data of
 mostly English text.  So the chars were mostly ascii,
 3.3 stores ascii pages 1 byte/char rather than 2 or 4.

But they are not ascii pages, they are (as stated) MOSTLY ascii.
E.g. the characters are 99% ascii but 1% non-ascii, so 393 chooses
a much more memory-expensive encoding than UTF-8.

 I doubt that there are really any non-bmp chars.

You may be right about this.  I thought about it some more after
posting and I'm not certain that there were supplemental characters.

 As Steven said, reject such false identifications.

Reject them how?

 That's a  natural for UTF-8
 3.3 would convert to utf-8 for storage on disk.

They are already in utf-8 on disk though that doesn't matter since
they are also compressed.  

 but the PEP-393 approach would bloat up the memory
 requirements by a factor of 4.
 3.2- wide builds would *always* use 4 bytes/char. Is not occasionally
 better than always?

The bloat is in comparison with utf-8, in that example.

 That looks like a 3.2- narrow build. Such which treat unicode strings
 as sequences of code units rather than sequences of codepoints. Not an
 implementation bug, but compromise design that goes back about a
 decade to when unicode was added to Python. 

I thought the whole point of Python 3's disruptive incompatibility with
Python 2 was to clean up past mistakes and compromises, of which unicode
headaches were near the top of the list.  So I'm surprised they seem to
have repeated a mistake there.  

 I would call it O(k), where k is a selectable constant. Slowing access
 by a factor of 100 is hardly acceptable to me. 

If k is constant then O(k) is the same as O(1).  That is how O notation
works.  I wouldn't believe the 100x figure without seeing it measured in
real-world applications.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Ian Kelly
On Sun, Aug 19, 2012 at 12:33 AM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 On Sat, 18 Aug 2012 09:51:37 -0600, Ian Kelly wrote about PEP 393:
 There is some additional benefit for Latin-1 users, but this has nothing
 to do with Python.  If Python is going to have the option of a 1-byte
 representation (and as long as we have the flexible representation, I
 can see no reason not to),

 The PEP explicitly states that it only uses a 1-byte format for ASCII
 strings, not Latin-1:

I think you misunderstand the PEP then, because that is empirically false.

Python 3.3.0b2 (v3.3.0b2:4972a8f1b2aa, Aug 12 2012, 15:23:35) [MSC
v.1600 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getsizeof(bytes(range(256)).decode('latin1'))
329

The constructed string contains all 256 Latin-1 characters, so if
Latin-1 strings must be stored in the 2-byte format, then the size
should be at least 512 bytes.  It is not, so I think it must be using
the 1-byte encoding.


 ASCII-only Unicode strings will again use only one byte per character

This says nothing one way or the other about non-ASCII Latin-1 strings.

 If the maximum character is less than 128, they use the PyASCIIObject
 structure

Note that this only describes the structure of compact string
objects, which I have to admit I do not fully understand from the PEP.
 The wording suggests that it only uses the PyASCIIObject structure,
not the derived structures.  It then says that for compact ASCII
strings the UTF-8 data, the UTF-8 length and the wstr length are the
same as the length of the ASCII data.  But these fields are part of
the PyCompactUnicodeObject structure, not the base PyASCIIObject
structure, so they would not exist if only PyASCIIObject were used.
It would also imply that compact non-ASCII strings are stored
internally as UTF-8, which would be surprising.

 and:

 The data and utf8 pointers point to the same memory if the string uses
 only ASCII characters (using only Latin-1 is not sufficient).

This says that if the data are ASCII, then the 1-byte representation
and the utf8 pointer will share the same memory.  It does not imply
that the 1-byte representation is not used for Latin-1, only that it
cannot also share memory with the utf8 pointer.
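
One way to check that empirically (a quick sketch; the printed sizes 
include per-object headers and vary by platform and build):

    import sys
    ascii_s = 'a' * 256       # pure ASCII
    latin_s = '\xe9' * 256    # é: Latin-1 but not ASCII
    bmp_s   = '\u20ac' * 256  # €: BMP but not Latin-1
    for s in (ascii_s, latin_s, bmp_s):
        print(sys.getsizeof(s))
    # If Latin-1 text used the 2-byte format, latin_s would come out
    # roughly 256 bytes bigger than ascii_s; on 3.3 it differs only by
    # header size.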
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Blind Anagram
wrote in message 
news:5dfd1779-9442-4858-9161-8f1a06d56...@googlegroups.com...


Le dimanche 19 août 2012 19:03:34 UTC+2, Blind Anagram a écrit :

Steven D'Aprano  wrote in message

news:502f8a2a$0$29978$c3e8da3$54964...@news.astraweb.com...



On Sat, 18 Aug 2012 01:09:26 -0700, wxjmfauth wrote:



[...]

If you can consistently replicate a 100% to 1000% slowdown in string

handling, please report it as a performance bug:



http://bugs.python.org/



Don't forget to report your operating system.





For interest, I ran your code snippets on my laptop (Intel core-i7 1.8GHz)

running Windows 7 x64.



Running Python from a Windows command prompt, I got the following on Python
3.2.3 and 3.3 beta 2:

python33\python -m timeit "('abc' * 1000).replace('c', 'de')"
10000 loops, best of 3: 39.3 usec per loop
python33\python -m timeit "('ab…' * 1000).replace('…', '……')"
10000 loops, best of 3: 51.8 usec per loop
python33\python -m timeit "('ab…' * 1000).replace('…', 'x…')"
10000 loops, best of 3: 52 usec per loop
python33\python -m timeit "('ab…' * 1000).replace('…', 'œ…')"
10000 loops, best of 3: 50.3 usec per loop
python33\python -m timeit "('ab…' * 1000).replace('…', '€…')"
10000 loops, best of 3: 51.6 usec per loop
python33\python -m timeit "('XYZ' * 1000).replace('X', 'éç')"
10000 loops, best of 3: 38.3 usec per loop
python33\python -m timeit "('XYZ' * 1000).replace('Y', 'p?')"
10000 loops, best of 3: 50.3 usec per loop

python32\python -m timeit "('abc' * 1000).replace('c', 'de')"
10000 loops, best of 3: 24.5 usec per loop
python32\python -m timeit "('ab…' * 1000).replace('…', '……')"
10000 loops, best of 3: 24.7 usec per loop
python32\python -m timeit "('ab…' * 1000).replace('…', 'x…')"
10000 loops, best of 3: 24.8 usec per loop
python32\python -m timeit "('ab…' * 1000).replace('…', 'œ…')"
10000 loops, best of 3: 24 usec per loop
python32\python -m timeit "('ab…' * 1000).replace('…', '€…')"
10000 loops, best of 3: 24.1 usec per loop
python32\python -m timeit "('XYZ' * 1000).replace('X', 'éç')"
10000 loops, best of 3: 24.4 usec per loop
python32\python -m timeit "('XYZ' * 1000).replace('Y', 'p?')"
10000 loops, best of 3: 24.3 usec per loop



This is an average slowdown by a factor of close to 2.3 on 3.3 when compared
with 3.2.



I am not posting this to perpetuate this thread but simply to ask whether,

as you suggest, I should report this as a possible problem with the beta?


 I use Win7 Pro 32-bit on Intel.
 
 Thanks for reporting these numbers.
 To be clear: I'm not complaining, but the fact that
 there is a slowdown is a clear indication (in my mind)
 that there is a point somewhere.


I may be reading your input wrongly, but it seems to me that you are not 
only reporting a slowdown but you are also suggesting that this slowdown is 
the result of bad design decisions by the Python development team.


I don't want to get involved in the latter part of your argument because I 
am convinced that the Python team are doing their very best to find a good 
compromise between the various design constraints that they face in meeting 
these needs.


Nevertheless, the post that I responded to contained the suggestion that 
slowdowns above 100% (which I took as a factor of 2) would be worth 
reporting as a possible bug.  So I thought that it was worth asking about 
this as I may have misunderstood the level of slowdown that is worth 
reporting.  There is also a potential problem in timings on laptops with 
turbo-boost (as I have), although the times look fairly consistent.


--
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Dave Angel
On 08/19/2012 01:03 PM, Blind Anagram wrote:
 Steven D'Aprano  wrote in message
 news:502f8a2a$0$29978$c3e8da3$54964...@news.astraweb.com...

 On Sat, 18 Aug 2012 01:09:26 -0700, wxjmfauth wrote:

 [...]
 If you can consistently replicate a 100% to 1000% slowdown in string
 handling, please report it as a performance bug:

 http://bugs.python.org/

 Don't forget to report your operating system.

 
 For interest, I ran your code snippets on my laptop (Intel core-i7
 1.8GHz) running Windows 7 x64.

 Running Python from a Windows command prompt,  I got the following on
 Python 3.2.3 and 3.3 beta 2:

 python33\python -m timeit "('abc' * 1000).replace('c', 'de')"
 10000 loops, best of 3: 39.3 usec per loop
 python33\python -m timeit "('ab…' * 1000).replace('…', '……')"
 10000 loops, best of 3: 51.8 usec per loop
 python33\python -m timeit "('ab…' * 1000).replace('…', 'x…')"
 10000 loops, best of 3: 52 usec per loop
 python33\python -m timeit "('ab…' * 1000).replace('…', 'œ…')"
 10000 loops, best of 3: 50.3 usec per loop
 python33\python -m timeit "('ab…' * 1000).replace('…', '€…')"
 10000 loops, best of 3: 51.6 usec per loop
 python33\python -m timeit "('XYZ' * 1000).replace('X', 'éç')"
 10000 loops, best of 3: 38.3 usec per loop
 python33\python -m timeit "('XYZ' * 1000).replace('Y', 'p?')"
 10000 loops, best of 3: 50.3 usec per loop

 python32\python -m timeit "('abc' * 1000).replace('c', 'de')"
 10000 loops, best of 3: 24.5 usec per loop
 python32\python -m timeit "('ab…' * 1000).replace('…', '……')"
 10000 loops, best of 3: 24.7 usec per loop
 python32\python -m timeit "('ab…' * 1000).replace('…', 'x…')"
 10000 loops, best of 3: 24.8 usec per loop
 python32\python -m timeit "('ab…' * 1000).replace('…', 'œ…')"
 10000 loops, best of 3: 24 usec per loop
 python32\python -m timeit "('ab…' * 1000).replace('…', '€…')"
 10000 loops, best of 3: 24.1 usec per loop
 python32\python -m timeit "('XYZ' * 1000).replace('X', 'éç')"
 10000 loops, best of 3: 24.4 usec per loop
 python32\python -m timeit "('XYZ' * 1000).replace('Y', 'p?')"
 10000 loops, best of 3: 24.3 usec per loop

 This is an average slowdown by a factor of close to 2.3 on 3.3 when
 compared with 3.2.


Using your measurement numbers, I get an average of 1.95, not 2.3
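
For the record, that average as a quick computation over the quoted
timings:

    t33 = [39.3, 51.8, 52, 50.3, 51.6, 38.3, 50.3]
    t32 = [24.5, 24.7, 24.8, 24, 24.1, 24.4, 24.3]
    print(sum(new / old for new, old in zip(t33, t32)) / len(t33))  # ~1.95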



-- 

DaveA

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread wxjmfauth
Le dimanche 19 août 2012 19:48:06 UTC+2, Paul Rubin a écrit :
 
 
 But they are not ascii pages, they are (as stated) MOSTLY ascii.
 
 E.g. the characters are 99% ascii but 1% non-ascii, so 393 chooses
 
 a much more memory-expensive encoding than UTF-8.
 
 

Imagine a US banking application, everything in ASCII,
except ... the € currency symbol, code point 0x20ac.

Well, it seems some software producers know what they
are doing.

>>> '€'.encode('cp1252')
b'\x80'
>>> '€'.encode('mac-roman')
b'\xdb'
>>> '€'.encode('iso-8859-1')
Traceback (most recent call last):
  File "<eta last command>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac'
in position 0: ordinal not in range(256)

jmf
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Paul Rubin
Ian Kelly ian.g.ke...@gmail.com writes:
 >>> sys.getsizeof(bytes(range(256)).decode('latin1'))
 329

Please try:

   print (type(bytes(range(256)).decode('latin1')))

to make sure that what comes back is actually a unicode string rather
than a byte string.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Ian Kelly
On Sun, Aug 19, 2012 at 12:20 PM, Paul Rubin no.email@nospam.invalid wrote:
 Ian Kelly ian.g.ke...@gmail.com writes:
 >>> sys.getsizeof(bytes(range(256)).decode('latin1'))
 329

 Please try:

print (type(bytes(range(256)).decode('latin1')))

 to make sure that what comes back is actually a unicode string rather
 than a byte string.

As I understand it, the decode method never returns a byte string in
Python 3, but if you insist:

>>> print (type(bytes(range(256)).decode('latin1')))
<class 'str'>
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Ian Kelly
On Sun, Aug 19, 2012 at 11:50 AM, Ian Kelly ian.g.ke...@gmail.com wrote:
 Note that this only describes the structure of compact string
 objects, which I have to admit I do not fully understand from the PEP.
  The wording suggests that it only uses the PyASCIIObject structure,
 not the derived structures.  It then says that for compact ASCII
 strings the UTF-8 data, the UTF-8 length and the wstr length are the
 same as the length of the ASCII data.  But these fields are part of
 the PyCompactUnicodeObject structure, not the base PyASCIIObject
 structure, so they would not exist if only PyASCIIObject were used.
 It would also imply that compact non-ASCII strings are stored
 internally as UTF-8, which would be surprising.

Oh, now I get it.  I had missed the part where it says that character data
immediately follow the base structure.  And the bit about "the UTF-8
data, the UTF-8 length and the wstr length" is not describing the
contents of those fields, but rather where the data can be alternatively
found, since the fields don't exist.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Mark Lawrence

On 19/08/2012 19:11, wxjmfa...@gmail.com wrote:

Le dimanche 19 août 2012 19:48:06 UTC+2, Paul Rubin a écrit :



But they are not ascii pages, they are (as stated) MOSTLY ascii.

E.g. the characters are 99% ascii but 1% non-ascii, so 393 chooses

a much more memory-expensive encoding than UTF-8.




Imagine a US banking application, everything in ASCII,
except ... the € currency symbol, code point 0x20ac.

Well, it seems some software producers know what they
are doing.

>>> '€'.encode('cp1252')
b'\x80'
>>> '€'.encode('mac-roman')
b'\xdb'
>>> '€'.encode('iso-8859-1')
Traceback (most recent call last):
  File "<eta last command>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac'
in position 0: ordinal not in range(256)

jmf



Well that's it then, the world stock markets will all collapse tonight 
when the news leaks out that those stupid Americans haven't yet realised 
that much of Europe (with at least one very noticeable and sensible 
exception :) uses Euros.  I'd better sell all my stock holdings fast.


--
Cheers.

Mark Lawrence.

--
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Paul Rubin
Ian Kelly ian.g.ke...@gmail.com writes:
>>> print (type(bytes(range(256)).decode('latin1')))
<class 'str'>

Thanks.
-- 
http://mail.python.org/mailman/listinfo/python-list


Abuse of Big Oh notation [was Re: How do I display unicode value stored in a string variable using ord()]

2012-08-19 Thread Steven D'Aprano
On Sun, 19 Aug 2012 10:48:06 -0700, Paul Rubin wrote:

 Terry Reedy tjre...@udel.edu writes:

 I would call it O(k), where k is a selectable constant. Slowing access
 by a factor of 100 is hardly acceptable to me.
 
 If k is constant then O(k) is the same as O(1).  That is how O notation
 works.

You might as well say that if N is constant, O(N**2) is constant too and 
just like magic you have now made Bubble Sort a constant-time sort 
function!

That's not how it works.

Of course *if* k is constant, O(k) is constant too, but k is not 
constant. In context we are talking about string indexing and slicing. 
There is no value of k, say, k = 2, for which you can say "People will 
sometimes ask for string[2] but never ask for string[3]". That is absurd.

Since k can vary from 0 to N-1, we can say that the average string index 
lookup is k = (N-1)//2 which clearly depends on N.


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Steven D'Aprano
On Sun, 19 Aug 2012 11:50:12 -0600, Ian Kelly wrote:

 On Sun, Aug 19, 2012 at 12:33 AM, Steven D'Aprano
 steve+comp.lang.pyt...@pearwood.info wrote:
[...]
 The PEP explicitly states that it only uses a 1-byte format for ASCII
 strings, not Latin-1:
 
 I think you misunderstand the PEP then, because that is empirically
 false.

Yes I did misunderstand. Thank you for the clarification.



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Steven D'Aprano
On Sun, 19 Aug 2012 18:03:34 +0100, Blind Anagram wrote:

 Steven D'Aprano  wrote in message
 news:502f8a2a$0$29978$c3e8da3$54964...@news.astraweb.com...
 
  If you can consistently replicate a 100% to 1000% slowdown in string
  handling, please report it as a performance bug:
  
  http://bugs.python.org/
  
  Don't forget to report your operating system.

[...]

 This is an average slowdown by a factor of close to 2.3 on 3.3 when
 compared with 3.2.
 
 I am not posting this to perpetuate this thread but simply to ask
 whether, as you suggest, I should report this as a possible problem with
 the beta?

Possibly, if it is consistent and non-trivial. Serious performance 
regressions are bugs. Trivial ones, not so much.

Thanks to Terry Reedy, who has already asked the Python Devs about this 
issue, they have made it clear that they aren't hugely interested in 
micro-benchmarks in isolation. If you want the bug report to be taken 
seriously, you would need to run the full Python string benchmark. The 
results of that would be interesting to see.


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Terry Reedy

On 8/19/2012 1:03 PM, Blind Anagram wrote:


Running Python from a Windows command prompt,  I got the following on
Python 3.2.3 and 3.3 beta 2:

python33\python -m timeit "('abc' * 1000).replace('c', 'de')"
10000 loops, best of 3: 39.3 usec per loop
python33\python -m timeit "('ab…' * 1000).replace('…', '……')"
10000 loops, best of 3: 51.8 usec per loop
python33\python -m timeit "('ab…' * 1000).replace('…', 'x…')"
10000 loops, best of 3: 52 usec per loop
python33\python -m timeit "('ab…' * 1000).replace('…', 'œ…')"
10000 loops, best of 3: 50.3 usec per loop
python33\python -m timeit "('ab…' * 1000).replace('…', '€…')"
10000 loops, best of 3: 51.6 usec per loop
python33\python -m timeit "('XYZ' * 1000).replace('X', 'éç')"
10000 loops, best of 3: 38.3 usec per loop
python33\python -m timeit "('XYZ' * 1000).replace('Y', 'p?')"
10000 loops, best of 3: 50.3 usec per loop

python32\python -m timeit "('abc' * 1000).replace('c', 'de')"
10000 loops, best of 3: 24.5 usec per loop
python32\python -m timeit "('ab…' * 1000).replace('…', '……')"
10000 loops, best of 3: 24.7 usec per loop
python32\python -m timeit "('ab…' * 1000).replace('…', 'x…')"
10000 loops, best of 3: 24.8 usec per loop
python32\python -m timeit "('ab…' * 1000).replace('…', 'œ…')"
10000 loops, best of 3: 24 usec per loop
python32\python -m timeit "('ab…' * 1000).replace('…', '€…')"
10000 loops, best of 3: 24.1 usec per loop
python32\python -m timeit "('XYZ' * 1000).replace('X', 'éç')"
10000 loops, best of 3: 24.4 usec per loop
python32\python -m timeit "('XYZ' * 1000).replace('Y', 'p?')"
10000 loops, best of 3: 24.3 usec per loop


This is one test repeated 7 times with essentially irrelevant 
variations. The difference is less on my system (50%). Others report 
seeing 3.3 as faster. When I asked on pydev, the answer was don't bother 
making a tracker issue unless I was personally interested in 
investigating why search is relatively slow in 3.3 on Windows. Any 
change would have to not slow other operations or severely impact search 
on other systems. I suggest the same answer to you.


If you seriously want to compare old and new unicode, go to
http://hg.python.org/cpython/file/tip/Tools/stringbench/stringbench.py
and click raw to download. Run on 3.2 and 3.3, ignoring the bytes times.

Here is a version of the first comparison from stringbench (with 
from timeit import timeit):
print(timeit('''('NOW IS THE TIME FOR ALL GOOD PEOPLE TO COME TO THE AID 
OF PYTHON'* 10).lower()'''))

Results are 5.6 for 3.2 and .8 for 3.3. WOW! 3.3 is 7 times faster!

OK, not fair. I cherry picked. The 7 times speedup in 3.3 likely is at 
least partly independent of the 393 unicode change. The same test in 
stringbench for bytes is twice as fast in 3.3 as 3.2, but only 2x, not 
7x. In fact, it may have been the bytes/unicode comparison in 3.2 that 
suggested that unicode case conversion of ascii chrs might be made faster.


The sum of the 3.3 unicode times is 109 versus 110 for 3.3 bytes and 125 
for 3.2 unicode. This unweighted sum is not really fair since the raw 
times vary by a factor of at least 100. But it does suggest that anyone 
claiming that 3.3 unicode is overall 'slower' than 3.2 unicode has some 
work to do.


There is also this. On my machine, the lowest bytes-time/unicode-time 
for 3.3 is .71. This suggests that there is not a lot of fluff left in 
the unicode code, and that not much is lost by the bytes to unicode 
switch for strings.


--
Terry Jan Reedy


--
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Terry Reedy

On 8/19/2012 2:11 PM, wxjmfa...@gmail.com wrote:


Well, it seems some software producers know what they
are doing.


>>> '€'.encode('cp1252')
b'\x80'
>>> '€'.encode('mac-roman')
b'\xdb'
>>> '€'.encode('iso-8859-1')
Traceback (most recent call last):
  File "<eta last command>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac'
in position 0: ordinal not in range(256)


Yes, Python lets you choose your byte encoding from those and a hundred 
others. I believe all the codecs are now tested in both directions. It 
was not an easy task.


As to the examples: Latin-1 dates to 1985 and before and the 1988 
version was published as a standard in 1992.

https://en.wikipedia.org/wiki/Latin-1
The name euro was officially adopted on 16 December 1995.
https://en.wikipedia.org/wiki/Euro
No wonder Latin-1 does not contain the Euro sign. International 
standards organizations standards are relatively fixed. (The unicode 
consortium will not even correct misspelled character names.) Instead, 
new standards with a new number are adopted.


For better or worse, private mappings are more flexible. In its Mac 
mapping Apple replaced the generic currency sign ¤ with the euro sign 
€. (See Latin-1 reference.) Great if you use Euros, not so great if you 
were using the previous sign for something else.


Microsoft changed an unneeded code to the Euro for Windows cp-1252.
https://en.wikipedia.org/wiki/Windows-1252
"It is very common to mislabel Windows-1252 text with the charset label 
ISO-8859-1. A common result was that all the quotes and apostrophes 
(produced by "smart quotes" in Microsoft software) were replaced with 
question marks or boxes on non-Windows operating systems, making text 
difficult to read. Most modern web browsers and e-mail clients treat the 
MIME charset ISO-8859-1 as Windows-1252 in order to accommodate such 
mislabeling. This is now standard behavior in the draft HTML 5 
specification, which requires that documents advertised as ISO-8859-1 
actually be parsed with the Windows-1252 encoding."[1]


Lots of fun. Too bad Microsoft won't push utf-8 so we can all 
communicate text with much less chance of ambiguity.


--
Terry Jan Reedy


--
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Chris Angelico
On Mon, Aug 20, 2012 at 3:34 AM, Terry Reedy tjre...@udel.edu wrote:
 On 8/19/2012 4:04 AM, Paul Rubin wrote:
 I realize the folks who designed and implemented PEP 393 are very smart
 cookies and considered stuff carefully, while I'm just an internet user
 posting an immediate impression of something I hadn't seen before (I
 still use Python 2.6), but I still have to ask: if the 393 approach
 makes sense, why don't other languages do it?

 Python has often copied or borrowed, with adjustments. This time it is the
 first. We will see how it goes, but it has been tested for nearly a year
 already.

Maybe it wasn't consciously borrowed, but whatever innovation is done,
there's usually an obscure beardless language that did it earlier. :)

Pike has a single string type, which can use the full Unicode range.
If all codepoints are < 256, the string width is 8 (measured in bits);
if < 65536, width is 16; otherwise 32. Using the inbuilt count_memory
function (similar to the Python function used somewhere earlier in
this thread, but which I can't at present put my finger to), I find
that for strings of 16 bytes or more, there's a fixed 20-byte header
plus the string content, stored in the correct number of bytes. (Pike
strings, like Python ones, are immutable and do not need expansion
room.)

However, Python goes a bit further by making it VERY clear that this
is a mere optimization, and that Unicode strings and bytes strings are
completely different beasts. In Pike, it's possible to forget to
encode something before (say) writing it to a socket. Everything works
fine while you have only ASCII characters in the string, and then
breaks when you have a > 255 codepoint - or perhaps worse, when you
have a 127 < x < 256 codepoint, and the other end misinterprets it.

Really, the only viable alternative to PEP 393 is a fixed 32-bit
representation - it's the only way that's guaranteed to provide
equivalent semantics. The new storage format is guaranteed to take no
more memory than that, and provide equivalent functionality.

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Roy Smith
In article mailman.3531.1345416176.4697.python-l...@python.org,
 Chris Angelico ros...@gmail.com wrote:

 Really, the only viable alternative to PEP 393 is a fixed 32-bit
 representation - it's the only way that's guaranteed to provide
 equivalent semantics. The new storage format is guaranteed to take no
 more memory than that, and provide equivalent functionality.

In the primordial days of computing, using 8 bits to store a character 
was a profligate waste of memory.  What on earth did people need with 
TWO cases of the alphabet (not to mention all sorts of weird 
punctuation)?  Eventually, memory became cheap enough that the 
convenience of using one character per byte (not to mention 8-bit bytes) 
outweighed the costs.  And crazy things like sixbit and rad-50 got swept 
into the dustbin of history.

So it may be with utf-8 someday.

Clearly, the world has moved to a 32-bit character set.  Not all parts 
of the world know that yet, or are willing to admit it, but that doesn't 
negate the fact that it's true.  Equally clearly, the concept of one 
character per byte is a big win.  The obvious conclusion is that 
eventually, when memory gets cheap enough, we'll all be doing utf-32 and 
all this transcoding nonsense will look as antiquated as rad-50 does 
today.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread 88888 Dihedral
On Monday, August 20, 2012 1:03:34 AM UTC+8, Blind Anagram wrote:
 Steven D'Aprano  wrote in message 
 
 news:502f8a2a$0$29978$c3e8da3$54964...@news.astraweb.com...
 
 
 
 On Sat, 18 Aug 2012 01:09:26 -0700, wxjmfauth wrote:
 
 
 
 [...]
 
 If you can consistently replicate a 100% to 1000% slowdown in string
 
 handling, please report it as a performance bug:
 
 
 
 http://bugs.python.org/
 
 
 
 Don't forget to report your operating system.
 
 
 
 
 
 For interest, I ran your code snippets on my laptop (Intel core-i7 1.8GHz) 
 
 running Windows 7 x64.
 
 
 
 Running Python from a Windows command prompt,  I got the following on Python 
 
 3.2.3 and 3.3 beta 2:
 
 
 
 python33\python -m timeit "('abc' * 1000).replace('c', 'de')"
 10000 loops, best of 3: 39.3 usec per loop
 python33\python -m timeit "('ab…' * 1000).replace('…', '……')"
 10000 loops, best of 3: 51.8 usec per loop
 python33\python -m timeit "('ab…' * 1000).replace('…', 'x…')"
 10000 loops, best of 3: 52 usec per loop
 python33\python -m timeit "('ab…' * 1000).replace('…', 'œ…')"
 10000 loops, best of 3: 50.3 usec per loop
 python33\python -m timeit "('ab…' * 1000).replace('…', '€…')"
 10000 loops, best of 3: 51.6 usec per loop
 python33\python -m timeit "('XYZ' * 1000).replace('X', 'éç')"
 10000 loops, best of 3: 38.3 usec per loop
 python33\python -m timeit "('XYZ' * 1000).replace('Y', 'p?')"
 10000 loops, best of 3: 50.3 usec per loop
 
 python32\python -m timeit "('abc' * 1000).replace('c', 'de')"
 10000 loops, best of 3: 24.5 usec per loop
 python32\python -m timeit "('ab…' * 1000).replace('…', '……')"
 10000 loops, best of 3: 24.7 usec per loop
 python32\python -m timeit "('ab…' * 1000).replace('…', 'x…')"
 10000 loops, best of 3: 24.8 usec per loop
 python32\python -m timeit "('ab…' * 1000).replace('…', 'œ…')"
 10000 loops, best of 3: 24 usec per loop
 python32\python -m timeit "('ab…' * 1000).replace('…', '€…')"
 10000 loops, best of 3: 24.1 usec per loop
 python32\python -m timeit "('XYZ' * 1000).replace('X', 'éç')"
 10000 loops, best of 3: 24.4 usec per loop
 python32\python -m timeit "('XYZ' * 1000).replace('Y', 'p?')"
 10000 loops, best of 3: 24.3 usec per loop
 
 
 
 This is an average slowdown by a factor of close to 2.3 on 3.3 when compared 
 
 with 3.2.
 
 
 
 I am not posting this to perpetuate this thread but simply to ask whether, 
 
 as you suggest, I should report this as a possible problem with the beta?

Um, another set of functions for speeding up ASCII string operations 
might be needed. But it is better that Python 3.3 supports unicode strings
that are easy to use for people in different languages first.

Anyway I think Cython and Pyrex can be used to tackle this problem.


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Terry Reedy

On 8/19/2012 6:42 PM, Chris Angelico wrote:

On Mon, Aug 20, 2012 at 3:34 AM, Terry Reedy tjre...@udel.edu wrote:



Python has often copied or borrowed, with adjustments. This time it is the
first.


I should have added 'that I know of' ;-)


Maybe it wasn't consciously borrowed, but whatever innovation is done,
there's usually an obscure beardless language that did it earlier. :)

Pike has a single string type, which can use the full Unicode range.
If all codepoints are 256, the string width is 8 (measured in bits);
if 65536, width is 16; otherwise 32. Using the inbuilt count_memory
function (similar to the Python function used somewhere earlier in
this thread, but which I can't at present put my finger to), I find
that for strings of 16 bytes or more, there's a fixed 20-byte header
plus the string content, stored in the correct number of bytes. (Pike
strings, like Python ones, are immutable and do not need expansion
room.)


It is even possible that someone involved was even vaguely aware that 
there was an antecedent. The PEP makes no claim that I can see, but lays 
out the problem and goes right to details of a Python implementation.



However, Python goes a bit further by making it VERY clear that this
is a mere optimization, and that Unicode strings and bytes strings are
completely different beasts. In Pike, it's possible to forget to
encode something before (say) writing it to a socket. Everything works
fine while you have only ASCII characters in the string, and then
breaks when you have a 255 codepoint - or perhaps worse, when you
have a 127x256, and the other end misinterprets it.


Python writes strings to file objects, including open sockets, without 
creating a bytes object -- IF the file is opened in text mode, which 
always has an associated encoding, even if the default 'ascii'. From 
what you say, this is what Pike is missing.


I am pretty sure that the obvious optimization has already been done. 
The internal bytes of all-ascii text can safely be sent to a file with 
ascii (or ascii-compatible) encoding without intermediate 'decoding'. I 
remember several patches of that sort. If a string is internally ucs2 
and the file is declared ucs2 or utf-16 encoding, then again, pairs of 
bytes can go directly (possibly with a byte swap).



--
Terry Jan Reedy

--
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Chris Angelico
On Mon, Aug 20, 2012 at 10:35 AM, Terry Reedy tjre...@udel.edu wrote:
 On 8/19/2012 6:42 PM, Chris Angelico wrote:
 However, Python goes a bit further by making it VERY clear that this
 is a mere optimization, and that Unicode strings and bytes strings are
 completely different beasts. In Pike, it's possible to forget to
 encode something before (say) writing it to a socket. Everything works
 fine while you have only ASCII characters in the string, and then
 breaks when you have a 255 codepoint - or perhaps worse, when you
 have a 127x256, and the other end misinterprets it.

 Python writes strings to file objects, including open sockets, without
 creating a bytes object -- IF the file is opened in text mode, which always
 has an associated encoding, even if the default 'ascii'. From what you say,
 this is what Pike is missing.

In text mode, the library does the encoding, but an encoding still happens.

 I am pretty sure that the obvious optimization has already been done. The
 internal bytes of all-ascii text can safely be sent to a file with ascii (or
 ascii-compatible) encoding without intermediate 'decoding'. I remember
 several patches of that sort. If a string is internally ucs2 and the file is
 declared usc2 or utf-16 encoding, then again, pairs of bytes can go directly
 (possibly with a byte swap).

Maybe it doesn't take any memory change, but there is a data type
change. A Unicode string cannot be sent over the network; an encoding
is needed.

In Pike, I can take a string like "\x20AC" (or "\u20ac" or
"\U20ac", same thing) and manipulate it as a one-character string,
but I cannot write it to a file or file-like object. I can, however,
pass it through a codec (and there's string_to_utf8() for the
convenience of the common case), and get back something like
"\xe2\x82\xac", which is a three-byte string. The thing is, though,
that this new string is of exactly the same data type as the original:
'string'. Which means that I could have a string containing Latin-1
but not ASCII characters, and Pike will happily write it to a socket
without raising a compile-time or run-time error. Python, under the
same circumstances, would either raise an error or quietly (and
correctly) encode the data.

But this is a relatively trivial point, in the scheme of things.
Python has an excellent model now for handling Unicode strings, and I
would STRONGLY recommend everyone to upgrade to 3.3.

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Steven D'Aprano
On Sun, 19 Aug 2012 19:24:30 -0400, Roy Smith wrote:

 In the primordial days of computing, using 8 bits to store a character
 was a profligate waste of memory.  What on earth did people need with
 TWO cases of the alphabet 

That's obvious, surely? We need two cases so that we can distinguish 
"helping Jack off a horse" from "helping jack off a horse".


 (not to mention all sorts of weird
 punctuation)?  Eventually, memory became cheap enough that the
 convenience of using one character per byte (not to mention 8-bit bytes)
 outweighed the costs.  And crazy things like sixbit and rad-50 got swept
 into the dustbin of history.

8 bit bytes are much older than 8 bit characters. For a long time, ASCII 
characters used only 7 bits out of the 8.


 So it may be with utf-8 someday.

Only if you believe that people's ability to generate data will remain 
lower than people's ability to install more storage.

Every few years, new sizes for storage media come out. The first thing 
that happens is that people say "40 megabytes? I'll NEVER fill this hard 
drive up!". The second thing that happens is that they say "Dammit, my 40 
MB hard drive is full, and a new one is too expensive, better delete some 
files". Followed shortly by "400 megabytes? I'll NEVER use that much 
space!" -- wash, rinse, repeat, through megabytes, gigabytes, terabytes, 
and it will happen for petabytes next.

So long as our ability to outstrip storage continues, compression and 
memory-efficient storage schemes will remain in demand.



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Roy Smith
In article 5031bb2f$0$29972$c3e8da3$54964...@news.astraweb.com,
 Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote:

  So it may be with utf-8 someday.
 
 Only if you believe that people's ability to generate data will remain 
 lower than people's ability to install more storage.

We're not talking *data*, we're talking *text*.  Most of those 
whatever-bytes people are generating are images, video, and music.  Text 
is a pittance compared to those.

In any case, text on disk can easily be stored compressed.  I would 
expect the UTF-8 and UTF-32 versions of a text file to compress to just 
about the same size.
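
That expectation is easy to sanity-check (a sketch with made-up sample
text; real corpora will differ in detail):

    import zlib
    text = ('Most generated bytes are images, video, and music; '
            'text is a pittance compared to those.\n') * 1000
    for enc in ('utf-8', 'utf-32'):
        data = text.encode(enc)
        print(enc, len(data), len(zlib.compress(data, 9)))
    # The raw UTF-32 form is about 4x larger, but the compressed
    # sizes land in the same ballpark.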
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread wxjmfauth
>>> sys.version
'3.2.3 (default, Apr 11 2012, 07:15:24) [MSC v.1500 32 bit (Intel)]'
>>> timeit.timeit("('ab…' * 1000).replace('…', '……')")
37.32762490493721
>>> timeit.timeit("('ab…' * 10).replace('…', 'œ…')")
0.8158757139801764

>>> sys.version
'3.3.0b2 (v3.3.0b2:4972a8f1b2aa, Aug 12 2012, 15:02:36) [MSC v.1600 32 bit 
(Intel)]'
>>> imeit.timeit("('ab…' * 1000).replace('…', '……')")
61.919225272152346
>>> timeit.timeit("('ab…' * 10).replace('…', 'œ…')")
1.2918679017971044

>>> timeit.timeit("('ab…' * 10).replace('…', '€…')")
1.2484133226156757

* I intuitively and empirically noticed that this happens for
cp1252 or mac-roman characters and not for characters which are
elements of the latin-1 coding scheme.

* Bad luck, such characters are usual characters in French scripts
(and in some other European language).

* I do not recall the extreme cases I found. Believe me, when
I'm speaking about a few 100%, I do not lie.

My take of the subject.

This is a typical Python disease. Do not solve a problem, but
find a way, a workaround, which is expected to solve a problem
and which finally solves nothing. As far as I know, to break
the BMP limit, the tools are here. They are called utf-8 or
ucs-4/utf-32.

One day, I fell on a very, very old mail message, dating from the
time of the introduction of the unicode type in Python 2.
If I recall correctly it was from Victor Stinner. He wrote
something like this: "Let's go with ucs-4, and the problems
are solved for ever." He was so right.

I have been spying on the dev-list for years; my feeling is that
there is always a latent and permanent conflict between
ascii users and non ascii users (see the unicode
literal reintroduction).

Please, do not get me wrong. As a non-computer scientist,
I'm very happy with Python. But if I try to take a more distant
view, I become more and more sceptical.

PS Py3.3b2 is still crashing, silently exiting, with
cp65001.

jmf
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Steven D'Aprano
On Sat, 18 Aug 2012 01:09:26 -0700, wxjmfauth wrote:

 >>> sys.version
 '3.2.3 (default, Apr 11 2012, 07:15:24) [MSC v.1500 32 bit (Intel)]'
 >>> timeit.timeit("('ab…' * 1000).replace('…', '……')")
 37.32762490493721
 >>> timeit.timeit("('ab…' * 10).replace('…', 'œ…')")
 0.8158757139801764
 
 >>> sys.version
 '3.3.0b2 (v3.3.0b2:4972a8f1b2aa, Aug 12 2012, 15:02:36) [MSC v.1600 32
 bit (Intel)]'
 >>> imeit.timeit("('ab…' * 1000).replace('…', '……')")
 61.919225272152346

imeit?

It is hard to take your results seriously when you have so obviously 
edited your timing results, not just copied and pasted them.


Here are my results, on my laptop running Debian Linux. First, testing on 
Python 3.2:

steve@runes:~$ python3.2 -m timeit "('abc' * 1000).replace('c', 'de')"
10000 loops, best of 3: 50.2 usec per loop
steve@runes:~$ python3.2 -m timeit "('ab…' * 1000).replace('…', '……')"
10000 loops, best of 3: 45.3 usec per loop
steve@runes:~$ python3.2 -m timeit "('ab…' * 1000).replace('…', 'x…')"
10000 loops, best of 3: 51.3 usec per loop
steve@runes:~$ python3.2 -m timeit "('ab…' * 1000).replace('…', 'œ…')"
10000 loops, best of 3: 47.6 usec per loop
steve@runes:~$ python3.2 -m timeit "('ab…' * 1000).replace('…', '€…')"
10000 loops, best of 3: 45.9 usec per loop
steve@runes:~$ python3.2 -m timeit "('XYZ' * 1000).replace('X', 'éç')"
10000 loops, best of 3: 57.5 usec per loop
steve@runes:~$ python3.2 -m timeit "('XYZ' * 1000).replace('Y', 'πЖ')"
10000 loops, best of 3: 49.7 usec per loop


As you can see, the timing results are all consistently around 50 
microseconds per loop, regardless of which characters I use, whether they 
are in Latin-1 or not. The differences between one test and another are 
not meaningful.


Now I do them again using Python 3.3:

steve@runes:~$ python3.3 -m timeit "('abc' * 1000).replace('c', 'de')"
10000 loops, best of 3: 64.3 usec per loop
steve@runes:~$ python3.3 -m timeit "('ab…' * 1000).replace('…', '……')"
10000 loops, best of 3: 67.8 usec per loop
steve@runes:~$ python3.3 -m timeit "('ab…' * 1000).replace('…', 'x…')"
10000 loops, best of 3: 66 usec per loop
steve@runes:~$ python3.3 -m timeit "('ab…' * 1000).replace('…', 'œ…')"
10000 loops, best of 3: 67.6 usec per loop
steve@runes:~$ python3.3 -m timeit "('ab…' * 1000).replace('…', '€…')"
10000 loops, best of 3: 68.3 usec per loop
steve@runes:~$ python3.3 -m timeit "('XYZ' * 1000).replace('X', 'éç')"
10000 loops, best of 3: 67.9 usec per loop
steve@runes:~$ python3.3 -m timeit "('XYZ' * 1000).replace('Y', 'πЖ')"
10000 loops, best of 3: 66.9 usec per loop

The results are all consistently around 67 microseconds. So Python's 
string handling is about 30% slower in the examples shown here.

If you can consistently replicate a 100% to 1000% slowdown in string 
handling, please report it as a performance bug:


http://bugs.python.org/

Don't forget to report your operating system.



 My take of the subject.
 
 This is a typical Python disease. Do not solve a problem, but find a
 way, a workaround, which is expected to solve a problem and which
 finally solves nothing. As far as I know, to break the BMP limit, the
 tools are here. They are called utf-8 or ucs-4/utf-32.

The problem with UCS-4 is that every character requires four bytes. 
Every. Single. One.

So under UCS-4, the pure-ascii string "hello world" takes 44 bytes plus 
the object overhead. Under UCS-2, it takes half that space: 22 bytes, but 
of course UCS-2 can only represent characters in the BMP. A pure ASCII 
string would only take 11 bytes, but we're not going back to pure ASCII.

(There is an extension to UCS-2, UTF-16, which encodes non-BMP characters 
using two code points. This is fragile and doesn't work very well, 
because string-handling methods can break the surrogate pairs apart, 
leaving you with invalid unicode string. Not good.)
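
For instance, in a hypothetical narrow-build session (3.2 or earlier; on
3.3 the string behaves as a single character and the unpacking fails, as
it should):

    s = chr(0xFFFF + 1)   # U+10000, a surrogate pair on narrow builds
    len(s)                # -> 2, not 1
    a, b = s              # two lone surrogates, '\ud800' and '\udc00'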

The difference between 44 bytes and 22 bytes for one little string is not 
very important, but when you double the memory required for every single 
string it becomes huge. Remember that every class, function and method 
has a name, which is a string; every attribute and variable has a name, 
all strings; functions and classes have doc strings, all strings. Strings 
are used everywhere in Python, and doubling the memory needed by Python 
means that it will perform worse.

With PEP 393, each Python string will be stored in the most efficient 
format possible:

- if it only contains ASCII characters, it will be stored using 1 byte 
per character;

- if it only contains characters in the BMP, it will be stored using 
UCS-2 (2 bytes per character);

- if it contains non-BMP characters, the string will be stored using 
UCS-4 (4 bytes per character).
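
A quick way to watch the format selection happen (a sketch for 3.3; the
reported sizes include per-object headers and vary by platform):

    import sys
    for s in ('hello', 'h\xe9llo', 'h\u20acllo', 'h\U00010000llo'):
        width = 1 if max(s) <= '\xff' else 2 if max(s) <= '\uffff' else 4
        print(repr(s), 'bytes/char:', width, 'getsizeof:', sys.getsizeof(s))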



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread wxjmfauth
Le samedi 18 août 2012 14:27:23 UTC+2, Steven D'Aprano a écrit :
 [...]
 The problem with UCS-4 is that every character requires four bytes. 
 [...]

I'm aware of this (and all the blah blah blah you are
explaining). It is always the same song. Memory.

Let me ask. Is Python an 'american product' for us-users
or is it a tool for everybody [*]?
Is there any reason why non ascii users are somehow penalized
compared to ascii users?

This flexible string representation is a regression (ascii users
or not).

I recognize that in practice the real impact is for many users
close to zero (including me), but I have shown (I think) that
this flexible representation is, by design, not as optimal
as it is supposed to be. This is in my mind the relevant point.

[*] This is not even true, if we consider the €uro currency
symbol used all around the world (banking, accounting
applications).

jmf
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Ian Kelly
(Resending this to the list because I previously sent it only to
Steven by mistake.  Also showing off a case where top-posting is
reasonable, since this bit requires no context. :-)

On Sat, Aug 18, 2012 at 1:41 AM, Ian Kelly ian.g.ke...@gmail.com wrote:

On Aug 17, 2012 10:17 PM, Steven D'Aprano
 steve+comp.lang.pyt...@pearwood.info wrote:

 Unicode strings are not represented as Latin-1 internally. Latin-1 is a
 byte encoding, not a unicode internal format. Perhaps you mean to say
 that they are represented as a single byte format?

 They are represented as a single-byte format that happens to be equivalent
 to Latin-1, because Latin-1 is a proper subset of Unicode; every character
 representable in Latin-1 has a byte value equal to its Unicode codepoint.
 This talk of whether it's a byte encoding or a 1-byte Unicode representation
 is then just semantics. Even the PEP refers to the 1-byte representation as
 Latin-1.


  I understand the complaint
  to be that while the change is great for strings that happen to fit in
  Latin-1, it is less efficient than previous versions for strings that
  do not.
 
  That's not the way I interpreted the PEP 393.  It takes a pure unicode
  string, finds the largest code point in that string, and chooses 1, 2 or
  4 bytes for every character, based on how many bits it'd take for that
  largest code point.

 That's how I interpret it too.

 I don't see how this is any different from what I described. Using all 4
 bytes of the code point, you get UCS-4. Truncating to 2 bytes, you get
 UCS-2. Truncating to 1 byte, you get Latin-1.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Mark Lawrence

On 18/08/2012 16:07, wxjmfa...@gmail.com wrote:

Le samedi 18 août 2012 14:27:23 UTC+2, Steven D'Aprano a écrit :

[...]
The problem with UCS-4 is that every character requires four bytes.
[...]


I'm aware of this (and all the blah blah blah you are
explaining). It is always the same song. Memory.

Let me ask. Is Python an 'american product' for us-users
or is it a tool for everybody [*]?
Is there any reason why non ascii users are somehow penalized
compared to ascii users?

This flexible string representation is a regression (ascii users
or not).

I recognize that in practice the real impact is for many users
close to zero (including me), but I have shown (I think) that
this flexible representation is, by design, not as optimal
as it is supposed to be. This is in my mind the relevant point.

[*] This is not even true, if we consider the €uro currency
symbol used all around the world (banking, accounting
applications).

jmf



Sorry but you've got me completely baffled.  Could you please explain in 
words of one syllable or less so I can attempt to grasp what the hell 
you're on about?


--
Cheers.

Mark Lawrence.

--
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Chris Angelico
On Sun, Aug 19, 2012 at 1:07 AM,  wxjmfa...@gmail.com wrote:
 I'm aware of this (and all the blah blah blah you are
 explaining). It is always the same song. Memory.

 Let me ask. Is Python an 'american product' for us-users
 or is it a tool for everybody [*]?
 Is there any reason why non ascii users are somehow penalized
 compared to ascii users?

Regardless of your own native language, len is the name of a popular
Python function. And dict is a well-used class. Both those names are
representable in ASCII, even if every quoted string in your code
requires more bytes to store.

And memory usage has significance in many other areas, too. CPU cache
utilization turns a space saving into a time saving. That's why
structure packing still exists, even though member alignment has other
advantages.

You'd be amazed how many non-USA strings still fit inside seven bits,
too. Are you appending a space to something? Splitting on newlines?
You'll have lots of strings that are going now to be space-optimized.
Of course, the performance gains from shortening some of the strings
may be offset by costs when comparing one-byte and multi-byte strings,
but presumably that's all been gone into in great detail elsewhere.
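
To illustrate (a made-up example; exact sizes vary by build):

    import sys
    line = 'prix unitaire: 10 €'           # mostly-ASCII non-USA text
    for word in line.split():
        print(repr(word), sys.getsizeof(word))
    # Only the '€' piece needs a wider representation; every other
    # piece gets the compact one-byte-per-char format.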

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Ian Kelly
On Sat, Aug 18, 2012 at 9:07 AM,  wxjmfa...@gmail.com wrote:
 Le samedi 18 août 2012 14:27:23 UTC+2, Steven D'Aprano a écrit :
 [...]
 The problem with UCS-4 is that every character requires four bytes.
 [...]

 I'm aware of this (and all the blah blah blah you are
 explaining). It is always the same song. Memory.

 Let me ask. Is Python an 'american product' for us-users
 or is it a tool for everybody [*]?
 Is there any reason why non ascii users are somehow penalized
 compared to ascii users?

The change does not just benefit ASCII users.  It primarily benefits
anybody using a wide unicode build with strings mostly containing only
BMP characters.  Even for narrow build users, there is the benefit
that with approximately the same amount of memory usage in most cases,
they no longer have to worry about non-BMP characters sneaking in and
breaking their code.

There is some additional benefit for Latin-1 users, but this has
nothing to do with Python.  If Python is going to have the option of a
1-byte representation (and as long as we have the flexible
representation, I can see no reason not to), then it is going to be
Latin-1 by definition, because that's what 1-byte Unicode (UCS-1, if
you will) is.  If you have an issue with that, take it up with the
designers of Unicode.


 This flexible string representation is a regression (ascii users
 or not).

 I recognize that in practice the real impact is for many users
 close to zero (including me), but I have shown (I think) that
 this flexible representation is, by design, not as optimal
 as it is supposed to be. This is in my mind the relevant point.

You've shown nothing of the sort.  You've demonstrated only one out of
many possible benchmarks, and other users on this list can't even
reproduce that.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread wxjmfauth
Sorry guys, I'm not stupid (I think). I can open IDLE with
Py 3.2 or Py 3.3 and compare string manipulations. Py 3.3 is
always slower. Period.

Now, the reason. I think it is due to the flexible representation.

Deeper reason. The bosses do not wish to hear about a (pure)
ucs-4/utf-32 engine (this has been discussed I do not know
how many times).

jmf
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Chris Angelico
On Sun, Aug 19, 2012 at 2:38 AM,  wxjmfa...@gmail.com wrote:
 Sorry guys, I'm not stupid (I think). I can open IDLE with
 Py 3.2 or Py 3.3 and compare string manipulations. Py 3.3 is
 always slower. Period.

Ah, but what about all those other operations that use strings under
the covers? As mentioned, namespace lookups do, among other things.
And how is performance in the (very real) case where a C routine wants
to return a value to Python as a string, where the data is currently
guaranteed to be ASCII (previously using PyUnicode_FromString, now
able to use PyUnicode_FromKindAndData)? Again, I'm sure this has been
gone into in great detail before the PEP was accepted (am I
negative-bikeshedding here? atomic reactoring???), and I'm sure that
the gains outweigh the costs.

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Mark Lawrence

On 18/08/2012 17:38, wxjmfa...@gmail.com wrote:

Sorry guys, I'm not stupid (I think). I can open IDLE with
Py 3.2 or Py 3.3 and compare string manipulations. Py 3.3 is
always slower. Period.


Proof that is acceptable to everybody please, not just yourself.



Now, the reason. I think it is due to the flexible representation.

Deeper reason. The bosses do not wish to hear about a (pure)
ucs-4/utf-32 engine (this has been discussed I do not know
how many times).

jmf



--
Cheers.

Mark Lawrence.

--
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Steven D'Aprano
On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:

 Le samedi 18 août 2012 14:27:23 UTC+2, Steven D'Aprano a écrit :
 [...]
 The problem with UCS-4 is that every character requires four bytes.
 [...]
 
 I'm aware of this (and all the blah blah blah you are explaining). This
 is always the same song. Memory.

Exactly. The reason it is always the same song is because it is an 
important song.

 
 Let me ask. Is Python an 'american product for us-users or is it a tool
 for everybody [*]?

It is a product for everyone, which is exactly why PEP 393 is so 
important. PEP 393 means that users who have only a few non-BMP 
characters don't have to pay the cost of UCS-4 for every single string in 
their application, only for the ones that actually require it. PEP 393 
means that using Unicode strings is now cheaper for everybody.

You seem to be arguing that the way forward is not to make Unicode 
cheaper for everyone, but to make ASCII strings more expensive so that 
everyone suffers equally. I reject that idea.


 Is there any reason why non ascii users are somehow penalized compared
 to ascii users?

Of course there is a reason.

If you want to represent 1114111 different characters in a string, as 
Unicode supports, you can't use a single byte per character, or even two 
bytes. That is a fact of basic mathematics. Supporting 1114111 characters 
must be more expensive than supporting 128 of them.

But why should you carry the cost of 4-bytes per character just because 
someday you *might* need a non-BMP character?
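
(The arithmetic behind that claim, as a quick check:)

    print((1114111).bit_length())   # 21 -- the full range needs 21 bits
    # Two bytes give only 16 bits (65536 values), so the next practical
    # fixed width is four bytes.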



 This flexible string representation is a regression (ascii users or
 not).

No it is not. It is a great step forward to more efficient Unicode.

And it means that now Python can correctly deal with non-BMP characters 
without the nonsense of UTF-16 surrogates:

steve@runes:~$ python3.3 -c "print(len(chr(1114000)))"  # Right!
1
steve@runes:~$ python3.2 -c "print(len(chr(1114000)))"  # Wrong!
2

without doubling the storage of every string.

This is an important step towards making the full range of Unicode 
available more widely.

 
 I recognize in practice the real impact is for many users close to zero

Then what's the problem?


 (including me) but I have shown (I think) that this flexible
 representation is, by design, not as optimal as it is supposed to be.

You have not shown any real problem at all. 

You have shown untrustworthy, edited timing results that don't match what 
other people are reporting.

Even if your timing results are genuine, you haven't shown that they make 
any difference for real code that does useful work.



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread wxjmfauth
Le samedi 18 août 2012 19:28:26 UTC+2, Mark Lawrence a écrit :
 
 Proof that is acceptable to everybody please, not just yourself.
 
 
I can't; I'm only facing the fact it works slower on my
Windows platform.

As I understand (I think) the underlying mechanism, I
can only say it is not a surprise that it happens.

Imagine an editor. I type an 'a', internally the text is
saved as ascii, then I type an 'é', the text can only
be saved in at least latin-1. Then I enter a '€', the text
becomes an internal ucs-4 string. Then remove the '€', and so
on.

Intuitively I expect there is some kind of slowdown from
all these string conversions.

When I tested this flexible representation a few months
ago, at the first alpha release, this is precisely what
I tested: string manipulations which force this internal
change. I concluded the result is not brilliant. Really,
a factor 0.n up to 10.

These are simply my conclusions.

Related question.

Does anybody know a way to get the size of the internal
string in bytes? In the narrow or wide build it is easy,
I can encode with the unicode_internal codec. In Py 3.3, 
I attempted to toy with sizeof and struct, but without
success.
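
(One approach that seems to work, as a sketch: sys.getsizeof includes a
per-object header whose size itself varies with the representation, so
only the differences between strings of the same kind are meaningful.)

    import sys
    for ch in ('a', '\xe9', '\u20ac', '\U00010001'):
        s1, s2 = ch * 10, ch * 11
        print(ascii(ch), sys.getsizeof(s2) - sys.getsizeof(s1))
    # The deltas expose the bytes/codepoint: 1, 1, 2, 4 on CPython 3.3.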

jmf
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Paul Rubin
Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes:
 (There is an extension to UCS-2, UTF-16, which encodes non-BMP characters 
 using two code points. This is fragile and doesn't work very well, 
 because string-handling methods can break the surrogate pairs apart, 
 leaving you with invalid unicode string. Not good.)
...
 With PEP 393, each Python string will be stored in the most efficient 
 format possible:

Can you explain the issue of breaking surrogate pairs apart a little
more?  Switching between encodings based on the string contents seems
silly at first glance.  Strings are immutable so I don't understand why
not use UTF-8 or UTF-16 for everything.  UTF-8 is more efficient in
Latin-based alphabets and UTF-16 may be more efficient for some other
languages.  I think even UCS-4 doesn't completely fix the surrogate pair
issue if it means the only thing I can think of.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread MRAB

On 18/08/2012 19:05, wxjmfa...@gmail.com wrote:

Le samedi 18 août 2012 19:28:26 UTC+2, Mark Lawrence a écrit :

Proof that is acceptable to everybody please, not just yourself.


I can't; I'm only facing the fact it works slower on my
Windows platform.

As I understand (I think) the underlying mechanism, I
can only say it is not a surprise that it happens.

Imagine an editor. I type an 'a', internally the text is
saved as ascii, then I type an 'é', the text can only
be saved in at least latin-1. Then I enter a '€', the text
becomes an internal ucs-4 string. Then remove the '€', and so
on.


[snip]

'a' will be stored as 1 byte/codepoint.

Adding 'é', it will still be stored as 1 byte/codepoint.

Adding '€', it will then be stored as 2 bytes/codepoint.

But then you wouldn't be adding them one at a time in Python, you'd be
building a list and then joining them together in one operation.
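
(A sketch of that idiom:)

    chars = []
    for ch in ('a', '\xe9', '\u20ac'):
        chars.append(ch)
    # ''.join picks the widest kind needed in one pass, instead of
    # creating ever-wider intermediate strings as repeated '+' would
    text = ''.join(chars)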
--
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread wxjmfauth
Le samedi 18 août 2012 19:59:18 UTC+2, Steven D'Aprano a écrit :
 On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:
  Le samedi 18 août 2012 14:27:23 UTC+2, Steven D'Aprano a écrit :
  [...]
  The problem with UCS-4 is that every character requires four bytes.
  [...]
 
  I'm aware of this (and all the blah blah blah you are explaining). This
  is always the same song. Memory.
 
 Exactly. The reason it is always the same song is because it is an
 important song.

No offense here. But this is an *american* answer.

The same story as the coding of text files, where utf-8 == ascii
and the rest of the world doesn't count.

jmf

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread rusi
On Aug 18, 10:59 pm, Steven D'Aprano steve
+comp.lang.pyt...@pearwood.info wrote:
 On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:
  Is there any reason why non ascii users are somehow penalized compared
  to ascii users?

 Of course there is a reason.

 If you want to represent 1114111 different characters in a string, as
 Unicode supports, you can't use a single byte per character, or even two
 bytes. That is a fact of basic mathematics. Supporting 1114111 characters
 must be more expensive than supporting 128 of them.

 But why should you carry the cost of 4-bytes per character just because
 someday you *might* need a non-BMP character?

I am reminded of: 
http://answers.microsoft.com/thread/720108ee-0a9c-4090-b62d-bbd5cb1a7605

Original above does not open for me but here's a copy that does:

http://onceuponatimeinindia.blogspot.in/2009/07/hard-drive-weight-increasing.html
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread MRAB

On 18/08/2012 19:26, Paul Rubin wrote:

Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes:

(There is an extension to UCS-2, UTF-16, which encodes non-BMP characters
using two code points. This is fragile and doesn't work very well,
because string-handling methods can break the surrogate pairs apart,
leaving you with invalid unicode string. Not good.)

...

With PEP 393, each Python string will be stored in the most efficient
format possible:


Can you explain the issue of breaking surrogate pairs apart a little
more?  Switching between encodings based on the string contents seems
silly at first glance.  Strings are immutable so I don't understand why
not use UTF-8 or UTF-16 for everything.  UTF-8 is more efficient in
Latin-based alphabets and UTF-16 may be more efficient for some other
languages.  I think even UCS-4 doesn't completely fix the surrogate pair
issue if it means the only thing I can think of.


On a narrow build, codepoints outside the BMP are stored as a surrogate
pair (2 codepoints). On a wide build, all codepoints can be represented
without the need for surrogate pairs.

The problem with strings containing surrogate pairs is that you could
inadvertently slice the string in the middle of the surrogate pair.
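
(Concretely, comparing a 3.2 narrow build with 3.3 -- a sketch:)

    s = '\U00010040'      # one non-BMP character
    print(len(s))         # 3.2 narrow build: 2;  3.3: 1
    print(ascii(s[:1]))   # 3.2 narrow build: '\ud800', a lone surrogate
                          # 3.3: the whole character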
--
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Mark Lawrence

On 18/08/2012 19:30, wxjmfa...@gmail.com wrote:

Le samedi 18 août 2012 19:59:18 UTC+2, Steven D'Aprano a écrit :

On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:

Le samedi 18 août 2012 14:27:23 UTC+2, Steven D'Aprano a écrit :

[...]

The problem with UCS-4 is that every character requires four bytes.

[...]

I'm aware of this (and all the blah blah blah you are explaining). This
is always the same song. Memory.

Exactly. The reason it is always the same song is because it is an
important song.

No offense here. But this is an *american* answer.

The same story as the coding of text files, where utf-8 == ascii
and the rest of the world doesn't count.

jmf



Thinking about it I entirely agree with you.  Steven D'Aprano strikes me 
as typically American, in the same way that I'm typically Brazilian :)


--
Cheers.

Mark Lawrence.

--
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Mark Lawrence

On 18/08/2012 19:40, rusi wrote:

On Aug 18, 10:59 pm, Steven D'Aprano steve
+comp.lang.pyt...@pearwood.info wrote:

On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:

Is there any reason why non ascii users are somehow penalized compared
to ascii users?


Of course there is a reason.

If you want to represent 1114111 different characters in a string, as
Unicode supports, you can't use a single byte per character, or even two
bytes. That is a fact of basic mathematics. Supporting 1114111 characters
must be more expensive than supporting 128 of them.

But why should you carry the cost of 4-bytes per character just because
someday you *might* need a non-BMP character?


I am reminded of: 
http://answers.microsoft.com/thread/720108ee-0a9c-4090-b62d-bbd5cb1a7605

Original above does not open for me but here's a copy that does:

http://onceuponatimeinindia.blogspot.in/2009/07/hard-drive-weight-increasing.html



ROFLMAO doesn't adequately sum up how much I laughed.

--
Cheers.

Mark Lawrence.

--
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Terry Reedy

On 8/18/2012 12:38 PM, wxjmfa...@gmail.com wrote:

Sorry guys, I'm not stupid (I think). I can open IDLE with
Py 3.2 or Py 3.3 and compare string manipulations. Py 3.3 is
always slower. Period.


You have not tried enough tests ;-).

On my Win7-64 system:
from timeit import timeit

print(timeit("'a'*1"))
3.3.0b2: .5
3.2.3: .8

print(timeit("c in a", "c = '…'; a = 'a'*1"))
3.3: .05 (independent of len(a)!)
3.2: 5.8 -- 100 times slower! Increase len(a) and the ratio can be made as 
high as one wants!


print(timeit("a.encode()", "a = 'a'*1000"))
3.2: 1.5
3.3:  .26

Similar with encoding='utf-8' added to call.

Jim, please stop the ranting. It does not help improve Python. utf-32 is 
not a panacea; it has problems of time, space, and system compatibility 
(Windows and others). Victor Stinner, whatever he may have once thought 
and said, put a *lot* of effort into making the new implementation both 
correct and fast.


On your replace example
timeit.timeit(('ab…' * 1000).replace('…', '……'))
61.919225272152346
timeit.timeit(('ab…' * 10).replace('…', 'œ…'))
1.2918679017971044

I do not see the point of changing both length and replacement. For me, 
the time is about the same for either replacement. I do see about the 
same slowdown ratio for 3.3 versus 3.2. I also see it for pure search 
without replacement.


print(timeit("c in a", "c = '…'; a = 'a'*1000+c"))
# .6 in 3.2.3, 1.2 in 3.3.0

This does not make sense to me and I will ask about it.

--
Terry Jan Reedy


--
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread wxjmfauth
Le samedi 18 août 2012 20:40:23 UTC+2, rusi a écrit :
 On Aug 18, 10:59 pm, Steven D'Aprano steve
 +comp.lang.pyt...@pearwood.info wrote:
  On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:
   Is there any reason why non ascii users are somehow penalized compared
   to ascii users?
 
  Of course there is a reason.
 
  If you want to represent 1114111 different characters in a string, as
  Unicode supports, you can't use a single byte per character, or even two
  bytes. That is a fact of basic mathematics. Supporting 1114111 characters
  must be more expensive than supporting 128 of them.
 
  But why should you carry the cost of 4-bytes per character just because
  someday you *might* need a non-BMP character?
 
 I am reminded of:
 http://answers.microsoft.com/thread/720108ee-0a9c-4090-b62d-bbd5cb1a7605
 
 Original above does not open for me but here's a copy that does:
 http://onceuponatimeinindia.blogspot.in/2009/07/hard-drive-weight-increasing.html

I think it's time to leave the discussion and to go to bed.

You can take the problem the way you wish, Python 3.3 is slower
than Python 3.2.

If you see the present status as an optimisation, I'm considering
this as a regression.

I'm pretty sure a pure ucs-4/utf-32 engine is, by nature, the only
correct solution.

To be extreme, tools using pure utf-16 or utf-32 are, at least,
considering all the citizens of this planet in the same way.

jmf
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Mark Lawrence

On 18/08/2012 21:22, wxjmfa...@gmail.com wrote:

Le samedi 18 août 2012 20:40:23 UTC+2, rusi a écrit :

On Aug 18, 10:59 pm, Steven D'Aprano steve
+comp.lang.pyt...@pearwood.info wrote:

On Sat, 18 Aug 2012 08:07:05 -0700, wxjmfauth wrote:

Is there any reason why non ascii users are somehow penalized compared
to ascii users?

Of course there is a reason.

If you want to represent 1114111 different characters in a string, as
Unicode supports, you can't use a single byte per character, or even two
bytes. That is a fact of basic mathematics. Supporting 1114111 characters
must be more expensive than supporting 128 of them.

But why should you carry the cost of 4-bytes per character just because
someday you *might* need a non-BMP character?

I am reminded of:
http://answers.microsoft.com/thread/720108ee-0a9c-4090-b62d-bbd5cb1a7605

Original above does not open for me but here's a copy that does:

http://onceuponatimeinindia.blogspot.in/2009/07/hard-drive-weight-increasing.html


I think it's time to leave the discussion and to go to bed.


In plain English, duck out cos I'm losing.



You can take the problem the way you wish, Python 3.3 is slower
than Python 3.2.


I'll ask for the second time.  Provide proof that is acceptable to 
everybody and not just yourself.




If you see the present status as an optimisation, I'm considering
this as a regression.


Considering does not equate to proof.  Where are the figures which back 
up your claim?




I'm pretty sure a pure ucs-4/utf-32 engine is, by nature, the only
correct solution.


I look forward to seeing your patch on the bug tracker.  If and only if 
you can find something that needs patching, which from the course of 
this thread I think is highly unlikely.





To be extreme, tools using pure utf-16 or utf-32 are, at least,
considering all the citizens of this planet in the same way.

jmf




--
Cheers.

Mark Lawrence.

--
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Chris Angelico
On Sun, Aug 19, 2012 at 4:26 AM, Paul Rubin no.email@nospam.invalid wrote:
 Can you explain the issue of breaking surrogate pairs apart a little
 more?  Switching between encodings based on the string contents seems
 silly at first glance.  Strings are immutable so I don't understand why
 not use UTF-8 or UTF-16 for everything.  UTF-8 is more efficient in
 Latin-based alphabets and UTF-16 may be more efficient for some other
 languages.  I think even UCS-4 doesn't completely fix the surrogate pair
 issue if it means the only thing I can think of.

UTF-8 is highly inefficient for indexing. Given a buffer of (say) a
few thousand bytes, how do you locate the 273rd character? You have to
scan from the beginning. The same applies when surrogate pairs are
used to represent single characters, unless the representation leaks
and a surrogate is indexed as two - which is where the breaking-apart
happens.
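
To see what that scan costs, here is a sketch (utf8_index is a made-up
helper, not anything in the stdlib):

    def utf8_index(buf, n):
        # Byte offset of the n-th code point in a UTF-8 bytes object.
        # Every lead byte must be counted one at a time: O(n), not O(1).
        count = 0
        for offset, byte in enumerate(buf):
            if byte & 0xC0 != 0x80:   # not a continuation byte
                if count == n:
                    return offset
                count += 1
        raise IndexError('string index out of range')

    print(utf8_index('a\xe9\u20ac\u2026x'.encode('utf-8'), 4))
    # 9 -- it had to skip 1+2+3+3 bytes to reach the fifth character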

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Paul Rubin
Chris Angelico ros...@gmail.com writes:
 UTF-8 is highly inefficient for indexing. Given a buffer of (say) a
 few thousand bytes, how do you locate the 273rd character? 

How often do you need to do that, as opposed to traversing the string by
iteration?  Anyway, you could use a rope-like implementation, or an
index structure over the string.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Chris Angelico
On Sun, Aug 19, 2012 at 12:11 PM, Paul Rubin no.email@nospam.invalid wrote:
 Chris Angelico ros...@gmail.com writes:
 UTF-8 is highly inefficient for indexing. Given a buffer of (say) a
 few thousand bytes, how do you locate the 273rd character?

 How often do you need to do that, as opposed to traversing the string by
 iteration?  Anyway, you could use a rope-like implementation, or an
 index structure over the string.

Well, imagine if Python strings were stored in UTF-8. How would you slice it?

 "asdfqwer"[4:]
'qwer'

That's a not uncommon operation when parsing strings or manipulating
data. You'd need to completely rework your algorithms to maintain a
position somewhere.

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Paul Rubin
Chris Angelico ros...@gmail.com writes:
 "asdfqwer"[4:]
 'qwer'

 That's a not uncommon operation when parsing strings or manipulating
 data. You'd need to completely rework your algorithms to maintain a
 position somewhere.

Scanning 4 characters (or a few dozen, say) to peel off a token in
parsing a UTF-8 string is no big deal.  It gets more expensive if you
want to index far more deeply into the string.  I'm asking how often
that is done in real code.  Obviously one can concoct hypothetical
examples that would suffer.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Chris Angelico
On Sun, Aug 19, 2012 at 12:35 PM, Paul Rubin no.email@nospam.invalid wrote:
 Chris Angelico ros...@gmail.com writes:
 "asdfqwer"[4:]
 'qwer'

 That's a not uncommon operation when parsing strings or manipulating
 data. You'd need to completely rework your algorithms to maintain a
 position somewhere.

 Scanning 4 characters (or a few dozen, say) to peel off a token in
 parsing a UTF-8 string is no big deal.  It gets more expensive if you
 want to index far more deeply into the string.  I'm asking how often
 that is done in real code.  Obviously one can concoct hypothetical
 examples that would suffer.

Sure, four characters isn't a big deal to step through. But it still
makes indexing and slicing operations O(N) instead of O(1), plus you'd
have to zark the whole string up to where you want to work. It'd be
workable, but you'd have to redo your algorithms significantly; I
don't have a Python example of parsing a huge string, but I've done it
in other languages, and when I can depend on indexing being a cheap
operation, I'll happily do exactly that.

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Terry Reedy

On 8/18/2012 4:09 PM, Terry Reedy wrote:


print(timeit("c in a", "c = '…'; a = 'a'*1000+c"))
# .6 in 3.2.3, 1.2 in 3.3.0

This does not make sense to me and I will ask about it.


I did ask on the pydev list and paraphrased responses include:
1. 'My system gives opposite ratios.'
2. 'With the default of 1,000,000 repetitions in a loop, the reported times 
are microseconds per operation and thus not practically significant.'

3. 'There is a stringbench.py with a large number of such micro benchmarks.'

I believe there are also whole-application benchmarks that try to mimic 
real-world mixtures of operations.


People making improvements must consider performance on multiple systems 
and multiple benchmarks. If someone wants to work on search speed, they 
cannot just optimize that one operation on one system.


--
Terry Jan Reedy


--
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Paul Rubin
Chris Angelico ros...@gmail.com writes:
 Sure, four characters isn't a big deal to step through. But it still
 makes indexing and slicing operations O(N) instead of O(1), plus you'd
 have to zark the whole string up to where you want to work.

I know some systems chop the strings into blocks of (say) a few
hundred chars, so you can immediately get to the correct
block, then scan into the block to get to the desired char offset.
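
A sketch of that scheme (all names made up for illustration):

    BLOCK = 256   # characters per block

    class BlockedText:
        # UTF-8 storage in fixed-size character blocks: character n lives
        # in block n // BLOCK, so a lookup decodes one small block instead
        # of scanning the whole string.
        def __init__(self, text):
            self.blocks = [text[i:i + BLOCK].encode('utf-8')
                           for i in range(0, len(text), BLOCK)]

        def __getitem__(self, n):
            return self.blocks[n // BLOCK].decode('utf-8')[n % BLOCK]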

 I don't have a Python example of parsing a huge string, but I've done
 it in other languages, and when I can depend on indexing being a cheap
 operation, I'll happily do exactly that.

I'd be interested to know what the context was, where you parsed
a big unicode string in a way that required random access to
the nth character in the string.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-18 Thread Chris Angelico
On Sun, Aug 19, 2012 at 1:10 PM, Paul Rubin no.email@nospam.invalid wrote:
 Chris Angelico ros...@gmail.com writes:
 I don't have a Python example of parsing a huge string, but I've done
 it in other languages, and when I can depend on indexing being a cheap
 operation, I'll happily do exactly that.

 I'd be interested to know what the context was, where you parsed
 a big unicode string in a way that required random access to
 the nth character in the string.

It's something I've done in C/C++ fairly often. Take one big fat
buffer, slice it and dice it as you get the information you want out
of it. I'll retain and/or calculate indices (when I'm not using
pointers, but that's a different kettle of fish). Generally, I'm
working with pure ASCII, but port those same algorithms to Python and
you'll easily be able to read in a file in some known encoding and
manipulate it as Unicode.

It's not so much 'random access to the nth character' as an efficient
way of jumping forward. For instance, if I know that the next thing is
a literal string of n characters (that I don't care about), I want to
skip over that and keep parsing. The Adobe Message Format is
particularly noteworthy in this, but it's a stupid format and I don't
recommend people spend too much time reading up on it (unless you like
that sensation of your brain trying to escape through your ear).

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-17 Thread Alister
On Thu, 16 Aug 2012 15:09:47 -0700, Charles Jensen wrote:

 Everyone knows that the python command
 
  ord(u'…')
 
 will output the number 8230 which is the unicode character for the
 horizontal ellipsis.
 
 How would I use ord() to find the unicode value of a string stored in a
 variable?
 
 So the following 2 lines of code will give me the ascii value of the
 variable a.  How do I specify ord to give me the unicode value of a?
 
 a = '…'
 ord(a)





the same way you did in your original example, by defining the string as 
unicode:
a = u'...'
ord(a)
-- 
Keep on keepin' on.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-17 Thread wxjmfauth
Le vendredi 17 août 2012 01:59:31 UTC+2, Terry Reedy a écrit :
 a = '…'
 print(ord(a))
 8230
 
 Most things with unicode are easier in 3.x, and some are even better in
 3.3. The current beta is good enough for most informal work. 3.3.0 will
 be out in a month.
 
 -- 
 Terry Jan Reedy

Slightly off topic.

The character '…', Unicode name 'HORIZONTAL ELLIPSIS',
is one of these characters existing in the cp1252, mac-roman
coding schemes and not in iso-8859-1 (latin-1) and obviously
not in ascii. It causes Py3.3 to work a few 100% slower
than Py3.2 due to the flexible string representation
(ascii/latin-1/ucs-2/ucs-4) (I found cases up to 1000%).

 '…'.encode('cp1252')
b'\x85'
 '…'.encode('mac-roman')
b'\xc9'
 '…'.encode('iso-8859-1') # latin-1
Traceback (most recent call last):
  File "<eta last command>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2026'
in position 0: ordinal not in range(256)

If one could neglect this (typographically important) glyph, what
to say about the characters of the European scripts (languages)
present in cp1252 or in mac-roman but not in latin-1 (eg. the
French script/language)?

Very nice. Python 2 was built for ascii users; now Python 3 is
*optimized* for, let us say, ascii users!

The future is bright for Python. French users are better
served with Apple or MS products, simply because these
corporates know you can not write French with iso-8859-1.

PS When TeX moved from the ascii encoding to iso-8859-1
and the so-called Cork encoding, they knew this and provided
all the complementary packages to circumvent this. It was
in 199? (Python was not even born).

Ditto for the foundries (Adobe, Linotype, ...)

jmf
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-17 Thread Jerry Hill
On Fri, Aug 17, 2012 at 1:49 PM,  wxjmfa...@gmail.com wrote:
 The character '…', Unicode name 'HORIZONTAL ELLIPSIS',
 is one of these characters existing in the cp1252, mac-roman
 coding schemes and not in iso-8859-1 (latin-1) and obviously
 not in ascii. It causes Py3.3 to work a few 100% slower
 than Py3.2 due to the flexible string representation
 (ascii/latin-1/ucs-2/ucs-4) (I found cases up to 1000%).

 '…'.encode('cp1252')
 b'\x85'
 '…'.encode('mac-roman')
 b'\xc9'
 '…'.encode('iso-8859-1') # latin-1
 Traceback (most recent call last):
   File "<eta last command>", line 1, in <module>
 UnicodeEncodeError: 'latin-1' codec can't encode character '\u2026'
 in position 0: ordinal not in range(256)

 If one could neglect this (typographically important) glyph, what
 to say about the characters of the European scripts (languages)
 present in cp1252 or in mac-roman but not in latin-1 (eg. the
 French script/language)?

So... python should change the longstanding definition of the latin-1
character set?  This isn't some sort of python limitation, it's just
the reality of legacy encodings that actually exist in the real world.


 Very nice. Python 2 was built for ascii users; now Python 3 is
 *optimized* for, let us say, ascii users!

 The future is bright for Python. French users are better
 served with Apple or MS products, simply because these
 corporates know you can not write French with iso-8859-1.

 PS When TeX moved from the ascii encoding to iso-8859-1
 and the so-called Cork encoding, they knew this and provided
 all the complementary packages to circumvent this. It was
 in 199? (Python was not even born).

 Ditto for the foundries (Adobe, Linotype, ...)


I don't understand what any of this has to do with Python.  Just
output your text in UTF-8 like any civilized person in the 21st
century, and none of that is a problem at all.  Python makes that easy.
 It also makes it easy to interoperate with older encodings if you
have to.

-- 
Jerry
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-17 Thread wxjmfauth
Le vendredi 17 août 2012 20:21:34 UTC+2, Jerry Hill a écrit :
 On Fri, Aug 17, 2012 at 1:49 PM,  wxjmfa...@gmail.com wrote:
  The character '…', Unicode name 'HORIZONTAL ELLIPSIS',
  is one of these characters existing in the cp1252, mac-roman
  coding schemes and not in iso-8859-1 (latin-1) and obviously
  not in ascii. It causes Py3.3 to work a few 100% slower
  than Py3.2 due to the flexible string representation
  (ascii/latin-1/ucs-2/ucs-4) (I found cases up to 1000%).
 
  '…'.encode('cp1252')
  b'\x85'
  '…'.encode('mac-roman')
  b'\xc9'
  '…'.encode('iso-8859-1') # latin-1
  Traceback (most recent call last):
    File "<eta last command>", line 1, in <module>
  UnicodeEncodeError: 'latin-1' codec can't encode character '\u2026'
  in position 0: ordinal not in range(256)
 
  If one could neglect this (typographically important) glyph, what
  to say about the characters of the European scripts (languages)
  present in cp1252 or in mac-roman but not in latin-1 (eg. the
  French script/language)?
 
 So... python should change the longstanding definition of the latin-1
 character set?  This isn't some sort of python limitation, it's just
 the reality of legacy encodings that actually exist in the real world.
 
  Very nice. Python 2 was built for ascii users; now Python 3 is
  *optimized* for, let us say, ascii users!
 
  The future is bright for Python. French users are better
  served with Apple or MS products, simply because these
  corporates know you can not write French with iso-8859-1.
 
  PS When TeX moved from the ascii encoding to iso-8859-1
  and the so-called Cork encoding, they knew this and provided
  all the complementary packages to circumvent this. It was
  in 199? (Python was not even born).
 
  Ditto for the foundries (Adobe, Linotype, ...)
 
 I don't understand what any of this has to do with Python.  Just
 output your text in UTF-8 like any civilized person in the 21st
 century, and none of that is a problem at all.  Python makes that easy.
 It also makes it easy to interoperate with older encodings if you
 have to.

Sorry, you missed the point.

My comment had nothing to do with the code source coding,
the coding of a Python string in the code source or with
the display of a Python3 str.
I wrote about the *internal* Python coding, the
way Python keeps strings in memory. See PEP 393.

jmf
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-17 Thread Dave Angel
On 08/17/2012 02:45 PM, wxjmfa...@gmail.com wrote:
 Le vendredi 17 août 2012 20:21:34 UTC+2, Jerry Hill a écrit :
 SNIP

 I don't understand what any of this has to do with Python.  Just
 output your text in UTF-8 like any civilized person in the 21st
 century, and none of that is a problem at all.  Python makes that easy.
 It also makes it easy to interoperate with older encodings if you
 have to.

 Sorry, you missed the point.

 My comment had nothing to do with the code source coding,
 the coding of a Python string in the code source or with
 the display of a Python3 str.
 I wrote about the *internal* Python coding, the
 way Python keeps strings in memory. See PEP 393.

 jmf

The internal coding described in PEP 393 has nothing to do with latin-1
encoding.  So what IS your point?  Make it clearly, without all the
snide side-comments.



-- 

DaveA

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-17 Thread Ian Kelly
On Aug 17, 2012 2:58 PM, Dave Angel d...@davea.name wrote:

 The internal coding described in PEP 393 has nothing to do with latin-1
 encoding.

It certainly does. PEP 393 provides for Unicode strings to be represented
internally as any of Latin-1, UCS-2, or UCS-4, whichever is smallest and
sufficient to contain the data. I understand the complaint to be that while
the change is great for strings that happen to fit in Latin-1, it is less
efficient than previous versions for strings that do not.

I don't know how much merit there is to this claim. It would seem to me
that even in non-western locales, most strings are likely to be Latin-1 or
even ASCII, e.g.  class and attribute and function names.
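
(For what it's worth, the three representations are easy to observe -- a
sketch; exact figures depend on platform and header overhead:)

    import sys
    for s in ('x' * 100, '\u20ac' * 100, '\U00010001' * 100):
        print(len(s), sys.getsizeof(s))
    # The payload grows by roughly 100, 200 and 400 bytes respectively:
    # 1, 2 and 4 bytes per character, plus a per-object header.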
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-17 Thread Dave Angel
On 08/17/2012 08:21 PM, Ian Kelly wrote:
 On Aug 17, 2012 2:58 PM, Dave Angel d...@davea.name wrote:
 The internal coding described in PEP 393 has nothing to do with latin-1
 encoding.
 It certainly does. PEP 393 provides for Unicode strings to be represented
 internally as any of Latin-1, UCS-2, or UCS-4, whichever is smallest and
 sufficient to contain the data. I understand the complaint to be that while
 the change is great for strings that happen to fit in Latin-1, it is less
 efficient than previous versions for strings that do not.

That's not the way I interpreted the PEP 393.  It takes a pure unicode
string, finds the largest code point in that string, and chooses 1, 2 or
4 bytes for every character, based on how many bits it'd take for that
largest code point.   Further i read it to mean that only 00 bytes would
be dropped in the process, no other bytes would be changed.   I take it
as a coincidence that it happens to match latin-1;  that's the way
Unicode happened historically, and is not Python's fault.  Am I reading
it wrong?

I also figure this is going to be more space efficient than Python 3.2
for any string which had a max code point of 65535 or less (in Windows),
or 4billion or less (in real systems).  So unless French has code points
over 64k, I can't figure that anything is lost.

I have no idea about the times involved, so i wanted a more specific
complaint.

 I don't know how much merit there is to this claim. It would seem to me
 that even in non-western locales, most strings are likely to be Latin-1 or
 even ASCII, e.g.  class and attribute and function names.



The jmfauth rant I was responding to was saying that French isn't
efficiently encoded, and that performance of some vague operations was
somehow reduced by several fold.  I was just trying to get him to be
more specific.



-- 

DaveA

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-17 Thread Steven D'Aprano
On Fri, 17 Aug 2012 11:45:02 -0700, wxjmfauth wrote:

 Le vendredi 17 août 2012 20:21:34 UTC+2, Jerry Hill a écrit :
 On Fri, Aug 17, 2012 at 1:49 PM,  wxjmfa...@gmail.com wrote:
 
  The character '…', Unicode name 'HORIZONTAL ELLIPSIS',
  is one of these characters existing in the cp1252, mac-roman
  coding schemes and not in iso-8859-1 (latin-1) and obviously
  not in ascii. It causes Py3.3 to work a few 100% slower
  than Py3.3 versions due to the flexible string representation
  (ascii/latin-1/ucs-2/ucs-4) (I found cases up to 1000%).
[...]
 Sorry, you missed the point.
 
 My comment had nothing to do with the code source coding, the coding of
 a Python string in the code source or with the display of a Python3
 str.
 I wrote about the *internal* Python coding, the way Python keeps
 strings in memory. See PEP 393.


The PEP does not support your claim that flexible string storage is 100% 
to 1000% slower. It claims 1% - 30% slowdown, with a saving of up to 60% 
of the memory used for strings.

I don't really understand what message you are trying to give here. Are 
you saying that PEP 393 is a good thing or a bad thing?

In Python 1.x, there was no support for Unicode at all. You could only 
work with pure byte strings. Support for non-ascii characters like … ∞ é ñ
£ π Ж ش was purely by accident -- if your terminal happened to be set to 
an encoding that supported a character, and you happened to use the 
appropriate byte value, you might see the character you wanted.

In Python 2.0, Python gained support for Unicode. You could now guarantee 
support for any Unicode character in the Basic Multilingual Plane (BMP) 
by writing your strings using the u"..." style. In Python 3, you no 
longer need the leading u; all strings are unicode.

But there is a problem: if your Python interpreter is a narrow build, 
it *only* supports Unicode characters in the BMP. When Python is a wide 
build, compiled with support for the additional character planes, then 
strings take much more memory, even if they are in the BMP, or are simple 
ASCII strings.

PEP 393 fixes this problem and gets rid of the distinction between narrow 
and wide builds. From Python 3.3 onwards, all Python compilers will have 
the same support for unicode, rather than most being BMP-only. Each 
individual string's internal storage will use only as many bytes-per-
character as needed to store the largest character in the string.

This will save a lot of memory for those using mostly ASCII or Latin-1 
but a few multibyte characters. While the increased complexity causes a 
small slowdown, the increased functionality makes it well worthwhile.



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-17 Thread Steven D'Aprano
On Fri, 17 Aug 2012 23:30:22 -0400, Dave Angel wrote:

 On 08/17/2012 08:21 PM, Ian Kelly wrote:
 On Aug 17, 2012 2:58 PM, Dave Angel d...@davea.name wrote:
 The internal coding described in PEP 393 has nothing to do with
 latin-1 encoding.
 It certainly does. PEP 393 provides for Unicode strings to be
 represented internally as any of Latin-1, UCS-2, or UCS-4, whichever is
 smallest and sufficient to contain the data. 

Unicode strings are not represented as Latin-1 internally. Latin-1 is a 
byte encoding, not a unicode internal format. Perhaps you mean to say 
that they are represented as a single byte format?

 I understand the complaint
 to be that while the change is great for strings that happen to fit in
 Latin-1, it is less efficient than previous versions for strings that
 do not.
 
 That's not the way I interpreted the PEP 393.  It takes a pure unicode
 string, finds the largest code point in that string, and chooses 1, 2 or
 4 bytes for every character, based on how many bits it'd take for that
 largest code point.

That's how I interpret it too.


 Further I read it to mean that only 00 bytes would
 be dropped in the process, no other bytes would be changed.

Just to clarify, you aren't talking about the \0 character, but only 
about extraneous padding 00 bytes.


 I also figure this is going to be more space efficient than Python 3.2
 for any string which had a max code point of 65535 or less (in Windows),
 or 4billion or less (in real systems).  So unless French has code points
 over 64k, I can't figure that anything is lost.

I think that on narrow builds, it won't make terribly much difference. 
The big savings are for wide builds.


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


How do I display unicode value stored in a string variable using ord()

2012-08-16 Thread Charles Jensen
Everyone knows that the python command

 ord(u'…')

will output the number 8230 which is the unicode character for the horizontal 
ellipsis.

How would I use ord() to find the unicode value of a string stored in a 
variable?  

So the following 2 lines of code will give me the ascii value of the variable 
a.  How do I specify ord to give me the unicode value of a?

 a = '…'
 ord(a)
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-16 Thread Chris Angelico
On Fri, Aug 17, 2012 at 8:09 AM, Charles Jensen
hopefullychar...@gmail.com wrote:
 How would I use ord() to find the unicode value of a string stored in a 
 variable?

 So the following 2 lines of code will give me the ascii value of the variable 
 a.  How do I specify ord to give me the unicode value of a?

  a = '…'
  ord(a)

I presume you're talking about Python 2, because in Python 3 your
string variable is a Unicode string and will behave as you describe
above.

You'll need to look into what the encoding is, and figure it out from there.

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-16 Thread Dave Angel
On 08/16/2012 06:09 PM, Charles Jensen wrote:
 Everyone knows that the python command

  ord(u'…')

 will output the number 8230 which is the unicode character for the horizontal 
 ellipsis.

 How would I use ord() to find the unicode value of a string stored in a 
 variable?  

 So the following 2 lines of code will give me the ascii value of the variable 
 a.  How do I specify ord to give me the unicode value of a?

  a = '…'
  ord(a)

You omitted the print statement.  You also didn't specify what version
of Python you're using;  I'll assume Python 2.x because in Python 3.x,
the u"xx" notation would have been a syntax error.

To get the ord of a unicode variable, you do it the same as a unicode
literal:

    a = u"j"    # note: for this to work reliably, you probably
                # need the correct Unicode declaration in line 2 of the file
    print ord(a)

But if you have a byte string containing some binary bits, and you want
to get a unicode character value out of it, you'll need to explicitly
convert it to unicode.

First, decide which encoding the byte string was encoded with.  If you
specify the wrong encoding, you'll likely get an exception, or maybe just
a nonsense answer.

    a = "\xe2\x80\xa6"      # the UTF-8 bytes of the ellipsis character
    b = a.decode("utf-8")
    print ord(b)            # prints 8230

(A made-up value such as "\xc1\xc1" is not valid utf8, and decoding it
raises UnicodeDecodeError.)



-- 

DaveA

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-16 Thread Terry Reedy

a = '…'
print(ord(a))

8230
Most things with unicode are easier in 3.x, and some are even better in 
3.3. The current beta is good enough for most informal work. 3.3.0 will 
be out in a month.


--
Terry Jan Reedy


--
http://mail.python.org/mailman/listinfo/python-list