Re: RE Module Performance
On Wed, Jul 31, 2013 at 6:45 AM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: if you care about minimizing every possible byte, you should use a low-level language like C. Then you can give every character 21 bits, and be happy that you don't waste even one bit. Could go one better! Since not every character has been assigned, and some are specifically banned (e.g. U+FFFE and U+D800-U+DFFF), you could cut them out of your representation system and save memory! ChrisA -- http://mail.python.org/mailman/listinfo/python-list
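For what it's worth, the arithmetic behind the joke still holds. A back-of-the-envelope count in Python (illustrative only; it assumes just the 2048 surrogates and the 66 noncharacters are cut) shows you would still need 21 bits per character:

import math

total = 0x110000                 # code points U+0000 through U+10FFFF
surrogates = 0xE000 - 0xD800     # U+D800-U+DFFF: 2048 banned code points
noncharacters = 66               # U+FDD0-U+FDEF plus U+xxFFFE/U+xxFFFF in each plane

usable = total - surrogates - noncharacters
print(usable)                        # 1111998
print(math.ceil(math.log2(usable)))  # 21 -- no bits saved after all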
Re: RE Module Performance
On 31-07-13 05:30, Michael Torrie wrote: On 07/30/2013 12:19 PM, Antoon Pardon wrote: So? Why are you making this a point of discussion? I was not aware that the pros and cons of various editor buffer implementations were relevant to the point I was trying to make. I for one found it very interesting. In fact this thread caused me to wonder how one actually does create an efficient editor. Off the original topic, true, but still very interesting. Yes, it can be interesting. But I really think if that is what you want to discuss, it deserves its own subject thread. -- Antoon Pardon -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On 30-07-13 21:09, wxjmfa...@gmail.com wrote: Mutable, immutable, copying + xxx, buffering, O(n)... Yes, but conceptually the reencoding happens sometime, somewhere. Which is a far cry from your previous claim that it happened every time you enter a char. This of course makes your case harder to argue, because the impact of something that happens sometime, somewhere is vastly less than something that happens every time you enter a char. The internal ucs-2 will never automagically be transformed into ucs-4 (e.g.). It will just start producing wrong results when someone starts using characters that don't fit into ucs-2.

>>> timeit.timeit("'a'*1 + '€'")
7.087220684719967
>>> timeit.timeit("'a'*1 + 'z'")
1.5685214234430873
>>> timeit.timeit("z = 'a'*1; z = z + '€'")
7.169538866162213
>>> timeit.timeit("z = 'a'*1; z = z + 'z'")
1.5815893830557286
>>> timeit.timeit("z = 'a'*1; z += 'z'")
1.606955741596181
>>> timeit.timeit("z = 'a'*1; z += '€'")
7.160483334521416

And do not forget, in a pure utf coding scheme, your char or a char will *never* be larger than 4 bytes.

>>> sys.getsizeof('a')
26
>>> sys.getsizeof('\U000101000')
48

Nonsense.

>>> sys.getsizeof('a'.encode('utf-8'))
18

-- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
FSR: === The 'a' in 'a€' and 'a\U0001d11e':

>>> ['{:#010b}'.format(c) for c in 'a€'.encode('utf-16-be')]
['0b00000000', '0b01100001', '0b00100000', '0b10101100']
>>> ['{:#010b}'.format(c) for c in 'a\U0001d11e'.encode('utf-32-be')]
['0b00000000', '0b00000000', '0b00000000', '0b01100001', '0b00000000', '0b00000001', '0b11010001', '0b00011110']

Has to be done.

>>> sys.getsizeof('a€')
42
>>> sys.getsizeof('a\U0001d11e')
48
>>> sys.getsizeof('aa')
27

Unicode/utf* i) (primary key) Create and use a unique set of encoded code points. ii) (secondary key) Depending on the wish, memory/performance: utf-8/16/32. Two advantages in the light of the above example: iii) The 'a' never has to be reencoded. iv) The size of an 'a' never exceeds 4 bytes. Hard job to solve/satisfy i), ii), iii) and iv) at the same time. Is it possible? ;-) The solution is in the problem. jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On 31-07-13 10:32, wxjmfa...@gmail.com wrote: Unicode/utf* i) (primary key) Create and use a unique set of encoded code points. FSR does this.

>>> st1 = 'a€'
>>> st2 = 'aa'
>>> ord(st1[0])
97
>>> ord(st2[0])
97

ii) (secondary key) Depending on the wish, memory/performance: utf-8/16/32. Whose wish? I don't know any language that allows the programmer to choose the internal representation of its strings. If it is the designer's choice, FSR does this; if it is the programmer's choice, I don't see why this is necessary for compliance. Two advantages in the light of the above example: iii) The 'a' never has to be reencoded. FSR: check. Using a container with wider slots is not a reëncoding. If such widening is encoding, then your 'choice' between utf-8/16/32 implies that it will also have to reencode when it changes from utf-8 to utf-16 or utf-32. iv) The size of an 'a' never exceeds 4 bytes. FSR: check. Hard job to solve/satisfy i), ii), iii) and iv) at the same time. Is it possible? ;-) The solution is in the problem. Maybe you should use bytes or bytearrays if that is really what you want. -- Antoon Pardon -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On 07/31/2013 01:23 AM, Antoon Pardon wrote: On 31-07-13 05:30, Michael Torrie wrote: On 07/30/2013 12:19 PM, Antoon Pardon wrote: So? Why are you making this a point of discussion? I was not aware that the pros and cons of various editor buffer implementations were relevant to the point I was trying to make. I for one found it very interesting. In fact this thread caused me to wonder how one actually does create an efficient editor. Off the original topic, true, but still very interesting. Yes, it can be interesting. But I really think if that is what you want to discuss, it deserves its own subject thread. Subject lines can and should be changed to reflect the ebbs and flows of the discussion. In fact this thread's subject should have been changed a long time ago, since the original topic was RE module performance! -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On 07/31/2013 02:32 AM, wxjmfa...@gmail.com wrote: Unicode/utf* Why do you keep using the terms utf and Unicode interchangeably? -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Wednesday, July 31, 2013 07:45:18 UTC+2, Steven D'Aprano wrote: On Tue, 30 Jul 2013 12:09:11 -0700, wxjmfauth wrote: And do not forget, in a pure utf coding scheme, your char or a char will *never* be larger than 4 bytes.

>>> sys.getsizeof('a')
26
>>> sys.getsizeof('\U000101000')
48

Neither character above is larger than 4 bytes. You forgot to deduct the size of the object header. Python is a high-level object-oriented language; if you care about minimizing every possible byte, you should use a low-level language like C. Then you can give every character 21 bits, and be happy that you don't waste even one bit. -- Steven ... char never consumes or requires more than 4 bytes ... jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Wed, Jul 31, 2013 at 9:15 PM, wxjmfa...@gmail.com wrote: ... char never consumes or requires more than 4 bytes ... The integer 5 should be able to be stored in 3 bits.

>>> sys.getsizeof(5)
14

Clearly Python is doing something really horribly wrong here. In fact, sys.getsizeof needs to be changed to return a float, to allow it to more properly reflect these important facts. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Sunday, July 28, 2013 05:53:22 UTC+2, Ian wrote: On Sat, Jul 27, 2013 at 12:21 PM, wxjmfa...@gmail.com wrote: Back to utf. utfs are not only elements of a unique set of encoded code points. They have an interesting feature. Each utf chunk holds intrinsically the character (in fact the code point) it is supposed to represent. In utf-32, the obvious case, it is just the code point. In utf-8, it's the first chunk which helps, and utf-16 is a mixed case (utf-8 / utf-32). In other words, in an implementation using bytes, for any pointer position it is always possible to find the corresponding encoded code point and from this the corresponding character without any programmed information. See my editor example: how to find the char under the caret? In fact, a silly example: how can the caret be positioned or moved, if the underlying corresponding encoded code point cannot be discerned! Yes, given a pointer location into a utf-8 or utf-16 string, it is easy to determine the identity of the code point at that location. But this is not often a useful operation, save for resynchronization in the case that the string data is corrupted. The caret of an editor does not conceptually correspond to a pointer location, but to a character index. Given a particular character index (e.g. 127504), an editor must be able to determine the identity and/or the memory location of the character at that index, and for UTF-8 and UTF-16 without an auxiliary data structure that is an O(n) operation. -- Same conceptual mistake as Steven's example with its buffers: the buffer does not know it holds characters. This is not the point to discuss. - I am pretty sure that once you have typed your 127504 ascii characters, you are very happy the buffer of your editor does not waste time in reencoding the buffer as soon as you enter an €, the 127505th char. Sorry, I wanted to say z instead of euro, just to show that backspacing the last char and reentering a new char implies twice a reencoding. Somebody wrote "FSR is just an optimization". Yes, but in the case of an editor à la FSR, this optimization takes place every time you enter a char. Your poor editor, in fact the FSR, is finally spending its time in optimizing, and finally it optimizes nothing. (It is even worse.) If you correctly type a z instead of an €, it is not necessary to reencode the buffer. Problem: how do you know that you do not have to reencode? Simple, just check it; and by checking it you waste time testing whether you have to optimize or not, and hurt a little bit more what is supposed to be an optimization. Do not confuse the process of optimization and the result of optimization (funny, it's like the utf's). There is a trick to let the editor know if it has to be optimized: just put some flag somewhere. Then you fall on the Houston syndrome: Houston, we've got a problem, our buffer consumes many more bytes than expected.

>>> sys.getsizeof('€')
40
>>> sys.getsizeof('a')
26

Now the good news. In an editor à la FSR, the composition is not so important. You know, practicality beats purity. The hard job is the text rendering engine and the handling of the font (even in a raw unicode editor). And as these tools are luckily not working à la FSR (probably because they understand the coding of the characters), your editor is still working not so badly. jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On 30-07-13 16:01, wxjmfa...@gmail.com wrote: I am pretty sure that once you have typed your 127504 ascii characters, you are very happy the buffer of your editor does not waste time in reencoding the buffer as soon as you enter an €, the 127505th char. Sorry, I wanted to say z instead of euro, just to show that backspacing the last char and reentering a new char implies twice a reencoding. Using a single string as an editor buffer is a bad idea in Python for the simple reason that strings are immutable. So adding characters would mean continuously copying the string buffer into a new string with the next character added. Copying 127504 characters into a new string will not make that much of a difference whether the octets are just copied to octets or are unpacked into 32-bit words. Somebody wrote "FSR is just an optimization". Yes, but in the case of an editor à la FSR, this optimization takes place every time you enter a char. Your poor editor, in fact the FSR, is finally spending its time in optimizing, and finally it optimizes nothing. (It is even worse.) Even if you would do it this way, it would *not* take place every time you enter a char. Once your buffer contained a wide character, it would just need to convert the single character that is added after each keystroke. It would not need to convert the whole buffer after each keystroke. If you correctly type a z instead of an €, it is not necessary to reencode the buffer. Problem: how do you know that you do not have to reencode? Simple, just check it; and by checking it you waste time testing whether you have to optimize or not, and hurt a little bit more what is supposed to be an optimization. Your scenario is totally unrealistic: first of all because of the immutable nature of Python strings, second because you suggest that real-time usage would result in frequent conversions, which is highly unlikely. -- Antoon Pardon -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Tue, Jul 30, 2013 at 3:01 PM, wxjmfa...@gmail.com wrote: I am pretty sure that once you have typed your 127504 ascii characters, you are very happy the buffer of your editor does not waste time in reencoding the buffer as soon as you enter an €, the 127505th char. Sorry, I wanted to say z instead of euro, just to show that backspacing the last char and reentering a new char implies twice a reencoding. You're still thinking that the editor's buffer is a Python string. As I've shown earlier, this is a really bad idea, and that has nothing to do with FSR/PEP 393. An immutable string is *horribly* inefficient at this; if you want to keep concatenating onto a string, the recommended method is a list of strings that gets join()d at the end, and the same technique works well here. Here's a little demo class that could make the basis for such a system:

class EditorBuffer:
    def __init__(self, fn):
        self.fn = fn
        self.buffer = [open(fn).read()]
    def insert(self, pos, char):
        if pos == 0:
            # Special case: insertion at beginning of buffer
            if len(self.buffer[0]) > 1024:
                self.buffer.insert(0, char)
            else:
                self.buffer[0] = char + self.buffer[0]
            return
        for idx, part in enumerate(self.buffer):
            l = len(part)
            if pos > l:
                pos -= l
                continue
            if pos < l:
                # Cursor is somewhere inside this string
                splitme = self.buffer[idx]
                self.buffer[idx:idx+1] = splitme[:pos], splitme[pos:]
                l = pos
            # Cursor is now at the end of this string
            if l > 1024:
                self.buffer[idx:idx+1] = self.buffer[idx], char
            else:
                self.buffer[idx] += char
            return
        raise ValueError("Cannot insert past end of buffer")
    def __str__(self):
        return ''.join(self.buffer)
    def save(self):
        open(self.fn, "w").write(str(self))

It guarantees that inserts will never need to resize more than 1KB of text. As a real basis for an editor, it still sucks, but it's purely to prove this one point. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
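To make the chunking behaviour above concrete, here is a quick usage sketch (the file name and sizes are made up for illustration; it assumes the EditorBuffer class just shown):

# Hypothetical demo: a 2000-char ASCII file, then one non-latin-1 insertion.
with open("demo.txt", "w") as f:
    f.write("a" * 2000)

buf = EditorBuffer("demo.txt")
buf.insert(2000, "€")      # chunk is full (>1024), so '€' becomes a new chunk
print(len(buf.buffer))     # 2 -- the 2000 'a's were never copied or widened
print(str(buf)[-3:])       # 'aa€'

Only the final join in str() touches the whole text; typing a wide character never forces the big ASCII chunk to be recoded, which is exactly the point being argued.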
Re: RE Module Performance
On 30/07/2013 15:38, Antoon Pardon wrote: On 30-07-13 16:01, wxjmfa...@gmail.com wrote: I am pretty sure that once you have typed your 127504 ascii characters, you are very happy the buffer of your editor does not waste time in reencoding the buffer as soon as you enter an €, the 127505th char. Sorry, I wanted to say z instead of euro, just to show that backspacing the last char and reentering a new char implies twice a reencoding. Using a single string as an editor buffer is a bad idea in Python for the simple reason that strings are immutable. Using a single string as an editor buffer is a bad idea in _any_ language because an insertion would require all the following characters to be moved. So adding characters would mean continuously copying the string buffer into a new string with the next character added. Copying 127504 characters into a new string will not make that much of a difference whether the octets are just copied to octets or are unpacked into 32-bit words. Somebody wrote "FSR is just an optimization". Yes, but in the case of an editor à la FSR, this optimization takes place every time you enter a char. Your poor editor, in fact the FSR, is finally spending its time in optimizing, and finally it optimizes nothing. (It is even worse.) Even if you would do it this way, it would *not* take place every time you enter a char. Once your buffer contained a wide character, it would just need to convert the single character that is added after each keystroke. It would not need to convert the whole buffer after each keystroke. If you correctly type a z instead of an €, it is not necessary to reencode the buffer. Problem: how do you know that you do not have to reencode? Simple, just check it; and by checking it you waste time testing whether you have to optimize or not, and hurt a little bit more what is supposed to be an optimization. Your scenario is totally unrealistic: first of all because of the immutable nature of Python strings, second because you suggest that real-time usage would result in frequent conversions, which is highly unlikely. What you would have is a list of mutable chunks. Inserting into a chunk would be fast, and a chunk would be split if it's already full. Also, small adjacent chunks would be joined together. Finally, a chunk could use FSR to reduce memory usage. -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On 30-07-13 18:13, MRAB wrote: On 30/07/2013 15:38, Antoon Pardon wrote: On 30-07-13 16:01, wxjmfa...@gmail.com wrote: I am pretty sure that once you have typed your 127504 ascii characters, you are very happy the buffer of your editor does not waste time in reencoding the buffer as soon as you enter an €, the 127505th char. Sorry, I wanted to say z instead of euro, just to show that backspacing the last char and reentering a new char implies twice a reencoding. Using a single string as an editor buffer is a bad idea in Python for the simple reason that strings are immutable. Using a single string as an editor buffer is a bad idea in _any_ language because an insertion would require all the following characters to be moved. Not if you use a gap buffer. -- Antoon Pardon. -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On 30/07/2013 17:39, Antoon Pardon wrote: On 30-07-13 18:13, MRAB wrote: On 30/07/2013 15:38, Antoon Pardon wrote: On 30-07-13 16:01, wxjmfa...@gmail.com wrote: I am pretty sure that once you have typed your 127504 ascii characters, you are very happy the buffer of your editor does not waste time in reencoding the buffer as soon as you enter an €, the 127505th char. Sorry, I wanted to say z instead of euro, just to show that backspacing the last char and reentering a new char implies twice a reencoding. Using a single string as an editor buffer is a bad idea in Python for the simple reason that strings are immutable. Using a single string as an editor buffer is a bad idea in _any_ language because an insertion would require all the following characters to be moved. Not if you use a gap buffer. The disadvantage there is that when you move the cursor you must move characters around. For example, what if the cursor was at the start and you wanted to move it to the end? Also, when the gap has been filled, you need to make a new one. -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On 31 July 2013 00:01, wxjmfa...@gmail.com wrote: I am pretty sure that once you have typed your 127504 ascii characters, you are very happy the buffer of your editor does not waste time in reencoding the buffer as soon as you enter an €, the 127505th char. Sorry, I wanted to say z instead of euro, just to show that backspacing the last char and reentering a new char implies twice a reencoding. And here we come to the root of your complete misunderstanding and mischaracterisation of the FSR. You don't appear to understand that strings in Python are immutable and that to add a character to an existing string requires copying the entire string + new character. In your hypothetical situation above, you have already performed 127504 copy + new character operations before you ever get to a single widening operation. The overhead of the copy + new character repeated 127504 times dwarfs the overhead of a single widening operation. Given your misunderstanding, it's no surprise that you are focused on microbenchmarks that demonstrate that copying entire strings and adding a character can be slower in some situations than others. When the only use case you have is implementing the buffer of an editor using an immutable string, I can fully understand why you would be concerned about the performance of adding and removing individual characters. However, in that case *you're focused on the wrong problem*. Until you can demonstrate an understanding that doing the above in any language which has immutable strings is completely insane, you will have no credibility, and the only interest anyone will pay to your posts is refuting your FUD so that people new to the language are not driven off by you. Tim Delaney -- http://mail.python.org/mailman/listinfo/python-list
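Tim's point about repeated copying is easy to demonstrate. A minimal, illustrative timing sketch (the sizes and loop counts are arbitrary choices here; absolute numbers will vary by machine):

import timeit

# Appending one char at a time to an immutable string copies it every step.
# Keeping a second reference (t = s) defeats CPython's in-place resize trick,
# so this really is quadratic:
quadratic = "s = ''\nfor c in 'a' * 1000:\n    t = s\n    s = s + c"
# Collecting pieces in a list and joining once avoids the repeated copies:
linear = "parts = []\nfor c in 'a' * 1000:\n    parts.append(c)\ns = ''.join(parts)"

print(timeit.timeit(quadratic, number=100))  # grows roughly with n**2
print(timeit.timeit(linear, number=100))     # grows roughly with n

The 127504 copy-plus-one-character operations in the scenario above are the quadratic case; a single widening operation is lost in that noise.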
Re: RE Module Performance
On 30 July 2013 17:39, Antoon Pardon antoon.par...@rece.vub.ac.be wrote: On 30-07-13 18:13, MRAB wrote: On 30/07/2013 15:38, Antoon Pardon wrote: On 30-07-13 16:01, wxjmfa...@gmail.com wrote: I am pretty sure that once you have typed your 127504 ascii characters, you are very happy the buffer of your editor does not waste time in reencoding the buffer as soon as you enter an €, the 127505th char. Sorry, I wanted to say z instead of euro, just to show that backspacing the last char and reentering a new char implies twice a reencoding. Using a single string as an editor buffer is a bad idea in Python for the simple reason that strings are immutable. Using a single string as an editor buffer is a bad idea in _any_ language because an insertion would require all the following characters to be moved. Not if you use a gap buffer. Additionally, who says a language couldn't use, say, B-Trees for all of its list-like types, including strings? -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On 30-07-13 19:14, MRAB wrote: On 30/07/2013 17:39, Antoon Pardon wrote: On 30-07-13 18:13, MRAB wrote: On 30/07/2013 15:38, Antoon Pardon wrote: On 30-07-13 16:01, wxjmfa...@gmail.com wrote: I am pretty sure that once you have typed your 127504 ascii characters, you are very happy the buffer of your editor does not waste time in reencoding the buffer as soon as you enter an €, the 127505th char. Sorry, I wanted to say z instead of euro, just to show that backspacing the last char and reentering a new char implies twice a reencoding. Using a single string as an editor buffer is a bad idea in Python for the simple reason that strings are immutable. Using a single string as an editor buffer is a bad idea in _any_ language because an insertion would require all the following characters to be moved. Not if you use a gap buffer. The disadvantage there is that when you move the cursor you must move characters around. For example, what if the cursor was at the start and you wanted to move it to the end? Also, when the gap has been filled, you need to make a new one. So? Why are you making this a point of discussion? I was not aware that the pros and cons of various editor buffer implementations were relevant to the point I was trying to make. If you prefer another data structure in the editor you are working on, I will not dissuade you. -- Antoon Pardon -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
Mutable, immutable, copying + xxx, buffering, O(n)... Yes, but conceptually the reencoding happens sometime, somewhere. The internal ucs-2 will never automagically be transformed into ucs-4 (e.g.).

>>> timeit.timeit("'a'*1 + '€'")
7.087220684719967
>>> timeit.timeit("'a'*1 + 'z'")
1.5685214234430873
>>> timeit.timeit("z = 'a'*1; z = z + '€'")
7.169538866162213
>>> timeit.timeit("z = 'a'*1; z = z + 'z'")
1.5815893830557286
>>> timeit.timeit("z = 'a'*1; z += 'z'")
1.606955741596181
>>> timeit.timeit("z = 'a'*1; z += '€'")
7.160483334521416

And do not forget, in a pure utf coding scheme, your char or a char will *never* be larger than 4 bytes.

>>> sys.getsizeof('a')
26
>>> sys.getsizeof('\U000101000')
48

jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Tue, Jul 30, 2013 at 8:09 PM, wxjmfa...@gmail.com wrote: Mutable, immutable, copying + xxx, buffering, O(n)... Yes, but conceptually the reencoding happens sometime, somewhere. The internal ucs-2 will never automagically be transformed into ucs-4 (e.g.). But probably not on the entire document. With even a brainless scheme like I posted code for, no more than 1024 bytes will need to be recoded at a time (except in some odd edge cases, and even then, no more than once for any given file). And do not forget, in a pure utf coding scheme, your char or a char will *never* be larger than 4 bytes.

>>> sys.getsizeof('a')
26
>>> sys.getsizeof('\U000101000')
48

Yeah, you have a few odd issues like, oh, I dunno, GC overhead, reference count, object class, and string length, all stored somewhere there. Honestly jmf, if you want raw assembly you know where to get it. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On 7/30/2013 1:40 PM, Joshua Landau wrote: Additionally, who says a language couldn't use, say, B-Trees for all of its list-like types, including strings? Tk apparently uses a B-tree in its text widget. -- Terry Jan Reedy -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
MRAB: The disadvantage there is that when you move the cursor you must move characters around. For example, what if the cursor was at the start and you wanted to move it to the end? Also, when the gap has been filled, you need to make a new one. The normal technique is to only move the gap when text is added or removed, not when the cursor moves. Code that reads the contents, such as for display, handles the gap by checking the requested position and using a different offset when the position is after the gap. Gap buffers work well because changes are generally close to the previous change, so require moving only a relatively small amount of text. Even an occasional move of the whole contents won't cause too much trouble for interactivity with current processors moving multiple megabytes per millisecond. Neil -- http://mail.python.org/mailman/listinfo/python-list
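A gap buffer along the lines Neil describes can be sketched in a few lines of Python. Purely illustrative (a two-stack variant; real editors use one contiguous array with an actual gap), and all names here are made up:

class GapBuffer:
    # Characters before the cursor live in `pre`; characters after it
    # in `post`, kept reversed so both ends can push/pop in O(1).
    def __init__(self, text=""):
        self.pre = list(text)
        self.post = []          # reversed tail
    def move_gap(self, pos):
        # Only called on edits -- merely moving the cursor costs nothing.
        while len(self.pre) > pos:
            self.post.append(self.pre.pop())
        while len(self.pre) < pos:
            self.pre.append(self.post.pop())
    def insert(self, pos, char):
        self.move_gap(pos)
        self.pre.append(char)   # repeated typing at one spot is O(1) each
    def delete(self, pos):
        self.move_gap(pos)      # removes the character just after `pos`
        if self.post:
            self.post.pop()
    def __str__(self):
        return ''.join(self.pre) + ''.join(reversed(self.post))

buf = GapBuffer("a" * 127504)
buf.insert(127504, "€")   # gap is already at the end: nothing is shuffled

A jump from one end of the buffer to the other before an edit does cost one big shuffle, which is exactly the occasional whole-contents move Neil mentions.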
Re: RE Module Performance
On 07/30/2013 12:19 PM, Antoon Pardon wrote: So? Why are you making this a point of discussion? I was not aware that the pros and cons of various editor buffer implementations were relevant to the point I was trying to make. I for one found it very interesting. In fact this thread caused me to wonder how one actually does create an efficient editor. Off the original topic, true, but still very interesting. -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On 07/30/2013 01:09 PM, wxjmfa...@gmail.com wrote: Mutable, immutable, copying + xxx, buffering, O(n)... Yes, but conceptually the reencoding happens sometime, somewhere. The internal ucs-2 will never automagically be transformed into ucs-4 (e.g.). So what major Python project are you working on where you've found FSR in general to be a problem? Maybe we can help you work out a more appropriate data structure and algorithm to use. But if you're not developing something, and not developing in Python, perhaps you should withdraw and let us use our horrible FSR in peace, because it doesn't seem to bother the vast majority of Python programmers, and does not bother some large Python projects out there. In fact I think most of us welcome integrated, correct, full unicode. -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Tue, 30 Jul 2013 12:09:11 -0700, wxjmfauth wrote: And do not forget, in a pure utf coding scheme, your char or a char will *never* be larger than 4 bytes.

>>> sys.getsizeof('a')
26
>>> sys.getsizeof('\U000101000')
48

Neither character above is larger than 4 bytes. You forgot to deduct the size of the object header. Python is a high-level object-oriented language; if you care about minimizing every possible byte, you should use a low-level language like C. Then you can give every character 21 bits, and be happy that you don't waste even one bit. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On 26-07-13 15:21, wxjmfa...@gmail.com wrote: Hint: To understand Unicode (and every coding scheme), you should understand utf. The how and the *why*. No you don't. You are mixing the information with how the information is coded. utf is like base64, a way of coding the information that is useful for storage or transfer. But once you have decoded the byte stream, you no longer need any understanding of base64 to process your information. Likewise, once you have decoded the byte stream into unicode information you don't need knowledge of utf to process unicode strings. -- Antoon Pardon -- http://mail.python.org/mailman/listinfo/python-list
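A small sketch of that point, using nothing but the standard codecs (the string is an arbitrary example):

raw = "a€".encode("utf-8")       # transfer/storage form: b'a\xe2\x82\xac'
s = raw.decode("utf-8")          # decode once at the boundary...
print(len(s), ord(s[1]))         # 2 8364 -- processing sees code points,
                                 # not utf-8 chunks; the encoding is gone
# Any utf would have carried the same information:
print(s == "a€".encode("utf-16").decode("utf-16"))   # True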
Re: RE Module Performance
On 28-07-13 20:19, Joshua Landau wrote: On 28 July 2013 09:45, Antoon Pardon antoon.par...@rece.vub.ac.be wrote: On 27-07-13 20:21, wxjmfa...@gmail.com wrote: utf-8 or any (utf) never need and never spend their time in reencoding. So? That Python sometimes needs to do some kind of background processing is not a problem, whether it is garbage collection, allocating more memory, shuffling around data blocks or reencoding a string; that doesn't matter. If you've got a real-world example where one of those things noticeably slows your program down or makes the program behave faultily, then you have something that is worthy of attention. Somewhat off topic, but befitting of the triviality of this thread, do I understand correctly that you are saying garbage collection never causes any noticeable slowdown in real-world circumstances? That's not remotely true. No, that is not what I am saying. But if jmf were complaining about garbage collection in an analogous way to how he is complaining about the FSR, he wouldn't be complaining about real-world circumstances but about theoretical possibilities and micro-benchmarks. In those circumstances the garbage collection problem wouldn't be worthy of much attention. -- Antoon Pardon -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On 28-07-13 21:30, wxjmfa...@gmail.com wrote: To be short, this is *never* the FSR, always something else. Suggestion: start by solving all these micro-benchmarks, all the memory cases. It's a good start, no? There is nothing to solve. Unicode doesn't force implementations to use the same size of memory for strings of the same length. So you pointing out examples of same-length strings that don't use the same size of memory doesn't point at something that must be solved. -- Antoon Pardon -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Sun, Jul 28, 2013 at 11:14 PM, Joshua Landau jos...@landau.ws wrote: GC does sometimes have a severe impact in memory-constrained environments, though. See http://sealedabstract.com/rants/why-mobile-web-apps-are-slow/, about half-way down, specifically http://sealedabstract.com/wp-content/uploads/2013/05/Screen-Shot-2013-05-14-at-10.15.29-PM.png. The best verification of these graphs I could find was https://blog.mozilla.org/nnethercote/category/garbage-collection/, although it's not immediately clear in Chrome's and Opera's case, mainly due to none of the benchmarks pushing memory usage significantly. I also don't quite agree with the first post (sealedabstract) because I get by *fine* on 2GB memory, so I don't see why you can't on a phone. Maybe iOS is just really heavy. Nonetheless, the benchmarks aren't lying. The ultimate in non-managed memory (the opposite of a GC) would have to be the assembly language programming I did in my earlier days, firing up DEBUG.EXE and writing a .COM file that lived inside a single 64KB segment for everything (256-byte Program Segment Prefix, then code, then initialized data, then uninitialized data and stack), crashing the computer with remarkable ease. Everything higher level than that (even malloc/free) has its conveniences and its costs, usually memory wastage. If you malloc random-sized blocks, free them at random, and ensure that your total allocated size stays below some limit, you'll still eventually run yourself out of memory. This is unsurprising. The only question is, how bad is the wastage and how much time gets devoted to it? ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: FSR and unicode compliance - was Re: RE Module Performance
On Sunday, July 28, 2013 22:52:16 UTC+2, Steven D'Aprano wrote: On Sun, 28 Jul 2013 12:23:04 -0700, wxjmfauth wrote: Do not forget that an à la FSR mechanism for a non-ascii user is *irrelevant*. You have been told repeatedly, Python's internals are *full* of ASCII-only strings.

py> dir(list)
['__add__', '__class__', '__contains__', '__delattr__', '__delitem__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__iadd__', '__imul__', '__init__', '__iter__', '__le__', '__len__', '__lt__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__rmul__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', 'append', 'clear', 'copy', 'count', 'extend', 'index', 'insert', 'pop', 'remove', 'reverse', 'sort']

There's 45 ASCII-only strings right there, in only one built-in type, out of dozens. There are dozens, hundreds of ASCII-only strings in Python: builtin functions and classes, attributes, exceptions, internal attributes, variable names, and so on. You already know this, and yet you persist in repeating nonsense. -- Steven

3.2:
>>> timeit.timeit("r = dir(list)")
22.300465007102908
3.3:
>>> timeit.timeit("r = dir(list)")
27.13981129541519

For the record, I did not post your example to contradict you. I was expecting such a result even before testing. Now, if you do not understand why, you do not understand. There is nothing wrong. jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: FSR and unicode compliance - was Re: RE Module Performance
On Mon, Jul 29, 2013 at 12:43 PM, wxjmfa...@gmail.com wrote: On Sunday, July 28, 2013 22:52:16 UTC+2, Steven D'Aprano wrote:

3.2:
>>> timeit.timeit("r = dir(list)")
22.300465007102908
3.3:
>>> timeit.timeit("r = dir(list)")
27.13981129541519

3.2:
>>> len(dir(list))
42
3.3:
>>> len(dir(list))
45

Wonder if that might maybe have an impact on the timings. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: FSR and unicode compliance - was Re: RE Module Performance
On 29.07.2013 13:43, wxjmfa...@gmail.com wrote:

3.2:
>>> timeit.timeit("r = dir(list)")
22.300465007102908
3.3:
>>> timeit.timeit("r = dir(list)")
27.13981129541519

For the record, I did not post your example to contradict you. I was expecting such a result even before testing. Now, if you do not understand why, you do not understand. There is nothing wrong. Please give a single *proof* (not your gut feeling) that this is related to the FSR, and not rather due to other side-effects such as changes in how dir() works or (as Chris pointed out) due to more members on the list type in 3.3. If you can't or won't give that proof, there's no sense in continuing the discussion. -- --- Heiko. -- http://mail.python.org/mailman/listinfo/python-list
Re: FSR and unicode compliance - was Re: RE Module Performance
On 07/29/2013 08:06 AM, Heiko Wundram wrote: On 29.07.2013 13:43, wxjmfa...@gmail.com wrote:

3.2:
>>> timeit.timeit("r = dir(list)")
22.300465007102908
3.3:
>>> timeit.timeit("r = dir(list)")
27.13981129541519

For the record, I did not post your example to contradict you. I was expecting such a result even before testing. Now, if you do not understand why, you do not understand. There is nothing wrong. Please give a single *proof* (not your gut feeling) that this is related to the FSR, and not rather due to other side-effects such as changes in how dir() works or (as Chris pointed out) due to more members on the list type in 3.3. If you can't or won't give that proof, there's no sense in continuing the discussion. Wow! The RE Module thread I created is evolving into Unicode topics. That thread grew up so fast! DCJ -- http://mail.python.org/mailman/listinfo/python-list
Re: FSR and unicode compliance - was Re: RE Module Performance
On Monday, July 29, 2013 13:57:47 UTC+2, Chris Angelico wrote: On Mon, Jul 29, 2013 at 12:43 PM, wxjmfa...@gmail.com wrote: On Sunday, July 28, 2013 22:52:16 UTC+2, Steven D'Aprano wrote:

3.2:
>>> timeit.timeit("r = dir(list)")
22.300465007102908
3.3:
>>> timeit.timeit("r = dir(list)")
27.13981129541519

3.2:
>>> len(dir(list))
42
3.3:
>>> len(dir(list))
45

Wonder if that might maybe have an impact on the timings. ChrisA Good point. I stupidly forgot this. jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: FSR and unicode compliance - was Re: RE Module Performance
On Sunday, July 28, 2013 19:36:00 UTC+2, Terry Reedy wrote: On 7/28/2013 11:52 AM, Michael Torrie wrote: 3. UTF-8 and UTF-16 encodings, being variable width encodings, mean that slicing a string would be very very slow, Not necessarily so. See below. and that's unacceptable for the use cases of python strings. I'm assuming you understand big O notation, as you talk of experience in many languages over the years. FSR and UTF-32 both are O(1) for slicing and lookups. Slicing is at least O(m) where m is the length of the slice. UTF-8, 16 and any variable-width encoding are always O(n). I posted about a week ago, in response to Chris A., a method by which lookup for UTF-16 can be made O(log2 k), or perhaps more accurately, O(1+log2(k+1)), where k is the number of non-BMP chars in the string. This uses an auxiliary array of k ints. An auxiliary array of n ints would make UTF-16 lookup O(1), but then one is using more space than with UTF-32. Similar comments apply to UTF-8. The unicode standard says that a single string should use exactly one coding scheme. It does *not* say that all strings in an application must use the same scheme. I just rechecked a few days ago. It also does not say that an application cannot associate additional data with a string to make processing of the string easier. -- Terry Jan Reedy To my knowledge, the Unicode doc always speaks about the misc. utf* coding schemes in an exclusive-or way. Having multiple encoded strings is one thing. Manipulating multiple encoded strings is something else. Maybe the mistake was to not emphasize the fact that one has to work with a unique set of encoded code points (utf-8 or utf-16 or utf-32), because it was considered too obvious that one cannot work properly with multiple coding schemes. You are also right in saying "...application cannot associate additional data". The doc does not specify it either. It is superfluous. jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: FSR and unicode compliance - was Re: RE Module Performance
On Monday, July 29, 2013 13:57:47 UTC+2, Chris Angelico wrote: On Mon, Jul 29, 2013 at 12:43 PM, wxjmfa...@gmail.com wrote: On Sunday, July 28, 2013 22:52:16 UTC+2, Steven D'Aprano wrote:

3.2:
>>> timeit.timeit("r = dir(list)")
22.300465007102908
3.3:
>>> timeit.timeit("r = dir(list)")
27.13981129541519

3.2:
>>> len(dir(list))
42
3.3:
>>> len(dir(list))
45

Wonder if that might maybe have an impact on the timings. ChrisA

class C:
    a = 'abc'
    b = 'def'
    def aaa(self): pass
    def bbb(self): pass
    def ccc(self): pass

if __name__ == '__main__':
    import timeit
    print(timeit.timeit("r = dir(C)", setup="from __main__ import C"))

c:\python32\pythonw -u timitmod.py
15.258061416225663
Exit code: 0
c:\Python33\pythonw -u timitmod.py
17.052203122286194
Exit code: 0

jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: FSR and unicode compliance - was Re: RE Module Performance
On Mon, Jul 29, 2013 at 3:20 PM, wxjmfa...@gmail.com wrote: c:\python32\pythonw -u timitmod.py 15.258061416225663 Exit code: 0 c:\Python33\pythonw -u timitmod.py 17.052203122286194 Exit code: 0 len(dir(C)) Did you even think to check that before you posted timings? ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: FSR and unicode compliance - was Re: RE Module Performance
On Monday, July 29, 2013 16:49:34 UTC+2, Chris Angelico wrote: On Mon, Jul 29, 2013 at 3:20 PM, wxjmfa...@gmail.com wrote: c:\python32\pythonw -u timitmod.py 15.258061416225663 Exit code: 0 c:\Python33\pythonw -u timitmod.py 17.052203122286194 Exit code: 0 len(dir(C)) Did you even think to check that before you posted timings? ChrisA Boum, no! The diff is one. I have however noticed that if I increase the number of (ascii) attributes, the timing difference is very well marked. I do not draw conclusions. Such a factor for one unit... jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On 27-07-13 20:21, wxjmfa...@gmail.com wrote: Quickly: sys.getsizeof() in the light of what I explained. 1) As this FSR works with multiple encodings, it has to keep track of the encoding. It puts it in the overhead of the str class (overhead = real overhead + encoding). In such an absurd way that a

>>> sys.getsizeof('€')
40

needs 14 bytes more than a

>>> sys.getsizeof('z')
26

You may vary the length of the str; the problem is still here. Not bad for a coding scheme. 2) Take a look at this. Get rid of the overhead.

>>> sys.getsizeof('b'*100 + 'c')
126
>>> sys.getsizeof('b'*100 + '€')
240

What does it mean? It means that Python has to reencode a str every time it is necessary because it works with multiple codings. So? The same effect can be seen with other datatypes.

>>> nr = 32767
>>> sys.getsizeof(nr)
14
>>> nr += 1
>>> sys.getsizeof(nr)
16

This FSR is not even a copy of the utf-8.

>>> len(('b'*100 + '€').encode('utf-8'))
103

Why should it be? Why should a unicode string be a copy of its utf-8 encoding? That makes as much sense as expecting that a number would be a copy of its string representation. utf-8 or any (utf) never need and never spend their time in reencoding. So? That Python sometimes needs to do some kind of background processing is not a problem, whether it is garbage collection, allocating more memory, shuffling around data blocks or reencoding a string; that doesn't matter. If you've got a real-world example where one of those things noticeably slows your program down or makes the program behave faultily, then you have something that is worthy of attention. Until then you are merely harboring a pet peeve. -- Antoon Pardon -- http://mail.python.org/mailman/listinfo/python-list
FSR and unicode compliance - was Re: RE Module Performance
On 07/27/2013 12:21 PM, wxjmfa...@gmail.com wrote: Good point. FSR, nice tool for those who wish to teach Unicode. It is not every day one has such an opportunity. I had a long e-mail composed, but decided to chop it down, but it was still too long, so I ditched a lot of the context, which jmf also seems to do. Apologies. 1. FSR *is* UTF-32, so it is as unicode compliant as UTF-32, since UTF-32 is an official encoding. FSR only differs from UTF-32 in that the padding zeros are stripped off such that it is stored in the most compact form that can handle all the characters in the string, which is always known at string creation time. Now you can argue many things, but to say FSR is not unicode compliant is quite a stretch! What unicode entities or characters cannot be stored in strings using FSR? What sequences of bytes in FSR result in invalid Unicode entities? 2. Strings in Python *never change*. They are immutable. The + operator always copies strings character by character into a new string object, even if Python had used UTF-8 internally. If you're doing a lot of string concatenations, perhaps you're using the wrong data type. A byte buffer might be better for you, where you can stuff utf-8 sequences into it to your heart's content. 3. UTF-8 and UTF-16 encodings, being variable width encodings, mean that slicing a string would be very very slow, and that's unacceptable for the use cases of python strings. I'm assuming you understand big O notation, as you talk of experience in many languages over the years. FSR and UTF-32 both are O(1) for slicing and lookups. UTF-8, 16 and any variable-width encoding are always O(n). A lot slower! 4. Unicode is, well, unicode. You seem to hop all over the place from talking about code points to bytes to bits, using them all interchangeably. And now you seem to be claiming that a particular byte encoding standard is by definition unicode (UTF-8). Or at least that's how it sounds. And also claim FSR is not compliant with unicode standards, which appears to me to be completely false. Is my understanding of these things wrong? -- http://mail.python.org/mailman/listinfo/python-list
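The width selection described in point 1 can be modelled in a few lines of Python. A sketch of the idea only (the function name is made up; CPython's actual implementation is in C, per PEP 393):

def fsr_width(s):
    # Pick the narrowest fixed width that fits every character.
    m = max(ord(c) for c in s) if s else 0
    return 1 if m < 0x100 else (2 if m < 0x10000 else 4)

print(fsr_width("abc"))           # 1: latin-1 range
print(fsr_width("aü"))            # 1: ü is U+00FC, still one byte
print(fsr_width("a€"))            # 2: € is U+20AC, needs two bytes
print(fsr_width("a\U0001d11e"))   # 4: a non-BMP character

Because every character in a given string then occupies the same number of bytes, indexing and slicing stay simple pointer arithmetic, which is point 3.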
Re: FSR and unicode compliance - was Re: RE Module Performance
On Sun, Jul 28, 2013 at 4:52 PM, Michael Torrie torr...@gmail.com wrote: Is my understanding of these things wrong? No, your understanding of those matters is fine. There's just one area you seem to be misunderstanding; you appear to think that jmf actually cares about logical argument. I gave up on that theory a long time ago, and now I respond for the benefit of those reading, rather than jmf himself. I've also given up on trying to figure out what he actually wants; the nearest I can come up with is that he's King Gama-esque - that he just wants to complain. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: FSR and unicode compliance - was Re: RE Module Performance
On 7/28/2013 11:52 AM, Michael Torrie wrote: 3. UTF-8 and UTF-16 encodings, being variable width encodings, mean that slicing a string would be very very slow, Not necessarily so. See below. and that's unacceptable for the use cases of python strings. I'm assuming you understand big O notation, as you talk of experience in many languages over the years. FSR and UTF-32 both are O(1) for slicing and lookups. Slicing is at least O(m) where m is the length of the slice. UTF-8, 16 and any variable-width encoding are always O(n). I posted about a week ago, in response to Chris A., a method by which lookup for UTF-16 can be made O(log2 k), or perhaps more accurately, O(1+log2(k+1)), where k is the number of non-BMP chars in the string. This uses an auxiliary array of k ints. An auxiliary array of n ints would make UTF-16 lookup O(1), but then one is using more space than with UTF-32. Similar comments apply to UTF-8. The unicode standard says that a single string should use exactly one coding scheme. It does *not* say that all strings in an application must use the same scheme. I just rechecked a few days ago. It also does not say that an application cannot associate additional data with a string to make processing of the string easier. -- Terry Jan Reedy -- http://mail.python.org/mailman/listinfo/python-list
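Terry's auxiliary-array idea can be sketched with the bisect module. Illustrative only: the names are invented here, it assumes well-formed UTF-16, and a real implementation would live next to the string's internal buffer:

from bisect import bisect_left

def build_aux(s):
    # Character indices of the k non-BMP characters, in order.
    return [i for i, c in enumerate(s) if ord(c) > 0xFFFF]

def unit_offset(aux, char_index):
    # Every astral char before char_index contributes one extra UTF-16
    # code unit, so one binary search gives the offset: O(log2 k),
    # and effectively O(1) when k == 0.
    return char_index + bisect_left(aux, char_index)

s = "a\U0001d11eb\U0001d11ec"      # 5 characters, 7 UTF-16 code units
units = s.encode("utf-16-le")      # stand-in for the internal buffer
aux = build_aux(s)                 # [1, 3]
off = unit_offset(aux, 4)          # s[4] starts at code unit 6
print(units[2*off:2*off+2].decode("utf-16-le"))   # 'c'

The auxiliary array costs k ints; for the common all-BMP case it is empty and lookup degenerates to plain indexing.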
Re: FSR and unicode compliance - was Re: RE Module Performance
On Sun, Jul 28, 2013 at 6:36 PM, Terry Reedy tjre...@udel.edu wrote: I posted about a week ago, in response to Chris A., a method by which lookup for UTF-16 can be made O(log2 k), or perhaps more accurately, O(1+log2(k+1)), where k is the number of non-BMP chars in the string. Which is an optimization choice that favours strings containing very few non-BMP characters. To justify the extra complexity of out-of-band storage, you would need to be working with almost exclusively the BMP. That would drastically improve jmf's microbenchmarks which do exactly that, but it would penalize strings that are almost exclusively higher-codepoint characters. Its quality, then, would be based on a major survey of string usage: are there enough strings with mostly-BMP-but-a-few-SMP? Bearing in mind that pure BMP is handled better by PEP 393, so this is only of value when there are actually those mixed strings. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Sunday, July 28, 2013 05:53:22 UTC+2, Ian wrote: On Sat, Jul 27, 2013 at 12:21 PM, wxjmfa...@gmail.com wrote: Back to utf. utfs are not only elements of a unique set of encoded code points. They have an interesting feature. Each utf chunk holds intrinsically the character (in fact the code point) it is supposed to represent. In utf-32, the obvious case, it is just the code point. In utf-8, it's the first chunk which helps, and utf-16 is a mixed case (utf-8 / utf-32). In other words, in an implementation using bytes, for any pointer position it is always possible to find the corresponding encoded code point and from this the corresponding character without any programmed information. See my editor example: how to find the char under the caret? In fact, a silly example: how can the caret be positioned or moved, if the underlying corresponding encoded code point cannot be discerned! Yes, given a pointer location into a utf-8 or utf-16 string, it is easy to determine the identity of the code point at that location. But this is not often a useful operation, save for resynchronization in the case that the string data is corrupted. The caret of an editor does not conceptually correspond to a pointer location, but to a character index. Given a particular character index (e.g. 127504), an editor must be able to determine the identity and/or the memory location of the character at that index, and for UTF-8 and UTF-16 without an auxiliary data structure that is an O(n) operation. 2) Take a look at this. Get rid of the overhead.

>>> sys.getsizeof('b'*100 + 'c')
126
>>> sys.getsizeof('b'*100 + '€')
240

What does it mean? It means that Python has to reencode a str every time it is necessary because it works with multiple codings. Large strings in practical usage do not need to be resized like this often. Python 3.3 has been in production use for months now, and you still have yet to produce any real-world application code that demonstrates a performance regression. If there is no real-world regression, then there is no problem. 3) Unicode compliance. We know retrospectively that latin-1 was a bad choice. Unusable for 17 European languages. Believe it or not, 20 years of Unicode incubation is not long enough to learn it. When discussing once with a French Python core dev, one with commit access, he did not know one cannot use latin-1 for the French language! Probably because for many French strings, one can. As far as I am aware, the only characters that are missing from Latin-1 are the Euro sign (an unfortunate victim of history), the ligature œ (I have no doubt that many users just type oe anyway), and the rare capital Ÿ (the minuscule version is present in Latin-1). All French strings that are fortunate enough to be absent these characters can be represented in Latin-1 and so will have a 1-byte width in the FSR. -- latin-1? That's not even true.

>>> sys.getsizeof('a')
26
>>> sys.getsizeof('ü')
38
>>> sys.getsizeof('aa')
27
>>> sys.getsizeof('aü')
39

jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On 28 July 2013 09:45, Antoon Pardon antoon.par...@rece.vub.ac.be wrote: On 27-07-13 20:21, wxjmfa...@gmail.com wrote: utf-8 or any (utf) never need and never spend their time in reencoding. So? That Python sometimes needs to do some kind of background processing is not a problem, whether it is garbage collection, allocating more memory, shuffling around data blocks or reencoding a string; that doesn't matter. If you've got a real-world example where one of those things noticeably slows your program down or makes the program behave faultily, then you have something that is worthy of attention. Somewhat off topic, but befitting of the triviality of this thread, do I understand correctly that you are saying garbage collection never causes any noticeable slowdown in real-world circumstances? That's not remotely true. -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Sun, Jul 28, 2013 at 7:19 PM, Joshua Landau jos...@landau.ws wrote: On 28 July 2013 09:45, Antoon Pardon antoon.par...@rece.vub.ac.be wrote: On 27-07-13 20:21, wxjmfa...@gmail.com wrote: utf-8 or any (utf) never need and never spend their time in reencoding. So? That Python sometimes needs to do some kind of background processing is not a problem, whether it is garbage collection, allocating more memory, shuffling around data blocks or reencoding a string; that doesn't matter. If you've got a real-world example where one of those things noticeably slows your program down or makes the program behave faultily, then you have something that is worthy of attention. Somewhat off topic, but befitting of the triviality of this thread, do I understand correctly that you are saying garbage collection never causes any noticeable slowdown in real-world circumstances? That's not remotely true. If it's done properly, garbage collection shouldn't hurt the *overall* performance of the app; most of the issues with GC timing are when one operation gets unexpectedly delayed for a GC run (making performance measurement hard, and such). It should certainly never cause your program to behave faultily, though I have seen cases where the GC run appears to cause the program to crash - something like this:

some_string = buggy_call()
...
gc()
...
print(some_string)

The buggy call mucked up the reference count, so the gc run actually wiped the string from memory - resulting in a segfault on next usage. But the GC wasn't at fault, the original call was. (Which, btw, was quite a debugging search, especially since the function in question wasn't my code.) ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On 28/07/2013 19:13, wxjmfa...@gmail.com wrote: On Sunday, July 28, 2013 05:53:22 UTC+2, Ian wrote: On Sat, Jul 27, 2013 at 12:21 PM, wxjmfa...@gmail.com wrote: Back to utf. utfs are not only elements of a unique set of encoded code points. They have an interesting feature. Each utf chunk holds intrinsically the character (in fact the code point) it is supposed to represent. In utf-32, the obvious case, it is just the code point. In utf-8, it's the first chunk which helps, and utf-16 is a mixed case (utf-8 / utf-32). In other words, in an implementation using bytes, for any pointer position it is always possible to find the corresponding encoded code point and from this the corresponding character without any programmed information. See my editor example: how to find the char under the caret? In fact, a silly example: how can the caret be positioned or moved, if the underlying corresponding encoded code point cannot be discerned! Yes, given a pointer location into a utf-8 or utf-16 string, it is easy to determine the identity of the code point at that location. But this is not often a useful operation, save for resynchronization in the case that the string data is corrupted. The caret of an editor does not conceptually correspond to a pointer location, but to a character index. Given a particular character index (e.g. 127504), an editor must be able to determine the identity and/or the memory location of the character at that index, and for UTF-8 and UTF-16 without an auxiliary data structure that is an O(n) operation. 2) Take a look at this. Get rid of the overhead.

>>> sys.getsizeof('b'*100 + 'c')
126
>>> sys.getsizeof('b'*100 + '€')
240

What does it mean? It means that Python has to reencode a str every time it is necessary because it works with multiple codings. Large strings in practical usage do not need to be resized like this often. Python 3.3 has been in production use for months now, and you still have yet to produce any real-world application code that demonstrates a performance regression. If there is no real-world regression, then there is no problem. 3) Unicode compliance. We know retrospectively that latin-1 was a bad choice. Unusable for 17 European languages. Believe it or not, 20 years of Unicode incubation is not long enough to learn it. When discussing once with a French Python core dev, one with commit access, he did not know one cannot use latin-1 for the French language! Probably because for many French strings, one can. As far as I am aware, the only characters that are missing from Latin-1 are the Euro sign (an unfortunate victim of history), the ligature œ (I have no doubt that many users just type oe anyway), and the rare capital Ÿ (the minuscule version is present in Latin-1). All French strings that are fortunate enough to be absent these characters can be represented in Latin-1 and so will have a 1-byte width in the FSR. -- latin-1? That's not even true.

>>> sys.getsizeof('a')
26
>>> sys.getsizeof('ü')
38
>>> sys.getsizeof('aa')
27
>>> sys.getsizeof('aü')
39

>>> sys.getsizeof('aa') - sys.getsizeof('a')
1

One byte per codepoint.

>>> sys.getsizeof('üü') - sys.getsizeof('ü')
1

Also one byte per codepoint.

>>> sys.getsizeof('ü') - sys.getsizeof('a')
12

Clearly there's more going on here. FSR is an optimisation. You'll always be able to find some circumstances where an optimisation makes things worse, but what matters is the overall result. -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On 7/28/2013 2:29 PM, Chris Angelico wrote: On Sun, Jul 28, 2013 at 7:19 PM, Joshua Landau jos...@landau.ws wrote: Somewhat off topic, but befitting of the triviality of this thread, do I understand correctly that you are saying garbage collection never causes any noticeable slowdown in real-world circumstances? That's not remotely true. If it's done properly, garbage collection shouldn't hurt the *overall* performance of the app; There are situations, some discussed on this list, where doing gc 'right' means turning off the cycle garbage collector. As I remember, an example is creating a list of a million tuples, which otherwise triggers a lot of useless background bookkeeping. The cyclic gc is tuned for 'normal' use patterns. -- Terry Jan Reedy -- http://mail.python.org/mailman/listinfo/python-list
Re: FSR and unicode compliance - was Re: RE Module Performance
Le dimanche 28 juillet 2013 17:52:47 UTC+2, Michael Torrie a écrit : On 07/27/2013 12:21 PM, wxjmfa...@gmail.com wrote: Good point. FSR, nice tool for those who wish to teach Unicode. It is not every day one has such an opportunity. I had a long e-mail composed, but decided to chop it down, but it was still too long, so I ditched a lot of the context, which jmf also seems to do. Apologies. 1. FSR *is* UTF-32 so it is as unicode compliant as UTF-32, since UTF-32 is an official encoding. FSR only differs from UTF-32 in that the padding zeros are stripped off such that it is stored in the most compact form that can handle all the characters in the string, which is always known at string creation time. Now you can argue many things, but to say FSR is not unicode compliant is quite a stretch! What unicode entities or characters cannot be stored in strings using FSR? What sequences of bytes in FSR result in invalid Unicode entities? 2. strings in Python *never change*. They are immutable. The + operator always copies strings character by character into a new string object, even if Python had used UTF-8 internally. If you're doing a lot of string concatenations, perhaps you're using the wrong data type. A byte buffer might be better for you, where you can stuff utf-8 sequences into it to your heart's content. 3. UTF-8 and UTF-16 encodings, being variable width encodings, mean that slicing a string would be very very slow, and that's unacceptable for the use cases of python strings. I'm assuming you understand big O notation, as you talk of experience in many languages over the years. FSR and UTF-32 both are O(1) for slicing and lookups. UTF-8, 16 and any variable-width encoding are always O(n). A lot slower! 4. Unicode is, well, unicode. You seem to hop all over the place from talking about code points to bytes to bits, using them all interchangeably. And now you seem to be claiming that a particular byte encoding standard is by definition unicode (UTF-8). Or at least that's how it sounds. And also claim FSR is not compliant with unicode standards, which appears to me to be completely false. Is my understanding of these things wrong? -- Compare these (a BDFL example, where I'm using a non-ascii char) Py 3.2 (narrow build) timeit.timeit("a = 'hundred'; 'x' in a") 0.09897159682121348 timeit.timeit("a = 'hundre€'; 'x' in a") 0.09079501961732461 sys.getsizeof('d') 32 sys.getsizeof('€') 32 sys.getsizeof('dd') 34 sys.getsizeof('d€') 34 Py3.3 timeit.timeit("a = 'hundred'; 'x' in a") 0.12183182740848858 timeit.timeit("a = 'hundre€'; 'x' in a") 0.2365732969632326 sys.getsizeof('d') 26 sys.getsizeof('€') 40 sys.getsizeof('dd') 27 sys.getsizeof('d€') 42 Tell me which one seems to be more unicode compliant? The goal of Unicode is to handle every char equally. Now, the problem: memory. Do not forget that the à la FSR mechanism for a non-ascii user is *irrelevant*. As soon as one uses one single non-ascii, your ascii feature is lost. (That's why we have all these dedicated coding schemes, utfs included). sys.getsizeof('abc' * 1000 + 'z') 3026 sys.getsizeof('abc' * 1000 + '\U00010010') 12044 A bit secret. The larger a repertoire of characters is, the more bits you need. Secret #2. You can not escape from this. jmf -- http://mail.python.org/mailman/listinfo/python-list
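Torrie's point 3 can be made concrete. A minimal sketch of why indexing into a UTF-8 buffer is O(n) (the helper is hypothetical, written for illustration): the byte offset of code point i can only be found by scanning lead bytes from the start, whereas a fixed-width layout computes it as i times the width.

    def utf8_index(buf, i):
        # Byte offset of code point i in a UTF-8 bytes object.
        # Every byte except continuations (0b10xxxxxx) starts a new
        # character, so we must walk from the beginning: O(n).
        seen = 0
        for offset, byte in enumerate(buf):
            if byte & 0xC0 != 0x80:      # not a continuation byte
                if seen == i:
                    return offset
                seen += 1
        raise IndexError(i)

    buf = 'héllo€'.encode('utf-8')
    print(utf8_index(buf, 5))   # 6: the euro sign starts at byte 6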
Re: RE Module Performance
Le dimanche 28 juillet 2013 21:04:56 UTC+2, MRAB a écrit : [snip] FSR is an optimisation. You'll always be able to find some circumstances where an optimisation makes things worse, but what matters is the overall result. Yes, I know my examples are always wrong, never real examples. 
If I point at long strings, I should point at short strings. If I point at a short string (a char), it is not long enough. Strings as dict keys? No, the problem is in the Python dict. Performance? No, that's a memory issue. Memory? No, it's a question of keeping performance. I am using this char; no, you should not, it's not common. The nabla operator in a TeX file: who is so stupid as to use that char? Many times, I'm just mimicking 'BDFL' examples, just by replacing his ascii chars by non-ascii chars ;-) And so on. To be short, it is *never* the FSR, always something else. Suggestion. Start by solving all these micro-benchmarks, all the memory cases. It's a good start, no? jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: FSR and unicode compliance - was Re: RE Module Performance
On 28/07/2013 20:23, wxjmfa...@gmail.com wrote: [snip] Compare these (a BDFL example, where I'm using a non-ascii char) Py 3.2 (narrow build) Why are you using a narrow build of Python 3.2? It doesn't treat all codepoints equally (those outside the BMP can't be stored in one code unit) and, therefore, it isn't Unicode compliant! [snip] -- http://mail.python.org/mailman/listinfo/python-list
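For concreteness, a check one could run to see the narrow-build behaviour described above (the output comments contrast a 3.2 narrow build with 3.3):

    import sys

    s = '\U0001d11e'        # MUSICAL SYMBOL G CLEF, outside the BMP
    print(sys.maxunicode)   # 65535 (0xFFFF) on a narrow build, 1114111 on 3.3
    print(len(s))           # 2 on a narrow build (a surrogate pair), 1 on 3.3
    print(repr(s[0]))       # a lone surrogate on a narrow build; the whole char on 3.3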
Re: FSR and unicode compliance - was Re: RE Module Performance
Op 28-07-13 21:23, wxjmfa...@gmail.com schreef: [snip] Tell me which one seems to be more unicode compliant? Can't tell; you give no relevant information on which one could decide this question. The goal of Unicode is to handle every char equally. Not to this kind of detail, which is looking at irrelevant implementation details. Now, the problem: memory. Do not forget that the à la FSR mechanism for a non-ascii user is *irrelevant*. As soon as one uses one single non-ascii, your ascii feature is lost. (That's why we have all these dedicated coding schemes, utfs included). So? Why should that trouble me? As far as I understand, whether I have an ascii string or not is totally irrelevant to the application programmer. 
Within the application I just process strings and let the programming environment keep track of these details in a transparent way, unless you start looking at things like getsizeof, which gives you implementation details that are mostly irrelevant in deciding whether the behaviour is compliant or not. sys.getsizeof('abc' * 1000 + 'z') 3026 sys.getsizeof('abc' * 1000 + '\U00010010') 12044 A bit secret. The larger a repertoire of characters is, the more bits you need. Secret #2. You can not escape from this. And totally unimportant for deciding compliance. -- Antoon Pardon -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
wxjmfa...@gmail.com writes: Suggestion. Start by solving all these micro-benchmarks, all the memory cases. It's a good start, no? Since you seem to be the only one who has this dramatic problem with such micro-benchmarks, which BTW have nothing to do with unicode compliance, I'd suggest *you* should find a better implementation and propose it to the core devs. An even better suggestion, with due respect, is to get a life and find something more interesting to do, or at least better arguments :-) ciao, lele. --
nickname: Lele Gaifax | When I live off what I thought yesterday,
real: Emanuele Gaifas | I will begin to fear those who copy me.
l...@metapensiero.it | -- Fortunato Depero, 1929.
-- http://mail.python.org/mailman/listinfo/python-list
Re: FSR and unicode compliance - was Re: RE Module Performance
On Sun, 28 Jul 2013 12:23:04 -0700, wxjmfauth wrote: Do not forget that the à la FSR mechanism for a non-ascii user is *irrelevant*. You have been told repeatedly, Python's internals are *full* of ASCII-only strings. py dir(list) ['__add__', '__class__', '__contains__', '__delattr__', '__delitem__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__iadd__', '__imul__', '__init__', '__iter__', '__le__', '__len__', '__lt__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__reversed__', '__rmul__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', 'append', 'clear', 'copy', 'count', 'extend', 'index', 'insert', 'pop', 'remove', 'reverse', 'sort'] There are 45 ASCII-only strings right there, in only one built-in type, out of dozens. There are dozens, hundreds of ASCII-only strings in Python: builtin functions and classes, attributes, exceptions, internal attributes, variable names, and so on. You already know this, and yet you persist in repeating nonsense. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
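A rough way to put a number on that claim (a quick count of my own devising; exact totals vary by Python version):

    # Count ASCII-only attribute names across all built-in types.
    import builtins

    names = set()
    for obj in vars(builtins).values():
        if isinstance(obj, type):
            names.update(dir(obj))

    ascii_only = [n for n in names if all(ord(c) < 128 for c in n)]
    print(len(ascii_only), "of", len(names), "builtin attribute names are pure ASCII")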
Re: RE Module Performance
On 28 July 2013 19:29, Chris Angelico ros...@gmail.com wrote: On Sun, Jul 28, 2013 at 7:19 PM, Joshua Landau jos...@landau.ws wrote: On 28 July 2013 09:45, Antoon Pardon antoon.par...@rece.vub.ac.be wrote: Op 27-07-13 20:21, wxjmfa...@gmail.com schreef: utf-8 or any (utf) never need and never spend their time in reencoding. So? That python sometimes needs to do some kind of background processing is not a problem, whether it is garbage collection, allocating more memory, shuffling around data blocks or reencoding a string, that doesn't matter. If you've got a real world example where one of those things noticeably slows your program down or makes the program behave faultily then you have something that is worthy of attention. Somewhat off topic, but befitting of the triviality of this thread, do I understand correctly that you are saying garbage collection never causes any noticeable slowdown in real-world circumstances? That's not remotely true. If it's done properly, garbage collection shouldn't hurt the *overall* performance of the app; most of the issues with GC timing are when one operation gets unexpectedly delayed for a GC run (making performance measurement hard, and such). It should certainly never cause your program to behave faultily, though I have seen cases where the GC run appears to cause the program to crash - something like this:

    some_string = buggy_call()
    ...
    gc()
    ...
    print(some_string)

The buggy call mucked up the reference count, so the gc run actually wiped the string from memory - resulting in a segfault on next usage. But the GC wasn't at fault, the original call was. (Which, btw, was quite a debugging search, especially since the function in question wasn't my code.) GC does sometimes have severe impact in memory-constrained environments, though. See http://sealedabstract.com/rants/why-mobile-web-apps-are-slow/, about half-way down, specifically http://sealedabstract.com/wp-content/uploads/2013/05/Screen-Shot-2013-05-14-at-10.15.29-PM.png . The best verification of these graphs I could find was https://blog.mozilla.org/nnethercote/category/garbage-collection/, although it's not immediately clear in Chrome's and Opera's case mainly due to none of the benchmarks pushing memory usage significantly. I also don't quite agree with the first post (sealedabstract) because I get by *fine* on 2GB memory, so I don't see why you can't on a phone. Maybe iOS is just really heavy. Nonetheless, the benchmarks aren't lying. -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Fri, 26 Jul 2013 08:46:58 -0700, wxjmfauth wrote: BTW, I'm pleased to read sequences of bits and not bytes. Again, utf transformers are producing sequences of bits, called Unicode Transformation Units, with lengths of 8/16/32 *bits*, from there the names utf8/16/32. UCS transformers are (were) producing bytes, from there the names ucs-2/4. Not only does your distinction between bits and bytes make no practical difference on nearly all hardware in common use today[1], but the Unicode Consortium disagrees with you, and defines UTF in terms of bytes: A Unicode transformation format (UTF) is an algorithmic mapping from every Unicode code point (except surrogate code points) to a unique byte sequence. http://www.unicode.org/faq/utf_bom.html#gen2 [1] There may still be some old supercomputers where a byte is more than 8 bits in use, but they're unlikely to support Unicode. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
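The definition quoted above, made concrete with the standard library (the sample characters are mine): each code point maps to exactly one byte sequence under a given UTF.

    for ch in 'aé€\U0001d11e':
        print('U+{:06X} -> {}'.format(ord(ch), list(ch.encode('utf-8'))))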
Re: RE Module Performance
Le samedi 27 juillet 2013 04:05:03 UTC+2, Michael Torrie a écrit : [snip] Hmm, so if python used utf-8 internally to represent unicode strings, wouldn't that punish *all* users (not just non-ascii users) since searching a string for a certain character position requires an O(n) operation? UTF-32 I could see (and indeed that's essentially what FSR uses when necessary, does it not?), but not utf-8 or utf-16. -- Did you read my previous link? Unicode Character Encoding Model. Did you understand it? Unicode only - No FSR (I skip some points and still attempt to be correct.) Unicode is a four-step process. [ {unique set of characters} --> {unique set of code points, the labels} --> {unique set of encoded code points} ] --> implementation (bytes) First point to notice: pure unicode, [...], is different from the implementation. *This is a deliberate choice*. The critical step is the path {unique set of characters} --> {unique set of encoded code points}, in such a way that the implementation can work comfortably with this *unique* set of encoded code points. Conceptually, the implementation works with a unique set of already prepared encoded code points. This is a very critical step. To explain it in a dirty way: in the above chain, this problem is already eliminated and solved. Like the byte/char coding schemes, where this step is a no-op. Now, if you wish, this is a separate/different problem. To create this unique set of encoded code points, Unicode uses these utf(s). I repeat again, a confusing name, for the process and the result of the process. (I neglect ucs). What are these? Chunks of bits, groups of 8/16/32 bits, words. It is up to the implementation to convert these sequences of bits into bytes, ***if you wish to convert these into bytes!***. Surprise! Why not put two of the 32-bit words in a 64-bit machine? (see golang / rune / int32). Back to utf. utfs are not only elements of a unique set of encoded code points. They have an interesting feature. Each utf chunk intrinsically holds the character (in fact the code point) it is supposed to represent. In utf-32, the obvious case, it is just the code point. In utf-8, that's the first chunk which helps, and utf-16 is a mixed case (utf-8 / utf-32). In other words, in an implementation using bytes, for any pointer position it is always possible to find the corresponding encoded code point and from this the corresponding character, without any programmed information. See my editor example: how to find the char under the caret? In fact, a silly example: how can the caret be positioned or moved, if the underlying corresponding encoded code point cannot be discerned! Next step, and another separate problem. Why all these utf versions? It is always the same story. Some prefer the universality (utf-32) and some prefer, well, some kind of conservatism. utf-8 is more complicated, it demands more work and logically, in an expected way, some performance regression. utf-8 is more suited to producing bytes, utf16/32 to internal processing. utf-8 had no choice but to lose the indexing. And so on. Fact: all these coding schemes are working with a unique set of encoded code points (surprise again, it's like a byte string!). 
The loss of performance of utf-8 is very minimal compared to the loss of performance one can get with a multiple coding scheme. This kind of work has been done, and if my information is correct, even by the creators of utf-8. (There are sometimes good scientists). There are plenty of advantages in using utf instead of something else, and advantages in other fields than just the pure coding. utf-16/32 schemes have the advantage of ditching ascii for ever. The ascii concept no longer exists. One should also understand that all this stuff has not been created from scratch. It was a balance between existing technologies. MS stuck with the idea, no more ascii, let's use ucs-2, and the *x world broke unicode adoption as much as possible. utf-8 is one of the compromises for the adoption of Unicode. Retrospectively, a not so good compromise. Computer scientists are funny scientists. They do love to solve the problems they created themselves. - Quickly. sys.getsizeof() in the light of what I explained. 1) As this FSR works with multiple encodings, it has to keep track of the encoding. It puts it in the overhead of the str class (overhead = real overhead + encoding). In such an absurd way that a sys.getsizeof('€') 40 needs 14 bytes more than a sys.getsizeof('z') 26 You may vary the length of the str. The problem is still here. Not bad for a coding scheme. 2) Take a look at this. Get rid of the overhead. sys.getsizeof('b'*100 + 'c') 126 sys.getsizeof('b'*100 + '€') 240 What does it mean? It means that Python has to reencode a str every time it is necessary because it works with multiple codings.
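The overhead claim in point 1 above can be probed without guessing at the layout. A hedged sketch (the byte counts are CPython 3.3 implementation details, not language guarantees): doubling a string isolates the per-character width, and subtracting it from the total leaves the fixed header, which is indeed larger for non-ASCII strings.

    import sys

    def header_bytes(ch):
        # Fixed per-string overhead: total size minus one character's width.
        width = sys.getsizeof(ch * 2) - sys.getsizeof(ch)
        return sys.getsizeof(ch) - width

    print(header_bytes('z'))        # the compact ASCII header
    print(header_bytes('\u20ac'))   # the larger non-compact header

On that reading, the 14-byte gap between 'z' and '€' is mostly header growth, plus one extra byte of character width.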
Re: RE Module Performance
On Sat, Jul 27, 2013 at 12:21 PM, wxjmfa...@gmail.com wrote: Back to utf. utfs are not only elements of a unique set of encoded code points. They have an interesting feature. Each utf chunk intrinsically holds the character (in fact the code point) it is supposed to represent. In utf-32, the obvious case, it is just the code point. In utf-8, that's the first chunk which helps, and utf-16 is a mixed case (utf-8 / utf-32). In other words, in an implementation using bytes, for any pointer position it is always possible to find the corresponding encoded code point and from this the corresponding character, without any programmed information. See my editor example: how to find the char under the caret? In fact, a silly example: how can the caret be positioned or moved, if the underlying corresponding encoded code point cannot be discerned! Yes, given a pointer location into a utf-8 or utf-16 string, it is easy to determine the identity of the code point at that location. But this is not often a useful operation, save for resynchronization in the case that the string data is corrupted. The caret of an editor does not conceptually correspond to a pointer location, but to a character index. Given a particular character index (e.g. 127504), an editor must be able to determine the identity and/or the memory location of the character at that index, and for UTF-8 and UTF-16, without an auxiliary data structure, that is an O(n) operation. 2) Take a look at this. Get rid of the overhead. sys.getsizeof('b'*100 + 'c') 126 sys.getsizeof('b'*100 + '€') 240 What does it mean? It means that Python has to reencode a str every time it is necessary because it works with multiple codings. Large strings in practical usage do not need to be resized like this often. Python 3.3 has been in production use for months now, and you still have yet to produce any real-world application code that demonstrates a performance regression. If there is no real-world regression, then there is no problem. 3) Unicode compliance. We know retrospectively, latin-1, it was a bad choice. Unusable for 17 European languages. Believe it or not, 20 years of Unicode incubation is not long enough to learn it. When discussing once with a French Python core dev, one with commit access, he did not know one can not use latin-1 for the French language! Probably because for many French strings, one can. As far as I am aware, the only characters that are missing from Latin-1 are the Euro sign (an unfortunate victim of history), the ligature œ (I have no doubt that many users just type oe anyway), and the rare capital Ÿ (the minuscule version is present in Latin-1). All French strings that are fortunate enough to be absent these characters can be represented in Latin-1 and so will have a 1-byte width in the FSR. -- http://mail.python.org/mailman/listinfo/python-list
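The Latin-1 claim above is easy to check for any given French string (the sample phrases are mine, not from the thread):

    def fits_latin1(s):
        # True if every character can be encoded in Latin-1, i.e. the
        # string gets 1-byte-per-char storage under the FSR.
        try:
            s.encode('latin-1')
            return True
        except UnicodeEncodeError:
            return False

    print(fits_latin1('déjà vu'))        # True
    print(fits_latin1('un café à 3 €'))  # False: the euro sign
    print(fits_latin1('un bœuf'))        # False: the ligature œ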
Re: RE Module Performance
Le jeudi 25 juillet 2013 22:45:38 UTC+2, Ian a écrit : On Thu, Jul 25, 2013 at 12:18 PM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: On Fri, 26 Jul 2013 01:36:07 +1000, Chris Angelico wrote: On Fri, Jul 26, 2013 at 1:26 AM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: On Thu, 25 Jul 2013 14:36:25 +0100, Jeremy Sanders wrote: To conserve memory, Emacs does not hold fixed-length 22-bit numbers that are codepoints of text characters within buffers and strings. Rather, Emacs uses a variable-length internal representation of characters, that stores each character as a sequence of 1 to 5 8-bit bytes, depending on the magnitude of its codepoint[1]. For example, any ASCII character takes up only 1 byte, a Latin-1 character takes up 2 bytes, etc. We call this representation of text multibyte. Well, you've just proven what Vim users have always suspected: Emacs doesn't really exist. ... lolwut? JMF has explained that it is impossible, impossible I say!, to write an editor using a flexible string representation. Since Emacs uses such a flexible string representation, Emacs is impossible, and therefore Emacs doesn't exist. QED. Except that the described representation used by Emacs is a variant of UTF-8, not an FSR. It doesn't have three different possible encodings for the letter 'a' depending on what other characters happen to be in the string. As I understand it, jmf would be perfectly happy if Python used UTF-8 (or presumably the Emacs variant) as its internal string representation. -- And emacs is probably working smoothly. Your comment summarized all this stuff very correctly and very succinctly. utf8/16/32? I do not care. They are all working correctly, smoothly and efficiently. In fact, these utf's already do correctly what this FSR does in a wrong way. My preference? utf32. Why? It is the simplest and consequently the best-performing choice. I'm not a narrow-minded ascii user. (I do not pretend to belong to those who are solving the quadrature of the circle, I pretend to belong to those who know that the quadrature of the circle is not solvable). Note: text processing tools, or tools that have to process characters, and the tools to build these tools, are all moving to utf32, if not already done. There are technical reasons behind this, which go beyond pure raw unicode. They are however still 100% Unicode compliant. jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
Le vendredi 26 juillet 2013 05:09:34 UTC+2, Michael Torrie a écrit : On 07/25/2013 11:18 AM, Steven D'Aprano wrote: JMF has explained that it is impossible, impossible I say!, to write an editor using a flexible string representation. Since Emacs uses such a flexible string representation, Emacs is impossible, and therefore Emacs doesn't exist. Now I'm even more confused. He once pointed to Go as an example of how unicode should be done in a language, yet Go uses UTF-8 I think. But I don't think UTF-8 is what JMF refers to as flexible string representation. FSR does use 1, 2 or 4 bytes per character, but each character in the string uses the same width. That's different from UTF-8 or UTF-16, which are variable width per character. - sys.getsizeof('––') - sys.getsizeof('–') I have already explained / commented this. Hint: To understand Unicode (and every coding scheme), you should understand utf. The how and the *why*. jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
Le vendredi 26 juillet 2013 05:20:45 UTC+2, Ian a écrit : On Thu, Jul 25, 2013 at 8:48 PM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: UTF-8 uses a flexible representation on a character-by-character basis. When parsing UTF-8, one needs to look at EVERY character to decide how many bytes you need to read. In Python 3, the flexible representation is on a string-by-string basis: once Python has looked at the string header, it can tell whether the *entire* string takes 1, 2 or 4 bytes per character, and the string is then fixed-width. You can't do that with UTF-8. UTF-8 does not use a flexible representation. A codec that is encoding a string in UTF-8 and examining a particular character does not have any choice of how to encode that character; there is exactly one sequence of bits that is the UTF-8 encoding for the character. Further, for any given sequence of code points there is exactly one sequence of bytes that is the UTF-8 encoding of those code points. In contrast, with the FSR there are as many as three different sequences of bytes that encode a sequence of code points, with one of them (the shortest) being canonical. That's what makes it flexible. Anyway, my point was just that Emacs is not a counter-example to jmf's claim about implementing text editors, because UTF-8 is not what he (or anybody else) is referring to when speaking of the FSR or something like the FSR. BTW, it is not necessary to use an endorsed Unicode coding scheme (utf*), a string literal would have been possible, but then one falls on memory issues. All these utf's follow the basic coding scheme. I repeat again. A coding scheme works with a unique set of characters, and its implementation works with a unique set of encoded code points (the utf's, in the case of Unicode). And again, that's why we live today with all these coding schemes, or, to take the problem from the other side, it's because one has to work with a unique set of encoded code points that all these coding schemes had to be created. utf's have not been created by newbies ;-) jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
Le vendredi 26 juillet 2013 05:20:45 UTC+2, Ian a écrit : [snip] Anyway, my point was just that Emacs is not a counter-example to jmf's claim about implementing text editors, because UTF-8 is not what he (or anybody else) is referring to when speaking of the FSR or something like the FSR. - Let's be clear. I'm perfectly understanding what utf-8 is, and it's for that precise reason that I put the editor as an example on the table. This FSR is not *a* coding scheme. It is more a composite coding scheme. (And from there, all the problems). BTW, I'm pleased to read sequences of bits and not bytes. Again, utf transformers are producing sequences of bits, called Unicode Transformation Units, with lengths of 8/16/32 *bits*, from there the names utf8/16/32. UCS transformers are (were) producing bytes, from there the names ucs-2/4. jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On 07/26/2013 07:21 AM, wxjmfa...@gmail.com wrote: sys.getsizeof('––') - sys.getsizeof('–') I have already explained / commented this. Maybe it got lost in translation, but I don't understand your point with that. Hint: To understand Unicode (and every coding scheme), you should understand utf. The how and the *why*. Hmm, so if python used utf-8 internally to represent unicode strings, wouldn't that punish *all* users (not just non-ascii users) since searching a string for a certain character position requires an O(n) operation? UTF-32 I could see (and indeed that's essentially what FSR uses when necessary, does it not?), but not utf-8 or utf-16. -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Thu, 25 Jul 2013 21:20:45 -0600, Ian Kelly wrote: [snip] UTF-8 does not use a flexible representation. I disagree, and so does Jeremy Sanders, who first pointed out the similarity between Emacs' UTF-8 and Python's FSR. I'll quote from the Emacs documentation again: To conserve memory, Emacs does not hold fixed-length 22-bit numbers that are codepoints of text characters within buffers and strings. Rather, Emacs uses a variable-length internal representation of characters, that stores each character as a sequence of 1 to 5 8-bit bytes, depending on the magnitude of its codepoint. For example, any ASCII character takes up only 1 byte, a Latin-1 character takes up 2 bytes, etc. And the Python FSR: To conserve memory, Python does not hold fixed-length 21-bit numbers that are codepoints of text characters within buffers and strings. Rather, Python uses a variable-length internal representation of characters, that stores each character as a sequence of 1 to 4 8-bit bytes, depending on the magnitude of the largest codepoint in the string. For example, any all-ASCII or all-Latin1 string takes up only 1 byte per character, an all-BMP string takes up 2 bytes per character, etc. See the similarity now? Both flexibly change the width used by code-points, UTF-8 based on the code-point itself regardless of the rest of the string, Python based on the largest code-point in the string. [...] Anyway, my point was just that Emacs is not a counter-example to jmf's claim about implementing text editors, because UTF-8 is not what he (or anybody else) is referring to when speaking of the FSR or something like the FSR. Whether JMF can see the similarities between different implementations of strings or not is beside the point, those similarities do exist. As do the differences, of course, but in this case the differences are in favour of Python's FSR. Even if your string is entirely Latin1, a UTF-8 implementation *cannot know that*, and still has to walk the string byte-by-byte checking whether the current code point requires 1, 2, 3, or 4 bytes, while a FSR implementation can simply record the fact that the string is pure Latin1 at creation time, and then treat it as fixed-width from then on. JMF claims that FSR is impossible to use efficiently, and yet he supports encoding schemes which are *less* efficient. Go figure. He tells us he has no problem with any of the established UTF encodings, and yet the FSR internally uses UTF-16 and UTF-32. (Technically, it's UCS-2, not UTF-16, since there are no surrogate pairs. But the difference is insignificant.) Having watched this issue from Day One when JMF first complained about it, I believe this is entirely about denying any benefit to ASCII users. 
Had Python implemented a system identical to the current FSR except that it added a fourth category, all ASCII, which used an eight-byte encoding scheme (thus making ASCII strings twice as expensive as strings including code points from the Supplementary Multilingual Planes), JMF would be the scheme's number one champion. I cannot see any other rational explanation for why JMF prefers broken, buggy Unicode implementations, or implementations which are equally expensive for all strings, over one which is demonstrably correct, demonstrably saves memory, and for realistic, non-contrived benchmarks, demonstrably faster, except that he wants to punish ASCII users more than he wants to support Unicode users. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Fri, Jul 26, 2013 at 9:37 PM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: See the similarity now? Both flexibly change the width used by code- points, UTF-8 based on the code-point itself regardless of the rest of the string, Python based on the largest code-point in the string. No, I think we're just using the word flexible differently. In my view, simply being variable-width does not make an encoding flexible in the sense of the FSR. But I'm not going to keep repeating myself in order to argue about it. Having watched this issue from Day One when JMF first complained about it, I believe this is entirely about denying any benefit to ASCII users. Had Python implemented a system identical to the current FSR except that it added a fourth category, all ASCII, which used an eight-byte encoding scheme (thus making ASCII strings twice as expensive as strings including code points from the Supplementary Multilingual Planes), JMF would be the scheme's number one champion. I agree. In fact I made a similar observation back in December: http://mail.python.org/pipermail/python-list/2012-December/636942.html -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Fri, 26 Jul 2013 22:12:36 -0600, Ian Kelly wrote: On Fri, Jul 26, 2013 at 9:37 PM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: See the similarity now? Both flexibly change the width used by code- points, UTF-8 based on the code-point itself regardless of the rest of the string, Python based on the largest code-point in the string. No, I think we're just using the word flexible differently. In my view, simply being variable-width does not make an encoding flexible in the sense of the FSR. But I'm not going to keep repeating myself in order to argue about it. But I paid for the full half hour! http://en.wikipedia.org/wiki/The_Argument_Sketch -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Wed, 24 Jul 2013 09:00:39 -0600, Michael Torrie wrote about JMF: His most recent argument that Python should use UTF as a representation is very strange, to be honest. He's not arguing for anything, he is just hating on anything that gives even the tiniest benefit to ASCII users. This isn't about Python 3.3 hurting non-ASCII users, because that is demonstrably untrue: they are *better off* in Python 3.3. This is about denying even a tiny benefit to ASCII users. In Python 3.3, non-ASCII users have these advantages compared to previous versions:
- strings will usually take less memory, and aside from trivial changes to the object header, they never take more memory than a wide build would use;
- consequently nearly all objects will take less memory (especially builtins and standard library objects, which are all ASCII), since objects contain dozens of internal strings (attribute and method names in __dict__, class name, etc.);
- consequently whole-application benchmarks show most applications will use significantly less memory, which leads to faster speeds;
- you cannot break surrogate pairs apart by accident, which you can do in narrow builds;
- in previous versions, code which works when run in a wide build may fail in a narrow build, but that is no longer an issue since the distinction between wide and narrow builds is gone;
- Latin1 users, which includes JMF himself, will likewise see memory savings, since Latin1 strings will take half the size of narrow builds and a quarter the size of wide builds.
The cost of all these benefits is a small overhead when creating a string in the first place, and some purely internal added complication to the string implementation. I'm the first to argue against complication unless there is a corresponding benefit. This is a case where the benefit has proven itself doubly: Python 3.3's Unicode implementation is *more correct* than before, and it uses less memory to do so. The cons of UTF are apparent and widely known. The main con is that UTF strings are O(n) for indexing a position within the string. Not so for UTF-32. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Thu, Jul 25, 2013 at 3:49 PM, Serhiy Storchaka storch...@gmail.com wrote: 24.07.13 21:15, Chris Angelico написав(ла): To my mind, exposing UTF-16 surrogates to the application is a bug to be fixed, not a feature to be maintained. Python 3 uses code points from U+DC80 to U+DCFF (which are in surrogates area) to represent undecodable bytes with surrogateescape error handler. That's a deliberate and conscious use of the codepoints; that's not what I'm talking about here. Suppose you read a UTF-8 stream of bytes from a file, and decode them into your language's standard string type. At this point, you should be working with a string of Unicode codepoints: \22\341\210\264\360\222\215\205 -- \x12\u1234\U00012345 The incoming byte stream has a length of 8, the resulting character stream has a length of 3. Now, if the language wants to use UTF-16 internally, it's free to do so: 0012 1234 d808 df45 When I referred to exposing surrogates to the application, this is what I'm talking about. If decoding the above byte stream results in a length 4 string where the last two are \xd808 and \xdf45, then it's exposing them. If it's a length 3 string where the last is \U00012345, then it's hiding them. To be honest, I don't imagine I'll ever see a language that stores strings in UTF-16 and then exposes them to the application as UTF-32; there's very little point. But such *is* possible, and if it's working closely with libraries that demand UTF-16, it might well make sense to do things that way. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
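Chris's byte-stream example, in runnable form (Python 3.3+, where the result is the same on every build): eight UTF-8 bytes decode to three code points, and the surrogateescape handler Serhiy mentions smuggles undecodable bytes through as lone surrogates.

    data = bytes([0o22, 0o341, 0o210, 0o264, 0o360, 0o222, 0o215, 0o205])
    s = data.decode('utf-8')
    print(len(data), len(s))              # 8 3
    print(s == '\x12\u1234\U00012345')    # True

    raw = b'abc\xff'                      # 0xFF is invalid UTF-8
    t = raw.decode('utf-8', errors='surrogateescape')
    print(repr(t))                        # 'abc\udcff': the byte became U+DCFF
    print(t.encode('utf-8', errors='surrogateescape') == raw)  # True, round-trips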
Re: RE Module Performance
On Thu, 25 Jul 2013 00:34:24 +1000, Chris Angelico wrote: But mainly, I'm just wondering how many people here have any basis from which to argue the point he's trying to make. I doubt most of us have (a) implemented an editor widget, or (b) tested multiple different internal representations to learn the true pros and cons of each. And even if any of us had, that still wouldn't have any bearing on PEP 393, which is about applications, not editor widgets. As stated above, Python strings before AND after PEP 393 are poor choices for an editor, ergo arguing from that standpoint is pretty useless. That's a misleading way to put it. Using immutable strings as editor buffers might be a bad way to implement all but the most trivial, low-performance (i.e. slow) editor, but the basic concept of PEP 393, picking an internal representation of the text based on its contents, is not. That's just normal. The only difference with PEP 393 is that the choice is made on the fly, at runtime, instead of decided in advance by the programmer. I expect that the PEP 393 concept of optimizing memory per string buffer would work well in an editor. However the internal buffer is arranged, you can safely assume that each chunk of text (word, sentence, paragraph, buffer...) will very rarely shift from all Latin-1 to all BMP to including SMP chars. So, for example, entering an SMP character will need to immediately up-cast the chunk from 1 byte per char to 4 bytes per char, which is relatively pricey, but it's a one-off cost. Down-casting when the SMP character is deleted doesn't need to be done immediately, it can be performed when the application is idle. If the chunks are relatively small (say, a paragraph rather than multiple pages of text) then even that initial conversion will be invisible. A fast touch typist hits a key about every 0.1 of a second; if it takes a millisecond to convert the chunk, you wouldn't even notice the delay. You can copy and up-cast a lot of bytes in a millisecond. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
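A toy sketch of that per-chunk idea (my illustration, not Steven's code): store each chunk in the narrowest array that fits its widest character, and up-cast the whole chunk, once, when a wider character arrives.

    from array import array

    def best_typecode(codepoint):
        # Narrowest unsigned array type that can hold this code point;
        # conveniently 'B' < 'H' < 'L' also sorts by width.
        return 'B' if codepoint < 0x100 else 'H' if codepoint < 0x10000 else 'L'

    def insert_char(chunk, pos, ch):
        # Up-cast the whole chunk (a one-off O(len) copy) if ch does not fit.
        tc = best_typecode(ord(ch))
        if tc > chunk.typecode:
            chunk = array(tc, chunk)
        chunk.insert(pos, ord(ch))
        return chunk

    chunk = array('B', b'hello')                  # all Latin-1: 1 byte per char
    chunk = insert_char(chunk, 5, '\U0001d11e')   # up-casts the chunk once
    print(chunk.typecode, len(chunk))             # L 6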
Re: RE Module Performance
On Thu, 25 Jul 2013 04:15:42 +1000, Chris Angelico wrote: If nobody had ever thought of doing a multi-format string representation, I could well imagine the Python core devs debating whether the cost of UTF-32 strings is worth the correctness and consistency improvements... and most likely concluding that narrow builds get abolished. And if any other language (eg ECMAScript) decides to move from UTF-16 to UTF-32, I would wholeheartedly support the move, even if it broke code to do so. Unfortunately, so long as most language designers are European-centric, there is going to be a lot of push-back against any attempt to fix (say) Javascript, or Java just for the sake of a bunch of dead languages in the SMPs. Thank goodness for emoji. Wait til the young kids start complaining that their emoticons and emoji are broken in Javascript, and eventually it will get fixed. It may take a decade, for the young kids to grow up and take over Javascript from the old-codgers, but it will happen. To my mind, exposing UTF-16 surrogates to the application is a bug to be fixed, not a feature to be maintained. This, times a thousand. It is *possible* to have non-buggy string routines using UTF-16, but the implementation is a lot more complex than most language developers can be bothered with. I'm not aware of any language that uses UTF-16 internally that doesn't give wrong results for surrogate pairs. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
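Python 3.3 itself is a handy existence proof here: the operations that give wrong answers in languages exposing UTF-16 code units behave correctly (the sample string is mine).

    s = 'a\U0001f600b'                # an emoji outside the BMP
    print(len(s))                     # 3: code points, not code units
    print(s[1] == '\U0001f600')       # True: indexing cannot return half a pair
    print(s[::-1] == 'b\U0001f600a')  # True: reversing keeps the pair intact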
Re: RE Module Performance
On Thu, Jul 25, 2013 at 5:02 PM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: On Thu, 25 Jul 2013 00:34:24 +1000, Chris Angelico wrote: But mainly, I'm just wondering how many people here have any basis from which to argue the point he's trying to make. I doubt most of us have (a) implemented an editor widget, or (b) tested multiple different internal representations to learn the true pros and cons of each. And even if any of us had, that still wouldn't have any bearing on PEP 393, which is about applications, not editor widgets. As stated above, Python strings before AND after PEP 393 are poor choices for an editor, ergo arguing from that standpoint is pretty useless. That's a misleading way to put it. Using immutable strings as editor buffers might be a bad way to implement all but the most trivial, low- performance (i.e. slow) editor, but the basic concept of PEP 393, picking an internal representation of the text based on its contents, is not. That's just normal. The only difference with PEP 393 is that the choice is made on the fly, at runtime, instead of decided in advance by the programmer. Maybe I worded it poorly, but my point was the same as you're saying here: that a Python string is a poor buffer for editing, regardless of PEP 393. It's not that PEP 393 makes Python strings worse for writing a text editor, it's that immutability does that. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Thu, Jul 25, 2013 at 5:15 PM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: On Thu, 25 Jul 2013 04:15:42 +1000, Chris Angelico wrote: [snip] Thank goodness for emoji. Wait til the young kids start complaining that their emoticons and emoji are broken in Javascript, and eventually it will get fixed. It may take a decade, for the young kids to grow up and take over Javascript from the old-codgers, but it will happen. I don't know that that'll happen like that. Emoticons aren't broken in Javascript - you can use them just fine. You only start seeing problems when you index into that string. People will start to wonder why, for instance, a 500 character maximum field deducts two from the limit when an emoticon goes in. Example:

    Type here:<br><textarea id="content" oninput="showlimit(this)"></textarea>
    <br>You have <span id="limit1">500</span> characters left (self.value.length).
    <br>You have <span id="limit2">500</span> characters left (self.textLength).
    <script>
    function showlimit(self) {
        document.getElementById("limit1").innerHTML = 500 - self.value.length;
        document.getElementById("limit2").innerHTML = 500 - self.textLength;
    }
    </script>

I've included an attribute documented here[1] as the codepoint length of the control's value, but in Chrome on Windows, it still counts UTF-16 code units. However, I very much doubt that this will result in language changes. People will just live with it. Chinese and Japanese users will complain, perhaps, and the developers will write it off as whinging, and just say "That's what the internet does". Maybe, if you're really lucky, they'll acknowledge that that's what JavaScript does, but even then I doubt it'd result in language changes. To my mind, exposing UTF-16 surrogates to the application is a bug to be fixed, not a feature to be maintained. This, times a thousand. It is *possible* to have non-buggy string routines using UTF-16, but the implementation is a lot more complex than most language developers can be bothered with. I'm not aware of any language that uses UTF-16 internally that doesn't give wrong results for surrogate pairs. The problem isn't the underlying representation, the problem is what gets exposed to the application. Once you've decided to expose codepoints to the app (abstracting over your UTF-16 underlying representation), the change to using UTF-32, or mimicking PEP 393, or some other structure, is purely internal and an optimization. So I doubt any language will use UTF-16 internally and UTF-32 to the app. It'd be needlessly complex. ChrisA [1] https://developer.mozilla.org/en-US/docs/Web/API/HTMLTextAreaElement -- http://mail.python.org/mailman/listinfo/python-list
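The same count in Python terms, for contrast (my example string): len() reports code points, while the UTF-16 unit count that JavaScript's .length reports can be recovered explicitly.

    s = 'smile: \U0001f600'
    print(len(s))                           # 8 code points
    print(len(s.encode('utf-16-le')) // 2)  # 9 UTF-16 code units, what JS counts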
Re: RE Module Performance
On Thu, 25 Jul 2013 17:58:10 +1000, Chris Angelico wrote: On Thu, Jul 25, 2013 at 5:15 PM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: On Thu, 25 Jul 2013 04:15:42 +1000, Chris Angelico wrote: If nobody had ever thought of doing a multi-format string representation, I could well imagine the Python core devs debating whether the cost of UTF-32 strings is worth the correctness and consistency improvements... and most likely concluding that narrow builds get abolished. And if any other language (eg ECMAScript) decides to move from UTF-16 to UTF-32, I would wholeheartedly support the move, even if it broke code to do so. Unfortunately, so long as most language designers are European-centric, there is going to be a lot of push-back against any attempt to fix (say) Javascript, or Java just for the sake of a bunch of dead languages in the SMPs. Thank goodness for emoji. Wait til the young kids start complaining that their emoticons and emoji are broken in Javascript, and eventually it will get fixed. It may take a decade, for the young kids to grow up and take over Javascript from the old-codgers, but it will happen. I don't know that that'll happen like that. Emoticons aren't broken in Javascript - you can use them just fine. You only start seeing problems when you index into that string. People will start to wonder why, for instance, a 500 character maximum field deducts two from the limit when an emoticon goes in. I get that. I meant *Javascript developers*, not end-users. The young kids today who become Javascript developers tomorrow will grow up in a world where they expect to be able to write band names like ▼□■□■□■ (yes, really, I didn't make that one up) and have it just work. Okay, all those characters are in the BMP, but emoji aren't, and I guarantee that even as we speak some new hipster band is trying to decide whether to name themselves Smiling or Crying . :-) It is *possible* to have non-buggy string routines using UTF-16, but the implementation is a lot more complex than most language developers can be bothered with. I'm not aware of any language that uses UTF-16 internally that doesn't give wrong results for surrogate pairs. The problem isn't the underlying representation, the problem is what gets exposed to the application. Once you've decided to expose codepoints to the app (abstracting over your UTF-16 underlying representation), the change to using UTF-32, or mimicking PEP 393, or some other structure, is purely internal and an optimization. So I doubt any language will use UTF-16 internally and UTF-32 to the app. It'd be needlessly complex. To be honest, I don't understand what you are trying to say. What I'm trying to say is that it is possible to use UTF-16 internally, but *not* assume that every code point (character) is represented by a single 2-byte unit. For example, the len() of a UTF-16 string should not be calculated by counting the number of bytes and dividing by two. 
You actually need to walk the string, inspecting each double-byte:

    # calculate length
    count = 0
    inside_surrogate = False
    for bb in buffer:  # get two bytes at a time
        if is_lower_surrogate(bb):
            inside_surrogate = True
            continue
        if is_upper_surrogate(bb):
            if inside_surrogate:
                count += 1
                inside_surrogate = False
                continue
            raise ValueError("missing lower surrogate")
        if inside_surrogate:
            break
        count += 1
    if inside_surrogate:
        raise ValueError("missing upper surrogate")

Given immutable strings, you could validate the string once, on creation, and from then on assume they are well-formed:

    # calculate length, assuming the string is well-formed:
    count = 0
    skip = False
    for bb in buffer:  # get two bytes at a time
        if skip:
            # second half of a surrogate pair, already counted
            skip = False
            continue
        if is_surrogate(bb):
            skip = True
        count += 1

String operations such as slicing become much more complex once you can no longer assume a 1:1 relationship between code points and code units, whether they are 1, 2 or 4 bytes. Most (all?) language developers don't handle that complexity, and push responsibility for it back onto the coder using the language. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Wednesday, July 24, 2013 4:47:36 PM UTC+2, Michael Torrie wrote: On 07/24/2013 07:40 AM, wxjmfa...@gmail.com wrote: Sorry, you are not understanding Unicode. What is a Unicode Transformation Format (UTF), what is the goal of a UTF and why it is important for an implementation to work with a UTF. Really? Enlighten me. Personally, I would never use UTF as a representation *in memory* for a unicode string if it were up to me. Why? Because UTF characters are not uniform in byte width so accessing positions within the string is terribly slow and has to always be done by starting at the beginning of the string. That's at minimum O(n) compared to FSR's O(1). Surely you understand this. Do you dispute this fact? UTF is a great choice for interchange, though, and indeed that's what it was designed for. Are you calling for UTF to be adopted as the internal, in-memory representation of unicode? Or would you simply settle for UCS-4? Please be clear here. What are you saying? Short example. Writing an editor with something like the FSR is simply impossible (properly). How? FSR is just an implementation detail. It could be UCS-4 and it would also work. - A coding scheme works with a unique set of characters (the repertoire), and the implementation (the programming) works with a unique set of encoded code points. The critical step is the path {unique set of characters} -> {unique set of encoded code points}. Fact: there is no other way to do it properly (this explains why we have to live today with all these coding schemes, and why so many coding schemes had to be created). How to understand it? With a sheet of paper and a pencil. In the byte string world, this step is a no-op. In Unicode, it is exactly the purpose of a utf to achieve this step. utf: a confusing name covering at the same time the process and the result of the process. A utf chunk, a series of bits (not bytes), holds intrinsically the information about the character it is representing. Other exotic coding schemes like iso6937 or CID-fonts work in the same way. Unicode with the help of utf(s) does not differ from the basic rule. - ucs-2: ucs-2 is a perfectly and correctly working coding scheme. ucs-2 is not different from the other coding schemes and does not behave differently (cp... or iso-... or ...). It only covers a smaller repertoire. - utf32: as I pointed out many times, you are already using it (maybe without knowing it). Where? In fonts (OpenType technology), rendering engines, pdf files. Why? Because there is no better way to do it. -- The Unicode table (its construction) is a problem per se. It is not a technical problem, but a very important linguistic aspect of Unicode. See https://groups.google.com/forum/#!topic/comp.lang.python/XkTKE7U8CS0 -- If you are not understanding my editor analogy, one other proposed exercise: build/create a flexible iso-8859-X coding scheme. You will quickly understand where the bottleneck is. Two working ways: - stupidly, with an editor and your fingers. - lazily, with a sheet of paper and your head. About my benchmarks: no offense. You are not understanding them, because you do not understand what this FSR does and the coding of characters. It's a little bit of a vicious circle. Conceptually, this FSR is spending its time solving the problem it creates itself, with plenty of side effects. - There is a clear difference between FSR and ucs-4/utf32. - See also: http://www.unicode.org/reports/tr17/ (In my mind, quite dry and not easy to understand on a first reading). 
jmf -- http://mail.python.org/mailman/listinfo/python-list
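The two steps jmf keeps distinguishing can be separated in a few lines of Python: ord() performs the character-to-code-point step once, and a UTF merely decides how that one integer is laid out in bytes. A small illustration (Python 3; the byte counts are the point, not the exact repr):

ch = '€'
print(hex(ord(ch)))    # 0x20ac: the encoded code point, one integer per character
for codec in ('utf-8', 'utf-16-be', 'utf-32-be'):
    print(codec, len(ch.encode(codec)), 'bytes')
# utf-8 3 bytes, utf-16-be 2 bytes, utf-32-be 4 bytes: one mapping, three layouts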
Re: RE Module Performance
On Thu, Jul 25, 2013 at 7:22 PM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: What I'm trying to say is that it is possible to use UTF-16 internally, but *not* assume that every code point (character) is represented by a single 2-byte unit. For example, the len() of a UTF-16 string should not be calculated by counting the number of bytes and dividing by two. You actually need to walk the string, inspecting each double-byte Anything's possible. But since underlying representations can be changed fairly easily (relative term of course - it's a lot of work, but it can be changed in a single release, no deprecation required or anything), there's very little reason to continue using UTF-16 underneath. May as well switch to UTF-32 for convenience, or PEP 393 for convenience and efficiency, or maybe some other system that's still mostly fixed-width. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Thu, Jul 25, 2013 at 7:27 PM, wxjmfa...@gmail.com wrote: A coding scheme works with a unique set of characters (the repertoire), and the implementation (the programming) works with a unique set of encoded code points. The critical step is the path {unique set of characters} -> {unique set of encoded code points} That's called Unicode. It maps the character 'A' to the code point U+0041 and so on. Code points are integers. In fact, they are very well represented in Python that way (also in Pike, fwiw):

>>> ord('A')
65
>>> chr(65)
'A'
>>> chr(123456)
'\U0001e240'
>>> ord(_)
123456

In the byte string world, this step is a no-op. In Unicode, it is exactly the purpose of a utf to achieve this step. utf: a confusing name covering at the same time the process and the result of the process. A utf chunk, a series of bits (not bytes), holds intrinsically the information about the character it is representing. No, now you're looking at another level: how to store codepoints in memory. That demands that they be stored as bits and bytes, because PC memory works that way. utf32: as I pointed out many times, you are already using it (maybe without knowing it). Where? In fonts (OpenType technology), rendering engines, pdf files. Why? Because there is no better way to do it. And UTF-32 is an excellent system... as long as you're okay with spending four bytes for every character. See https://groups.google.com/forum/#!topic/comp.lang.python/XkTKE7U8CS0 I refuse to click this link. Give us a link to the python-list@python.org archive, or gmane, or something else more suited to the audience. I'm not going to Google Groups just to figure out what you're saying. If you are not understanding my editor analogy. One other proposed exercise. Build/create a flexible iso-8859-X coding scheme. You will quickly understand where the bottleneck is. Two working ways: - stupidly with an editor and your fingers. - lazily with a sheet of paper and your head. What has this to do with the editor? There is a clear difference between FSR and ucs-4/utf32. Yes. Memory usage. PEP 393 strings might take up half or even a quarter of what they'd take up in fixed UTF-32. Other than that, there's no difference. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
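Chris's "four bytes for every character" is easy to verify, and it holds no matter what the characters are; the counts below are encoded payload only, with no object overhead (Python 3):

for s in ('abc', 'a€\U0001D11E'):
    print(len(s), 'chars ->', len(s.encode('utf-32-be')), 'bytes in UTF-32')
# 3 chars -> 12 bytes in UTF-32
# 3 chars -> 12 bytes in UTF-32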
Re: RE Module Performance
wxjmfa...@gmail.com wrote: Short example. Writing an editor with something like the FSR is simply impossible (properly). http://www.gnu.org/software/emacs/manual/html_node/elisp/Text-Representations.html#Text-Representations To conserve memory, Emacs does not hold fixed-length 22-bit numbers that are codepoints of text characters within buffers and strings. Rather, Emacs uses a variable-length internal representation of characters, that stores each character as a sequence of 1 to 5 8-bit bytes, depending on the magnitude of its codepoint[1]. For example, any ASCII character takes up only 1 byte, a Latin-1 character takes up 2 bytes, etc. We call this representation of text multibyte. ... [1] This internal representation is based on one of the encodings defined by the Unicode Standard, called UTF-8, for representing any Unicode codepoint, but Emacs extends UTF-8 to represent the additional codepoints it uses for raw 8-bit bytes and characters not unified with Unicode. Jeremy -- http://mail.python.org/mailman/listinfo/python-list
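For anything inside the Unicode range, the 1-to-5-byte scheme quoted above is ordinary UTF-8; only Emacs's private extension needs a fifth byte. The standard widths follow from the code point alone, as this sketch checks (Python 3; utf8_width is a name made up here):

def utf8_width(cp):
    # bytes needed for code point cp under standard UTF-8
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4

for ch in ('a', 'é', '€', '\U0001D11E'):
    assert utf8_width(ord(ch)) == len(ch.encode('utf-8'))
    print(repr(ch), utf8_width(ord(ch)))   # 1, 2, 3 and 4 bytes respectively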
Re: RE Module Performance
On 07/25/2013 09:36 AM, Jeremy Sanders wrote: wxjmfa...@gmail.com wrote: Short example. Writing an editor with something like the FSR is simply impossible (properly). http://www.gnu.org/software/emacs/manual/html_node/elisp/Text-Representations.html#Text-Representations To conserve memory, Emacs does not hold fixed-length 22-bit numbers that are codepoints of text characters within buffers and strings. Rather, Emacs uses a variable-length internal representation of characters, that stores each character as a sequence of 1 to 5 8-bit bytes, depending on the magnitude of its codepoint[1]. For example, any ASCII character takes up only 1 byte, a Latin-1 character takes up 2 bytes, etc. We call this representation of text multibyte. ... [1] This internal representation is based on one of the encodings defined by the Unicode Standard, called UTF-8, for representing any Unicode codepoint, but Emacs extends UTF-8 to represent the additional codepoints it uses for raw 8-bit bytes and characters not unified with Unicode. Jeremy Wow! The thread that I started has changed a lot and lived a long time. I look forward to its first birthday (^u^). Devyn Collier Johnson -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Thu, 25 Jul 2013 14:36:25 +0100, Jeremy Sanders wrote: wxjmfa...@gmail.com wrote: Short example. Writing an editor with something like the FSR is simply impossible (properly). http://www.gnu.org/software/emacs/manual/html_node/elisp/Text-Representations.html#Text-Representations To conserve memory, Emacs does not hold fixed-length 22-bit numbers that are codepoints of text characters within buffers and strings. Rather, Emacs uses a variable-length internal representation of characters, that stores each character as a sequence of 1 to 5 8-bit bytes, depending on the magnitude of its codepoint[1]. For example, any ASCII character takes up only 1 byte, a Latin-1 character takes up 2 bytes, etc. We call this representation of text multibyte. Well, you've just proven what Vim users have always suspected: Emacs doesn't really exist. [1] This internal representation is based on one of the encodings defined by the Unicode Standard, called UTF-8, for representing any Unicode codepoint, but Emacs extends UTF-8 to represent the additional codepoints it uses for raw 8-bit bytes and characters not unified with Unicode. Do you know what those characters not unified with Unicode are? Is there a list somewhere? I've read all of the pages from here to no avail: http://www.gnu.org/software/emacs/manual/html_node/elisp/Non_002dASCII-Characters.html -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Fri, Jul 26, 2013 at 1:26 AM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: On Thu, 25 Jul 2013 14:36:25 +0100, Jeremy Sanders wrote: To conserve memory, Emacs does not hold fixed-length 22-bit numbers that are codepoints of text characters within buffers and strings. Rather, Emacs uses a variable-length internal representation of characters, that stores each character as a sequence of 1 to 5 8-bit bytes, depending on the magnitude of its codepoint[1]. For example, any ASCII character takes up only 1 byte, a Latin-1 character takes up 2 bytes, etc. We call this representation of text multibyte. Well, you've just proven what Vim users have always suspected: Emacs doesn't really exist. ... lolwut? ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Fri, 26 Jul 2013 01:36:07 +1000, Chris Angelico wrote: On Fri, Jul 26, 2013 at 1:26 AM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: On Thu, 25 Jul 2013 14:36:25 +0100, Jeremy Sanders wrote: To conserve memory, Emacs does not hold fixed-length 22-bit numbers that are codepoints of text characters within buffers and strings. Rather, Emacs uses a variable-length internal representation of characters, that stores each character as a sequence of 1 to 5 8-bit bytes, depending on the magnitude of its codepoint[1]. For example, any ASCII character takes up only 1 byte, a Latin-1 character takes up 2 bytes, etc. We call this representation of text multibyte. Well, you've just proven what Vim users have always suspected: Emacs doesn't really exist. ... lolwut? JMF has explained that it is impossible, impossible I say!, to write an editor using a flexible string representation. Since Emacs uses such a flexible string representation, Emacs is impossible, and therefore Emacs doesn't exist. QED. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Fri, Jul 26, 2013 at 3:18 AM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: On Fri, 26 Jul 2013 01:36:07 +1000, Chris Angelico wrote: On Fri, Jul 26, 2013 at 1:26 AM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: On Thu, 25 Jul 2013 14:36:25 +0100, Jeremy Sanders wrote: To conserve memory, Emacs does not hold fixed-length 22-bit numbers that are codepoints of text characters within buffers and strings. Rather, Emacs uses a variable-length internal representation of characters, that stores each character as a sequence of 1 to 5 8-bit bytes, depending on the magnitude of its codepoint[1]. For example, any ASCII character takes up only 1 byte, a Latin-1 character takes up 2 bytes, etc. We call this representation of text multibyte. Well, you've just proven what Vim users have always suspected: Emacs doesn't really exist. ... lolwut? JMF has explained that it is impossible, impossible I say!, to write an editor using a flexible string representation. Since Emacs uses such a flexible string representation, Emacs is impossible, and therefore Emacs doesn't exist. QED. Quad Error Demonstrated. I never got past the level of Canis Latinicus in debating class. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Thursday, July 25, 2013 12:14:46 PM UTC+2, Chris Angelico wrote: On Thu, Jul 25, 2013 at 7:27 PM, wxjmfa...@gmail.com wrote: A coding scheme works with a unique set of characters (the repertoire), and the implementation (the programming) works with a unique set of encoded code points. The critical step is the path {unique set of characters} -> {unique set of encoded code points} That's called Unicode. It maps the character 'A' to the code point U+0041 and so on. Code points are integers. In fact, they are very well represented in Python that way (also in Pike, fwiw):

>>> ord('A')
65
>>> chr(65)
'A'
>>> chr(123456)
'\U0001e240'
>>> ord(_)
123456

In the byte string world, this step is a no-op. In Unicode, it is exactly the purpose of a utf to achieve this step. utf: a confusing name covering at the same time the process and the result of the process. A utf chunk, a series of bits (not bytes), holds intrinsically the information about the character it is representing. No, now you're looking at another level: how to store codepoints in memory. That demands that they be stored as bits and bytes, because PC memory works that way. utf32: as I pointed out many times, you are already using it (maybe without knowing it). Where? In fonts (OpenType technology), rendering engines, pdf files. Why? Because there is no better way to do it. And UTF-32 is an excellent system... as long as you're okay with spending four bytes for every character. See https://groups.google.com/forum/#!topic/comp.lang.python/XkTKE7U8CS0 I refuse to click this link. Give us a link to the python-list@python.org archive, or gmane, or something else more suited to the audience. I'm not going to Google Groups just to figure out what you're saying. If you are not understanding my editor analogy. One other proposed exercise. Build/create a flexible iso-8859-X coding scheme. You will quickly understand where the bottleneck is. Two working ways: - stupidly with an editor and your fingers. - lazily with a sheet of paper and your head. What has this to do with the editor? There is a clear difference between FSR and ucs-4/utf32. Yes. Memory usage. PEP 393 strings might take up half or even a quarter of what they'd take up in fixed UTF-32. Other than that, there's no difference. ChrisA Let's start with a simple string, \textemdash or \textendash:

>>> sys.getsizeof('–')
40
>>> sys.getsizeof('a')
26

jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Fri, Jul 26, 2013 at 5:07 AM, wxjmfa...@gmail.com wrote: Let's start with a simple string, \textemdash or \textendash:

>>> sys.getsizeof('–')
40
>>> sys.getsizeof('a')
26

Most of the cost is in those two apostrophes, look:

>>> sys.getsizeof('a')
26
>>> sys.getsizeof(a)
8

Okay, that's slightly unfair (bonus points: figure out what I did to make this work; there are at least two right answers) but still, look at what an empty string costs:

>>> sys.getsizeof('')
25

Or look at the difference between one of these characters and two:

>>> sys.getsizeof('aa') - sys.getsizeof('a')
1
>>> sys.getsizeof('––') - sys.getsizeof('–')
2

That's what the characters really cost. The overhead is fixed. It is, in fact, almost completely insignificant. The storage requirement for a non-ASCII, BMP-only string converges to two bytes per character. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
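Chris's marginal-cost measurement generalizes to all three FSR widths; spelled out as a loop it reads (CPython 3.3+; the absolute getsizeof figures vary by platform, the differences do not):

import sys

for ch in ('a', '€', '\U0001D11E'):
    marginal = sys.getsizeof(ch * 2) - sys.getsizeof(ch)
    print(repr(ch), marginal)   # 1, 2 and 4: the true per-character storage cost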
RE: RE Module Performance
Chris Angelico wrote: On Fri, Jul 26, 2013 at 5:07 AM, wxjmfa...@gmail.com wrote: Let's start with a simple string, \textemdash or \textendash:

>>> sys.getsizeof('–')
40
>>> sys.getsizeof('a')
26

Most of the cost is in those two apostrophes, look:

>>> sys.getsizeof('a')
26
>>> sys.getsizeof(a)
8

Okay, that's slightly unfair (bonus points: figure out what I did to make this work; there are at least two right answers) but still, look at what an empty string costs: I like bonus points. :)

>>> a = None
>>> sys.getsizeof(a)
8

Not sure what the other right answer is... booleans take 12 bytes (on 2.6).

>>> sys.getsizeof('')
25

Or look at the difference between one of these characters and two:

>>> sys.getsizeof('aa') - sys.getsizeof('a')
1
>>> sys.getsizeof('––') - sys.getsizeof('–')
2

That's what the characters really cost. The overhead is fixed. It is, in fact, almost completely insignificant. The storage requirement for a non-ASCII, BMP-only string converges to two bytes per character. ChrisA -- http://mail.python.org/mailman/listinfo/python-list Ramit -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Wed, Jul 24, 2013 at 9:34 AM, Chris Angelico ros...@gmail.com wrote: On Thu, Jul 25, 2013 at 12:17 AM, David Hutto dwightdhu...@gmail.com wrote: I've screwed up plenty of times in python, but can write code like a pro when I'm feeling better(on SSI and medicaid). An editor can be built simply, but it's preference that makes the difference. Some might have used tkinter, gtk. wxpython or other methods for the task. I think the main issue in responding is your library preference, or widget set preference. These can make you right with some in your response, or wrong with others that have a preferable gui library that coincides with one's personal cognitive structure that makes t jmf's point is more about writing the editor widget (Scintilla, as opposed to SciTE), which most people will never bother to do. I've written several text editors, always by embedding someone else's widget, and therefore not concerning myself with its internal string representation. Frankly, Python's strings are a *terrible* internal representation for an editor widget - not because of PEP 393, but simply because they are immutable, and every keypress would result in a rebuilding of the string. On the flip side, I could quite plausibly imagine using a list of strings; whenever text gets inserted, the string gets split at that point, and a new string created for the insert (which also means that an Undo operation simply removes one entire string). In this usage, the FSR is beneficial, as it's possible to have different strings at different widths. But mainly, I'm just wondering how many people here have any basis from which to argue the point he's trying to make. I doubt most of us have (a) implemented an editor widget, or (b) tested multiple different internal representations to learn the true pros and cons of each. And even if any of us had, that still wouldn't have any bearing on PEP 393, which is about applications, not editor widgets. As stated above, Python strings before AND after PEP 393 are poor choices for an editor, ergo arguing from that standpoint is pretty useless. Not that that bothers jmf... I think you've just motivated me to finally get around to writing the custom output widget for my MUD client. Of course that will be simpler than a standard rich text editor widget, since it will never receive input from the user and modifications will (typically) always come in the form of append operations. I intend to write it in pure Python (well, wxPython), however. -- http://mail.python.org/mailman/listinfo/python-list
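Chris's list-of-strings design is small enough to sketch directly. This is an illustrative toy, not anyone's production widget: an insert splits one piece instead of rebuilding the whole text, and undoing that insert would simply delete the piece it added.

class PieceBuffer:
    def __init__(self, text=''):
        # the document is a list of immutable string pieces
        self.pieces = [text] if text else []

    def insert(self, offset, text):
        # split the piece containing `offset` and slot the new text in
        i = 0
        for n, piece in enumerate(self.pieces):
            if offset <= i + len(piece):
                k = offset - i
                parts = [p for p in (piece[:k], text, piece[k:]) if p]
                self.pieces[n:n + 1] = parts
                return
            i += len(piece)
        self.pieces.append(text)

    def text(self):
        return ''.join(self.pieces)

buf = PieceBuffer('hello world')
buf.insert(5, ',')
print(buf.text())     # hello, world
print(buf.pieces)     # ['hello', ',', ' world']: only one piece was split

Under PEP 393 each piece is stored at its own width, so an all-ASCII paragraph never pays for an emoji inserted somewhere else in the buffer.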
Re: RE Module Performance
On Thu, Jul 25, 2013 at 12:18 PM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: On Fri, 26 Jul 2013 01:36:07 +1000, Chris Angelico wrote: On Fri, Jul 26, 2013 at 1:26 AM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: On Thu, 25 Jul 2013 14:36:25 +0100, Jeremy Sanders wrote: To conserve memory, Emacs does not hold fixed-length 22-bit numbers that are codepoints of text characters within buffers and strings. Rather, Emacs uses a variable-length internal representation of characters, that stores each character as a sequence of 1 to 5 8-bit bytes, depending on the magnitude of its codepoint[1]. For example, any ASCII character takes up only 1 byte, a Latin-1 character takes up 2 bytes, etc. We call this representation of text multibyte. Well, you've just proven what Vim users have always suspected: Emacs doesn't really exist. ... lolwut? JMF has explained that it is impossible, impossible I say!, to write an editor using a flexible string representation. Since Emacs uses such a flexible string representation, Emacs is impossible, and therefore Emacs doesn't exist. QED. Except that the described representation used by Emacs is a variant of UTF-8, not an FSR. It doesn't have three different possible encodings for the letter 'a' depending on what other characters happen to be in the string. As I understand it, jmf would be perfectly happy if Python used UTF-8 (or presumably the Emacs variant) as its internal string representation. -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Thu, 25 Jul 2013 15:45:38 -0500, Ian Kelly wrote: On Thu, Jul 25, 2013 at 12:18 PM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: On Fri, 26 Jul 2013 01:36:07 +1000, Chris Angelico wrote: On Fri, Jul 26, 2013 at 1:26 AM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: On Thu, 25 Jul 2013 14:36:25 +0100, Jeremy Sanders wrote: To conserve memory, Emacs does not hold fixed-length 22-bit numbers that are codepoints of text characters within buffers and strings. Rather, Emacs uses a variable-length internal representation of characters, that stores each character as a sequence of 1 to 5 8-bit bytes, depending on the magnitude of its codepoint[1]. For example, any ASCII character takes up only 1 byte, a Latin-1 character takes up 2 bytes, etc. We call this representation of text multibyte. Well, you've just proven what Vim users have always suspected: Emacs doesn't really exist. ... lolwut? JMF has explained that it is impossible, impossible I say!, to write an editor using a flexible string representation. Since Emacs uses such a flexible string representation, Emacs is impossible, and therefore Emacs doesn't exist. QED. Except that the described representation used by Emacs is a variant of UTF-8, not an FSR. It doesn't have three different possible encodings for the letter 'a' depending on what other characters happen to be in the string. As I understand it, jmf would be perfectly happy if Python used UTF-8 (or presumably the Emacs variant) as its internal string representation. UTF-8 uses a flexible representation on a character-by-character basis. When parsing UTF-8, one needs to look at EVERY character to decide how many bytes you need to read. In Python 3, the flexible representation is on a string-by-string basis: once Python has looked at the string header, it can tell whether the *entire* string takes 1, 2 or 4 bytes per character, and the string is then fixed-width. You can't do that with UTF-8. To put it in terms of pseudo-code:

# Python 3.3
def parse_string(astring):
    # Decision gets made once per string.
    if astring uses 1 byte:
        count = 1
    elif astring uses 2 bytes:
        count = 2
    else:
        count = 4
    while not done:
        char = convert(next(count bytes))

# UTF-8
def parse_string(astring):
    while not done:
        b = next(1 byte)
        # Decision gets made for every single char
        if b uses 1 byte:
            char = convert(b)
        elif b uses 2 bytes:
            char = convert(b, next(1 byte))
        elif b uses 3 bytes:
            char = convert(b, next(2 bytes))
        else:
            char = convert(b, next(3 bytes))

So UTF-8 requires much more runtime overhead than Python 3.3, and Emacs's variation can in fact require more bytes per character than either. (UTF-8 and Python 3.3 can require up to four bytes, Emacs up to five.) I'm not surprised that JMF would prefer UTF-8 -- he is completely out of his depth, and is a fine example of the Dunning-Kruger effect in action. He is so sure he is right based on so little evidence. One advantage of UTF-8 is that for some BMP characters, you can get away with only three bytes instead of four. For transmitting data over the wire, or storage on disk, that's potentially up to a 25% reduction in space, which is not to be sneezed at. (Although in practice it's usually much less than that, since the most common characters are encoded to 1 or 2 bytes, not 4). But that comes at the cost of much more runtime overhead, which in my opinion makes UTF-8 a second-class string representation compared to fixed-width representations. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
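The per-character decision above is exactly why indexing into UTF-8 is a scan rather than an array lookup. A sketch of character lookup by position (Python 3, assuming well-formed input; the lead-byte thresholds are the standard UTF-8 ones):

def utf8_char_at(data, index):
    # find the code point at `index` by scanning from the start: O(index)
    pos = 0
    for _ in range(index + 1):
        lead = data[pos]
        if lead < 0x80:
            size = 1
        elif lead < 0xE0:
            size = 2
        elif lead < 0xF0:
            size = 3
        else:
            size = 4
        start, pos = pos, pos + size
    return data[start:start + size].decode('utf-8')

s = 'a€\U0001D11E!'
assert utf8_char_at(s.encode('utf-8'), 2) == '\U0001D11E'

A fixed-width layout answers the same question with pos = index * width, which is the O(1) lookup the FSR preserves.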
Re: RE Module Performance
On 07/25/2013 01:07 PM, wxjmfa...@gmail.com wrote: Let's start with a simple string, \textemdash or \textendash:

>>> sys.getsizeof('–')
40
>>> sys.getsizeof('a')
26

That's meaningless. You're comparing the overhead of a string object itself (a one-time cost anyway), not the overhead of storing the actual characters. This is the only meaningful comparison:

sys.getsizeof('––') - sys.getsizeof('–')
sys.getsizeof('aa') - sys.getsizeof('a')

Actually I'm not even sure what your point is after all this time of railing against FSR. You have said in the past that Python penalizes users of character sets that require wider byte encodings, but what would you have us do? Use 4-byte characters and penalize everyone equally? Use 2-byte characters that incorrectly expose surrogate pairs for some characters? Use UTF-8 in memory and do O(n) indexing? Are your programs (actual programs, not contrived benchmarks) actually slower because of FSR? Is FSR incorrect? If so, according to what part of the Unicode standard? I'm not trying to troll, or feed the troll. I'm actually curious. I think perhaps you feel that many of us who don't use unicode often don't understand unicode because some of us don't understand you. If so, I'm not sure that's actually true. -- http://mail.python.org/mailman/listinfo/python-list
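The one genuine cost Michael alludes to, the penalty jmf keeps measuring, is also easy to show honestly: appending a single wider character re-widens the entire string. A quick check (CPython 3.3+; exact totals depend on your build's per-object overhead):

import sys

narrow = 'a' * 1000
widened = narrow + '€'    # one BMP character forces 2-byte units throughout

print(sys.getsizeof(narrow))    # about 1000 bytes of payload: 1 byte per char
print(sys.getsizeof(widened))   # about 2002 bytes of payload: 2 bytes per char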
Re: RE Module Performance
On 07/25/2013 11:18 AM, Steven D'Aprano wrote: JMF has explained that it is impossible, impossible I say!, to write an editor using a flexible string representation. Since Emacs uses such a flexible string representation, Emacs is impossible, and therefore Emacs doesn't exist. Now I'm even more confused. He once pointed to Go as an example of how unicode should be done in a language, yet Go uses UTF-8, I think. But I don't think UTF-8 is what JMF refers to as flexible string representation. FSR does use 1, 2 or 4 bytes per character, but each character in the string uses the same width. That's different from UTF-8 or UTF-16, which are variable width per character. -- http://mail.python.org/mailman/listinfo/python-list
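CPython exposes no public API for the width PEP 393 chose, but the width is fully determined by the widest character, so it can be inferred. A sketch (needs Python 3.4+ for max() with a default; fsr_width is a name invented here):

def fsr_width(s):
    # per-character unit size the FSR would pick: the widest char decides
    widest = max(map(ord, s), default=0)
    if widest < 0x100:
        return 1    # Latin-1 storage
    if widest < 0x10000:
        return 2    # UCS-2 storage
    return 4        # UCS-4 storage

print(fsr_width('abc'))               # 1
print(fsr_width('abc€'))              # 2: one BMP char widens every slot
print(fsr_width('abc\U0001F600'))     # 4: an astral char widens them further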
Re: RE Module Performance
On Thu, Jul 25, 2013 at 8:48 PM, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: UTF-8 uses a flexible representation on a character-by-character basis. When parsing UTF-8, one needs to look at EVERY character to decide how many bytes you need to read. In Python 3, the flexible representation is on a string-by-string basis: once Python has looked at the string header, it can tell whether the *entire* string takes 1, 2 or 4 bytes per character, and the string is then fixed-width. You can't do that with UTF-8. UTF-8 does not use a flexible representation. A codec that is encoding a string in UTF-8 and examining a particular character does not have any choice of how to encode that character; there is exactly one sequence of bits that is the UTF-8 encoding for the character. Further, for any given sequence of code points there is exactly one sequence of bytes that is the UTF-8 encoding of those code points. In contrast, with the FSR there are as many as three different sequences of bytes that encode a sequence of code points, with one of them (the shortest) being canonical. That's what makes it flexible. Anyway, my point was just that Emacs is not a counter-example to jmf's claim about implementing text editors, because UTF-8 is not what he (or anybody else) is referring to when speaking of the FSR or something like the FSR. -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Saturday, July 13, 2013 1:13:47 AM UTC+2, Michael Torrie wrote: On 07/12/2013 09:59 AM, Joshua Landau wrote: If you're interested, the basics of it are that strings now use a variable number of bytes to encode their values depending on whether values outside of the ASCII range and some other range are used, as an optimisation. Variable number of bytes is a problematic way to say it. UTF-8 is a variable-number-of-bytes encoding scheme where each character can be 1 to 4 bytes, depending on the unicode character. As you can imagine, this sort of encoding scheme would be very slow to do slicing with (looking up a character at a certain position). Python uses fixed-width encoding schemes, so they preserve the O(1) lookup speeds, but python will use 1, 2, or 4 bytes per character in the string, depending on what is needed. Just in case the OP might have misunderstood what you are saying. jmf sees the case where a string is promoted from one width to another, and thinks that the brief slowdown in string operations to accomplish this is a problem. In reality I have never seen anyone use the types of string operations his pseudo benchmarks use, and in general Python 3's string behavior is pretty fast. And apparently much more correct than if jmf's ideas of unicode were implemented. -- Sorry, you are not understanding Unicode. What is a Unicode Transformation Format (UTF), what is the goal of a UTF and why it is important for an implementation to work with a UTF. Short example. Writing an editor with something like the FSR is simply impossible (properly). jmf -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Wed, Jul 24, 2013 at 11:40 PM, wxjmfa...@gmail.com wrote: Short example. Writing an editor with something like the FSR is simply impossible (properly). jmf, have you ever written an editor with *any* string representation? Are you speaking from any level of experience at all? ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
I've screwed up plenty of times in python, but can write code like a pro when I'm feeling better(on SSI and medicaid). An editor can be built simply, but it's preference that makes the difference. Some might have used tkinter, gtk. wxpython or other methods for the task. I think the main issue in responding is your library preference, or widget set preference. These can make you right with some in your response, or wrong with others that have a preferable gui library that coincides with one's personal cognitive structure that makes t On Wed, Jul 24, 2013 at 9:48 AM, Chris Angelico ros...@gmail.com wrote: On Wed, Jul 24, 2013 at 11:40 PM, wxjmfa...@gmail.com wrote: Short example. Writing an editor with something like the FSR is simply impossible (properly). jmf, have you ever written an editor with *any* string representation? Are you speaking from any level of experience at all? ChrisA -- http://mail.python.org/mailman/listinfo/python-list -- Best Regards, David Hutto *CEO:* *http://www.hitwebdevelopment.com* -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
I've screwed up plenty of times in python, but can write code like a pro when I'm feeling better(on SSI and medicaid). An editor can be built simply, but it's preference that makes the difference. Some might have used tkinter, gtk. wxpython or other methods for the task. I think the main issue in responding is your library preference, or widget set preference. These can make you right with some in your response, or wrong with others that have a preferable gui library that coincides with one's personal cognitive structure that makes it more usable in relation to how you learned a preferable gui kit. On Wed, Jul 24, 2013 at 10:17 AM, David Hutto dwightdhu...@gmail.comwrote: I've screwed up plenty of times in python, but can write code like a pro when I'm feeling better(on SSI and medicaid). An editor can be built simply, but it's preference that makes the difference. Some might have used tkinter, gtk. wxpython or other methods for the task. I think the main issue in responding is your library preference, or widget set preference. These can make you right with some in your response, or wrong with others that have a preferable gui library that coincides with one's personal cognitive structure that makes t On Wed, Jul 24, 2013 at 9:48 AM, Chris Angelico ros...@gmail.com wrote: On Wed, Jul 24, 2013 at 11:40 PM, wxjmfa...@gmail.com wrote: Short example. Writing an editor with something like the FSR is simply impossible (properly). jmf, have you ever written an editor with *any* string representation? Are you speaking from any level of experience at all? ChrisA -- http://mail.python.org/mailman/listinfo/python-list -- Best Regards, David Hutto *CEO:* *http://www.hitwebdevelopment.com* -- Best Regards, David Hutto *CEO:* *http://www.hitwebdevelopment.com* -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Thu, Jul 25, 2013 at 12:17 AM, David Hutto dwightdhu...@gmail.com wrote: I've screwed up plenty of times in python, but can write code like a pro when I'm feeling better(on SSI and medicaid). An editor can be built simply, but it's preference that makes the difference. Some might have used tkinter, gtk. wxpython or other methods for the task. I think the main issue in responding is your library preference, or widget set preference. These can make you right with some in your response, or wrong with others that have a preferable gui library that coincides with one's personal cognitive structure that makes t jmf's point is more about writing the editor widget (Scintilla, as opposed to SciTE), which most people will never bother to do. I've written several text editors, always by embedding someone else's widget, and therefore not concerning myself with its internal string representation. Frankly, Python's strings are a *terrible* internal representation for an editor widget - not because of PEP 393, but simply because they are immutable, and every keypress would result in a rebuilding of the string. On the flip side, I could quite plausibly imagine using a list of strings; whenever text gets inserted, the string gets split at that point, and a new string created for the insert (which also means that an Undo operation simply removes one entire string). In this usage, the FSR is beneficial, as it's possible to have different strings at different widths. But mainly, I'm just wondering how many people here have any basis from which to argue the point he's trying to make. I doubt most of us have (a) implemented an editor widget, or (b) tested multiple different internal representations to learn the true pros and cons of each. And even if any of us had, that still wouldn't have any bearing on PEP 393, which is about applications, not editor widgets. As stated above, Python strings before AND after PEP 393 are poor choices for an editor, ergo arguing from that standpoint is pretty useless. Not that that bothers jmf... ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On 07/24/2013 07:40 AM, wxjmfa...@gmail.com wrote: Sorry, you are not understanding Unicode. What is a Unicode Transformation Format (UTF), what is the goal of a UTF and why it is important for an implementation to work with a UTF. Really? Enlighten me. Personally, I would never use UTF as a representation *in memory* for a unicode string if it were up to me. Why? Because UTF characters are not uniform in byte width so accessing positions within the string is terribly slow and has to always be done by starting at the beginning of the string. That's at minimum O(n) compared to FSR's O(1). Surely you understand this. Do you dispute this fact? UTF is a great choice for interchange, though, and indeed that's what it was designed for. Are you calling for UTF to be adopted as the internal, in-memory representation of unicode? Or would you simply settle for UCS-4? Please be clear here. What are you saying? Short example. Writing an editor with something like the FSR is simply impossible (properly). How? FSR is just an implementation detail. It could be UCS-4 and it would also work. -- http://mail.python.org/mailman/listinfo/python-list
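Michael's O(1)-versus-O(n) claim can be checked empirically: under a fixed-width layout, fetching the last character of a long string costs the same as fetching the first. A rough timing sketch (CPython; absolute numbers are machine-dependent, their near-equality is the point):

import timeit

setup = "s = 'é' * 10000000"
print(timeit.timeit('s[0]', setup=setup, number=1000000))
print(timeit.timeit('s[-1]', setup=setup, number=1000000))
# the two times match closely: position lookup is arithmetic, not a scan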
Re: RE Module Performance
On 07/24/2013 08:34 AM, Chris Angelico wrote: Frankly, Python's strings are a *terrible* internal representation for an editor widget - not because of PEP 393, but simply because they are immutable, and every keypress would result in a rebuilding of the string. On the flip side, I could quite plausibly imagine using a list of strings; whenever text gets inserted, the string gets split at that point, and a new string created for the insert (which also means that an Undo operation simply removes one entire string). In this usage, the FSR is beneficial, as it's possible to have different strings at different widths. Very good point. Seems like this is exactly what is tripping up jmf in general. His pseudo benchmarks are bogus for this exact reason. No one uses python strings in this fashion. Editors certainly would not. But then again his argument in the past does not mention editors. But it makes me wonder if jmf is using python strings appropriately, or even realizes they are immutable. But mainly, I'm just wondering how many people here have any basis from which to argue the point he's trying to make. I doubt most of us have (a) implemented an editor widget, or (b) tested multiple different internal representations to learn the true pros and cons of each. Maybe, but simply thinking logically, FSR and UCS-4 are equivalent in pros and cons, and the cons of using UCS-2 (the old narrow builds) are well known. UCS-2 simply cannot represent all of unicode correctly. This is in the PEP of course. His most recent argument that Python should use UTF as a representation is very strange to be honest. The cons of UTF are apparent and widely known. The main con is that UTF strings are O(n) for indexing a position within the string. -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On Thu, Jul 25, 2013 at 12:47 AM, Michael Torrie torr...@gmail.com wrote: On 07/24/2013 07:40 AM, wxjmfa...@gmail.com wrote: Sorry, you are not understanding Unicode. What is a Unicode Transformation Format (UTF), what is the goal of a UTF and why it is important for an implementation to work with a UTF. Really? Enlighten me. Personally, I would never use UTF as a representation *in memory* for a unicode string if it were up to me. Why? Because UTF characters are not uniform in byte width so accessing positions within the string is terribly slow and has to always be done by starting at the beginning of the string. That's at minimum O(n) compared to FSR's O(1). Surely you understand this. Do you dispute this fact? Take care here; UTF is a general term for Unicode Transformation Formats, of which one (UTF-32) is fixed-width. Every other UTF-n is variable width, though, so your point still stands. UTF-32 is the basis for Python's FSR. ChrisA -- http://mail.python.org/mailman/listinfo/python-list
Re: RE Module Performance
On 7/24/2013 11:00 AM, Michael Torrie wrote: On 07/24/2013 08:34 AM, Chris Angelico wrote: Frankly, Python's strings are a *terrible* internal representation for an editor widget - not because of PEP 393, but simply because they are immutable, and every keypress would result in a rebuilding of the string. On the flip side, I could quite plausibly imagine using a list of strings; I used exactly this, a list of strings, for a Python-coded text-only mock editor to replace the tk Text widget in idle tests. It works fine for the purpose. For small test texts, the inefficiency of immutable strings is not relevant. Tk apparently uses a C-coded btree rather than a Python list. All details are hidden, unless one finds and reads the source ;-), but it uses C arrays rather than Python strings. In this usage, the FSR is beneficial, as it's possible to have different strings at different widths. For my purpose, the mock Text works the same in 2.7 and 3.3+. Maybe, but simply thinking logically, FSR and UCS-4 are equivalent in pros and cons, They both have the pro that indexing is direct *and correct*. The cons are different. and the cons of using UCS-2 (the old narrow builds) are well known. UCS-2 simply cannot represent all of unicode correctly. Python's narrow builds, at least for several releases, were in between UCS-2 and UTF-16 in that they used surrogates to represent all of Unicode but did not correct indexing for the presence of astral chars. This is a nuisance for those who do use astral chars, such as emotes and CJK name chars, on an everyday basis. -- Terry Jan Reedy -- http://mail.python.org/mailman/listinfo/python-list
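Terry's uncorrected-indexing nuisance can be simulated on a modern wide build by indexing 2-byte code units directly, which is effectively what a narrow build did (Python 3.4+ for surrogatepass with the UTF-16 codec; narrow_index is a hypothetical stand-in, not real narrow-build code):

def narrow_index(s, i):
    # index 2-byte code units, the way a narrow build indexed strings
    data = s.encode('utf-16-le')
    return data[2 * i:2 * i + 2].decode('utf-16-le', 'surrogatepass')

s = '\U0001D11E!'
print(repr(s[1]))                # '!': a wide build counts code points
print(repr(narrow_index(s, 1)))  # '\udd1e': the narrow view saw a lone surrogate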