On Saturday, 27 July 2013 04:05:03 UTC+2, Michael Torrie wrote:
> On 07/26/2013 07:21 AM, wxjmfa...@gmail.com wrote:
> > >>> sys.getsizeof('––') - sys.getsizeof('–')
> >
> > I have already explained / commented this.
>
> Maybe it got lost in translation, but I don't understand your point
> with that.
>
> > Hint: To understand Unicode (and every coding scheme), you should
> > understand "utf". The how and the *why*.
>
> Hmm, so if python used utf-8 internally to represent unicode strings
> would not that punish *all* users (not just non-ascii users) since
> searching a string for a certain character position requires an O(n)
> operation? UTF-32 I could see (and indeed that's essentially what FSR
> uses when necessary does it not?), but not utf-8 or utf-16.

------

Did you read my previous link, the Unicode Character Encoding Model?
Did you understand it?

Unicode only - no FSR. (I skip some points and still attempt to be
correct.)

Unicode is a four-step process:

  [ {unique set of characters}
      --> {unique set of code points, the "labels"}
      --> {unique set of encoded code points} ]
      --> implementation (bytes)

First point to notice: "pure Unicode", [...], is different from the
"implementation". *This is a deliberate choice.*

The critical step is the path

  {unique set of characters} --> {unique set of encoded code points},

done in such a way that the implementation can "work comfortably" with
this *unique* set of encoded code points. Conceptually, the
implementation works with a unique set of "already prepared encoded
code points". This is a very critical step. To put it crudely: in the
above chain, this problem is "already" eliminated and solved. (Compare
a byte/char coding scheme, where this step is a no-op.)
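To make that chain concrete, here is a minimal sketch of my own (plain
Python 3, nothing FSR-specific): one character, its code point "label",
and the encoded code points each utf produces once serialized to bytes.

>>> ch = '€'                # step 1: the character
>>> hex(ord(ch))            # step 2: its code point, the "label"
'0x20ac'
>>> ch.encode('utf-8')      # steps 3/4: three 8-bit chunks
b'\xe2\x82\xac'
>>> ch.encode('utf-16-be')  # one 16-bit chunk
b' \xac'
>>> ch.encode('utf-32-be')  # one 32-bit chunk
b'\x00\x00 \xac'

(The 0x20 byte happens to display as a space in the reprs.)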
Now, if you wish, this is a separate/different problem. To create this
unique set of encoded code points, Unicode uses these "utf(s)". I
repeat: a confusing name, for both the process and the result of the
process. (I neglect ucs.)

What are these? Chunks of bits; groups of 8/16/32 bits; words. It is
up to the implementation to convert these sequences of bits into
bytes, ***if you wish to convert them into bytes at all!*** Surprise!
Why not put two of the 32-bit words into one 64-bit "machine" word?
(See golang / rune / int32.)

Back to utf. utfs are not only elements of a unique set of encoded
code points. They have an interesting feature: each "utf chunk" holds
intrinsically the character (in fact, the code point) it is supposed
to represent. In utf-32, the obvious case, it is just the code point.
In utf-8, it is the first chunk which helps, and utf-16 is a mixed
case (utf-8 / utf-32).

In other words, in an implementation using bytes, for any pointer
position it is always possible to find the corresponding encoded code
point, and from this the corresponding character, without any
"programmed" information. See my editor example: how does one find the
char under the caret? In fact, a silly example: how could the caret be
positioned or moved at all, if the underlying encoded code point could
not be discerned? (See the first sketch below.)

Next step, and another separate problem: why all these utf versions?
It is always the same story. Some prefer universality (utf-32) and
some prefer, well, some kind of conservatism. utf-8 is more
complicated; it demands more work and, logically and expectedly, some
performance regression. utf-8 is better suited to producing bytes,
utf-16/32 to internal processing. utf-8 had no choice but to lose
direct indexing (see the second sketch below). And so on.
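A minimal sketch of that caret point (char_at is my own helper, not
any editor's API): every utf-8 continuation byte matches the bit
pattern 10xxxxxx, so from any byte position one can step back to the
start of the encoded code point and decode it, with no extra
bookkeeping.

# Find the character whose utf-8 sequence covers byte position `pos`.
def char_at(buf: bytes, pos: int) -> str:
    start = pos
    while start > 0 and (buf[start] & 0xC0) == 0x80:
        start -= 1              # step back over continuation bytes
    lead = buf[start]           # the lead byte announces the length
    if lead < 0x80:
        length = 1
    elif lead < 0xE0:
        length = 2
    elif lead < 0xF0:
        length = 3
    else:
        length = 4
    return buf[start:start + length].decode('utf-8')

buf = 'abc€def'.encode('utf-8')   # '€' occupies bytes 3, 4 and 5
print(char_at(buf, 4))            # -> '€', found from a middle byte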
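And a sketch of the indexing trade-off (again my own illustration, not
CPython's code): reaching code point n in utf-8 means walking over the
variable-width chunks from the start, an O(n) scan, while utf-32 keeps
the direct, fixed-offset read.

def codepoint_at(buf: bytes, n: int) -> str:
    def width(lead: int) -> int:  # utf-8 sequence length, from the lead byte
        return 1 if lead < 0x80 else 2 if lead < 0xE0 else 3 if lead < 0xF0 else 4
    pos = 0
    for _ in range(n):            # skip n encoded code points: O(n)
        pos += width(buf[pos])
    return buf[pos:pos + width(buf[pos])].decode('utf-8')

s = 'a€b'
print(codepoint_at(s.encode('utf-8'), 1))              # -> '€'
# utf-32: code point n sits at byte offset 4*n, a constant-time read
print(s.encode('utf-32-be')[4:8].decode('utf-32-be'))  # -> '€'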
Fact: all these coding schemes work with a unique set of encoded code
points (surprise again: just like a byte string!). The loss of
performance of utf-8 is very minimal compared to the loss of
performance one gets from a multiple-coding scheme. This kind of work
has been done, and if my information is correct, even by the creators
of utf-8. (There are sometimes good scientists.) There are plenty of
advantages in using utf instead of something else, and advantages in
fields other than just pure coding. The utf-16/32 schemes have the
advantage of ditching ascii forever; there, the ascii concept no
longer exists.

One should also understand that all this stuff was not created from
scratch. It was a balance between existing technologies. MS stuck with
the idea "no more ascii, let's use ucs-2", and the *x world resisted
Unicode adoption as long as possible. utf-8 is one of the compromises
made for the adoption of Unicode; retrospectively, not such a good
compromise. Computer scientists are funny scientists: they love to
solve the problems they created themselves.

-----

Quickly: sys.getsizeof() in the light of what I explained.

1) As this FSR works with multiple encodings, it has to keep track of
the encoding. It puts it in the overhead of the str class (overhead =
real overhead + encoding), in such an absurd way that a

>>> sys.getsizeof('€')
40

needs 14 bytes more than a

>>> sys.getsizeof('z')
26

You may vary the length of the str; the problem is still there. Not
bad for a coding scheme.

2) Take a look at this. Get rid of the overhead:

>>> sys.getsizeof('b'*1000000 + 'c')
1000026
>>> sys.getsizeof('b'*1000000 + '€')
2000040

What does it mean? It means that Python has to reencode a str every
time it is necessary, because it works with multiple codings. This FSR
is not even a copy of utf-8:

>>> len(('b'*1000000 + '€').encode('utf-8'))
1000003

utf-8, or any utf, never needs to reencode and never spends its time
reencoding.

3) Unicode compliance. We know, retrospectively, that latin-1 was a
bad choice: unusable for 17 European languages. Believe it or not,
20 years of Unicode incubation is not long enough to learn this. When
discussing once with a French Python core dev, one with commit access,
he did not know one cannot use latin-1 for the French language!

BTW, I proposed to the French devs to test the FSR with the set of
characters recognized by the "Imprimerie Nationale", some kind of
legal French authority regarding characters and typography. Never
heard about it again. Of course, I did it myself.

In short: FSR = bad performance + bad memory management + non-Unicode
compliance.

Good point: the FSR is a nice tool for those who wish to teach
Unicode. It is not every day one has such an opportunity.

---------

I'm practically no longer programming or writing applications. I'm
still active, and for more than a decade I have been observing all
this Unicode world: languages (go, c#, Python, Ruby), text-processing
systems (esp. Unicode TeX engines) and font technology. Very, very
interesting.

jmf

--
http://mail.python.org/mailman/listinfo/python-list