Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]
On Sat, 28 May 2016 01:53 am, Rustom Mody wrote:

> On Friday, May 27, 2016 at 7:21:41 PM UTC+5:30, Random832 wrote:
>> On Fri, May 27, 2016, at 05:56, Steven D'Aprano wrote:
>> > On Fri, 27 May 2016 05:04 pm, Marko Rauhamaa wrote:
>> >
>> > > They are all ASCII derivatives. Those that aren't don't exist.
>> >
>> > *plonk*
>>
>> That's a bit harsh, considering that this argument started ...
>
> Is it now?
> For some reason I am reminded that when I was in junior school and we
> wanted to fight, we said "I am not talking to you!" made a certain gesture
> and smartly marched off.
>
> I guess the gesture is culture-dependent and in these parts of the world
> it sounds like "*plonk*"

https://en.wikipedia.org/wiki/Plonk_%28Usenet%29

--
Steven

--
https://mail.python.org/mailman/listinfo/python-list
Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]
On Saturday, May 28, 2016 at 12:34:14 AM UTC+5:30, Marko Rauhamaa wrote:
> Random832:
>
> > On Fri, May 27, 2016, at 05:56, Steven D'Aprano wrote:
> >> On Fri, 27 May 2016 05:04 pm, Marko Rauhamaa wrote:
> >> > They are all ASCII derivatives. Those that aren't don't exist.
> >> *plonk*
> >
> > That's a bit harsh,
>
> Everybody has a right to plonk anybody -- and even declare it
> ceremoniously.
>
> Steven and I have recurring run-ins because Steven is an expert on
> numerous trees while I'm constantly trying to shift the discussion to
> the forest.

How disconnected...
Yours graph-theoretically,
Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]
Random832:

> On Fri, May 27, 2016, at 05:56, Steven D'Aprano wrote:
>> On Fri, 27 May 2016 05:04 pm, Marko Rauhamaa wrote:
>> > They are all ASCII derivatives. Those that aren't don't exist.
>> *plonk*
>
> That's a bit harsh,

Everybody has a right to plonk anybody -- and even declare it
ceremoniously.

Steven and I have recurring run-ins because Steven is an expert on
numerous trees while I'm constantly trying to shift the discussion to
the forest.

Marko
Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]
On Sat, May 28, 2016 at 2:09 AM, Random832 wrote:
> On Fri, May 27, 2016, at 11:53, Rustom Mody wrote:
>> And coding systems are VERY political.
>> Sure what characters are put in (and not) is political
>> But more invisible but equally political is the collating order.
>>
>> eg No one understands what jmf's gripes are... My guess is that a Euro
>> costs 3 times a Dollar.
>>
>> >>> "€".encode("UTF-8")
>> b'\xe2\x82\xac'
>> >>> "$".encode("UTF-8")
>> b'$'
>>
>> [It's another matter that this is not the evil deed of python but of
>> UTF-8!]
>
> AIUI jmf's issue is that python's string type (nothing to do with UTF-8)
> doesn't treat all strings equally. Strings that are only in Latin-1
> (including your dollar example) have only one byte per character,
> whereas strings with BMP characters have two bytes per character (he
> also has some more difficult to understand objections to the large fixed
> overhead and the cached UTF-8 version [which ASCII strings don't have])

The objection, thus, is "some strings perform faster than others do".
The only time that's ever been a serious consideration has been in
cryptography, where timing-based attacks can be used to leech info
about a private key. But this ain't that.

ChrisA
Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]
On Fri, May 27, 2016, at 11:53, Rustom Mody wrote:
> And coding systems are VERY political.
> Sure what characters are put in (and not) is political
> But more invisible but equally political is the collating order.
>
> eg No one understands what jmf's gripes are... My guess is that a Euro
> costs 3 times a Dollar.
>
> >>> "€".encode("UTF-8")
> b'\xe2\x82\xac'
> >>> "$".encode("UTF-8")
> b'$'
>
> [It's another matter that this is not the evil deed of python but of
> UTF-8!]

AIUI jmf's issue is that Python's string type (nothing to do with UTF-8)
doesn't treat all strings equally. Strings that are only in Latin-1
(including your dollar example) have only one byte per character,
whereas strings with BMP characters have two bytes per character. (He
also has some more difficult-to-understand objections to the large fixed
overhead and the cached UTF-8 version [which ASCII strings don't have].)
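The per-character storage described above can be observed directly. The following is a minimal sketch, assuming CPython 3.3 or later (the flexible string representation); other implementations may store strings differently. The helper name `bytes_per_char` is made up for this illustration.

```python
import sys

def bytes_per_char(ch: str, n: int = 1000) -> float:
    """Estimate storage per character by growing a run of `ch` by n characters.

    Taking the difference of two sizes cancels out the fixed per-object
    overhead, so only the per-character cost remains.
    """
    short = ch * 100
    long = ch * (100 + n)
    return (sys.getsizeof(long) - sys.getsizeof(short)) / n

samples = [("ASCII", "$"), ("Latin-1", "é"), ("BMP", "€"), ("astral", "😀")]
for label, ch in samples:
    width = bytes_per_char(ch)
    print(f"{label:8} U+{ord(ch):04X}: {width:.0f} byte(s) per character")
```

Note that this measures the in-memory representation of `str`, which is unrelated to how many bytes the same text takes once encoded as UTF-8.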
Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]
On Friday, May 27, 2016 at 7:21:41 PM UTC+5:30, Random832 wrote:
> On Fri, May 27, 2016, at 05:56, Steven D'Aprano wrote:
> > On Fri, 27 May 2016 05:04 pm, Marko Rauhamaa wrote:
> >
> > > They are all ASCII derivatives. Those that aren't don't exist.
> >
> > *plonk*
>
> That's a bit harsh, considering that this argument started ...

Is it now?
For some reason I am reminded that when I was in junior school and we
wanted to fight, we said "I am not talking to you!", made a certain
gesture and smartly marched off.

I guess the gesture is culture-dependent and in these parts of the world
it sounds like "*plonk*".

Back in the adult world, when pique is out of proportion to irritant we
may guess there is some politics around.

And coding systems are VERY political.
Sure, what characters are put in (and not) is political.
But more invisible, and equally political, is the collating order.

E.g. no one understands what jmf's gripes are... My guess is that a Euro
costs 3 times a Dollar.

    >>> "€".encode("UTF-8")
    b'\xe2\x82\xac'
    >>> "$".encode("UTF-8")
    b'$'

[It's another matter that this is not the evil deed of Python but of
UTF-8!]
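The "Euro costs 3 times a Dollar" remark generalises: UTF-8 spends between one and four bytes per code point, depending on the code point's value. A small sketch using only the built-in `str.encode`; the helper name `utf8_len` is hypothetical.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes the single character `ch` occupies in UTF-8."""
    return len(ch.encode("utf-8"))

# One sample from each UTF-8 length class.
for ch in ["$", "é", "€", "😀"]:
    print(f"U+{ord(ch):04X} -> {utf8_len(ch)} byte(s): {ch.encode('utf-8')!r}")
```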
Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]
On Fri, May 27, 2016, at 05:56, Steven D'Aprano wrote:
> On Fri, 27 May 2016 05:04 pm, Marko Rauhamaa wrote:
>
> > They are all ASCII derivatives. Those that aren't don't exist.
>
> *plonk*

That's a bit harsh, considering that this argument started when you
invented your own definition of "ASCII derivative", which he never
accepted and has no obligation to accept, in order to prove that he's
wrong. That's called a straw-man argument.
Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]
On Fri, 27 May 2016 05:04 pm, Marko Rauhamaa wrote:

> They are all ASCII derivatives. Those that aren't don't exist.

*plonk*

--
Steven
Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]
Steven D'Aprano:

> I don't mind being corrected if I make a genuine mistake, in fact I
> appreciate correction. But being corrected for something I already
> acknowledged? That's just arguing for the sake of arguing.
> [...]
>> ASCII derivatives are in wide use in the Americas and Antarctica as
>> well. They have been spotted in Australia, New Zealand, Oceania and
>> Africa. You shouldn't be surprised if you run into them in Asia, either.
>
> Of course.
>
> But they're not *all encodings*, and while they're important, there
> are plenty of non-ASCII encodings and encodings which violate the "one
> byte equals one character" invariant followed by ASCII and
> extended-ASCII encodings.

They are all ASCII derivatives. Those that aren't don't exist.

   The vast majority of code pages in current use are supersets of ASCII

   https://en.wikipedia.org/wiki/Code_page#Relationship_to_ASCII

Just like a byte is always 8 bits wide, and C's integers are all
two's-complement.

Marko
Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]
On Fri, 27 May 2016 04:10 pm, Marko Rauhamaa wrote:

> Steven D'Aprano:
>> This concept of ASCII = "all character sets", or "nearly all", or
>> "okay, maybe not nearly all of them, but just the important ones" is
>> terribly Euro-centric. The very idea would be laughable in Japan and
>> other East Asian countries, where Shift-JIS and Big5 still dominate.
>
> Shift-JIS and Big5 are ASCII derivatives:

Gosh. Really? If you looked at what I wrote, I said:

"Then there are the variable-width encodings which are backwards
compatible with ASCII *only* in the sense that text containing only ASCII
characters uses the same sequence of bytes as ASCII would."

and gave both Shift-JIS and Big5 as examples.

But you cannot treat them as "like ASCII" or "extended ASCII" because they
are multibyte encodings. Unlike UTF-8, if you mangle a Shift-JIS or Big5
multibyte sequence, you don't just corrupt a single character, you corrupt
a potentially unlimited amount of subsequent text.

I don't mind being corrected if I make a genuine mistake, in fact I
appreciate correction. But being corrected for something I already
acknowledged? That's just arguing for the sake of arguing.

[...]
> ASCII derivatives are in wide use in the Americas and Antarctica as
> well. They have been spotted in Australia, New Zealand, Oceania and
> Africa. You shouldn't be surprised if you run into them in Asia, either.

Of course.

But they're not *all encodings*, and while they're important, there are
plenty of non-ASCII encodings and encodings which violate the "one byte
equals one character" invariant followed by ASCII and extended-ASCII
encodings.

--
Steven
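The difference in error recovery between UTF-8 and Shift-JIS can be sketched as follows. This is an illustration under stated assumptions, not a general proof: the sample text is deliberately chosen so that the second byte of each Shift-JIS character also falls in the lead-byte range, which is what keeps the decoder misaligned. Many other inputs would resynchronise after a character or two. The helper name `drop_first_byte` is made up for this example.

```python
def drop_first_byte(text: str, codec: str) -> str:
    """Encode `text`, lose the first byte, and decode whatever is left."""
    damaged = text.encode(codec)[1:]
    return damaged.decode(codec, errors="replace")

# Katakana chosen so that Shift-JIS trail bytes look like lead bytes.
text = "メモリメモリ"

utf8_result = drop_first_byte(text, "utf-8")
sjis_result = drop_first_byte(text, "shift_jis")

print("utf-8    :", utf8_result)
print("shift_jis:", sjis_result)

# UTF-8 is self-synchronising: only the damaged character is lost.
print("UTF-8 tail intact:    ", utf8_result.endswith(text[1:]))
# Shift-JIS pairs up the wrong bytes from here to the end of the text.
print("Shift-JIS chars found:", any(ch in sjis_result for ch in text))
```

UTF-8 continuation bytes can never be mistaken for the start of a character, so a decoder finds the next boundary immediately. In Shift-JIS the lead-byte and trail-byte ranges overlap, so a single lost byte can shift every following pair.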
Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]
Steven D'Aprano:

> This concept of ASCII = "all character sets", or "nearly all", or
> "okay, maybe not nearly all of them, but just the important ones" is
> terribly Euro-centric. The very idea would be laughable in Japan and
> other East Asian countries, where Shift-JIS and Big5 still dominate.

Shift-JIS and Big5 are ASCII derivatives:

    >>> "hello".encode("shift-JIS")
    b'hello'
    >>> "hello".encode("big5")
    b'hello'

> So please, open your mind to the reality of computing outside of
> Europe.

ASCII derivatives are in wide use in the Americas and Antarctica as
well. They have been spotted in Australia, New Zealand, Oceania and
Africa. You shouldn't be surprised if you run into them in Asia, either.

Marko
Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]
Erik writes:

> On 26/05/16 08:21, Jussi Piitulainen wrote:
>> UTF-8 ASCII is nice
>>
>> UTF-16 ASCII is weird.
>
> I am dumbstruck.

I'm joking, of course. But those statements do make sense when one knows
to distinguish a character set from its encoding as bytes, and then the
UTF-8 encoding of ASCII really is nice. Where I live, anyway :)
Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]
On Fri, 27 May 2016 07:12 am, Marko Rauhamaa wrote:

> However, I must correct myself slightly: ASCII refers to any
> byte-oriented character encoding scheme *largely coinciding with ASCII
> proper*. But since all of them *are* derivatives of ASCII proper,
> mentioning is somewhat redundant.

"All" of them?

Here is a small selection of codecs provided by Python:

    py> codecs = "cp037 cp273 cp500 cp875 cp1026 cp1140 utf_16be".split()
    py> for cd in codecs:
    ...     print("ab.12".encode(cd))  # ASCII gives b'ab.12'
    ...
    b'\x81\x82K\xf1\xf2'
    b'\x81\x82K\xf1\xf2'
    b'\x81\x82K\xf1\xf2'
    b'\x81\x82K\xf1\xf2'
    b'\x81\x82K\xf1\xf2'
    b'\x81\x82K\xf1\xf2'
    b'\x00a\x00b\x00.\x001\x002'

There's also at least one other double-byte character set which, as far
as I can tell, isn't supported by Python: KS X 1001, used in Korea.

Then there are the variable-width encodings which are backwards
compatible with ASCII only in the sense that text containing *only* ASCII
characters uses the same sequence of bytes as ASCII would. But being
variable-width, they cannot be treated as a simple array of bytes with a
fixed 1 byte = 1 character mapping. Examples include UTF-8, UTF-7, the
various Shift-JIS encodings, EUC-JP, EUC-KR, EUC-TW, GB18030, Big5, and
others.

This concept of ASCII = "all character sets", or "nearly all", or "okay,
maybe not nearly all of them, but just the important ones" is terribly
Euro-centric. The very idea would be laughable in Japan and other East
Asian countries, where Shift-JIS and Big5 still dominate.

So please, open your mind to the reality of computing outside of Europe.
ASCII-based encodings no more encompass all of the world's natural
languages (not even the "important" ones) than "everyone is using
Internet Explorer and Windows XP, right?" describes the state of the
Internet.

--
Steven
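The two notions in play here, "ASCII-only text encodes to the same bytes as ASCII" and "one byte equals one character", can be checked separately. A rough sketch using codecs that ship with CPython; the helper names are hypothetical, and a single sample string only probes compatibility, it does not prove it for every ASCII character.

```python
def ascii_compatible(codec: str, sample: str = "ab.12") -> bool:
    """True if the ASCII-only sample encodes to the same bytes as in ASCII."""
    return sample.encode(codec) == sample.encode("ascii")

def one_byte_per_char(codec: str, sample: str) -> bool:
    """True if the sample encodes to exactly one byte per character."""
    return len(sample.encode(codec)) == len(sample)

for codec in ["latin_1", "utf_8", "shift_jis", "big5", "cp037", "utf_16_be"]:
    print(f"{codec:10} ascii-compatible on sample: {ascii_compatible(codec)}")

# Variable width shows up as soon as non-ASCII text is encoded.
print(one_byte_per_char("latin_1", "café"))    # fixed-width single-byte codec
print(one_byte_per_char("utf_8", "café"))      # 'é' takes two bytes
print(one_byte_per_char("shift_jis", "日本"))   # two bytes per kanji
```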
Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]
Erik:

> On 26/05/16 10:20, Marko Rauhamaa wrote:
>> ASCII has taken new meanings. For most coders, in relaxed style, it
>> refers to any byte-oriented character encoding scheme. In C terms,
>>
>>     ASCII == char *
>
> Is this really true? So by "taken new meanings" you are saying that it
> has actually lost all meaning.

You are exaggerating.

> The 'S' stands for "Standard". It's an encoding (each byte value refers
> to a particular character value according to that standard).
>
> To say that any array of bytes, regardless of what each byte value
> should be interpreted as, is "ASCII" makes no sense.

Read what I wrote: "character encoding scheme". Even C's "char" type
strongly suggests textual characters.

However, I must correct myself slightly: ASCII refers to any
byte-oriented character encoding scheme *largely coinciding with ASCII
proper*. But since all of them *are* derivatives of ASCII proper,
mentioning it is somewhat redundant.

Marko
Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]
On 26/05/16 08:21, Jussi Piitulainen wrote:
> UTF-8 ASCII is nice
>
> UTF-16 ASCII is weird.

I am dumbstruck.

E.
Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]
On 26/05/16 10:20, Marko Rauhamaa wrote:
> ASCII has taken new meanings. For most coders, in relaxed style, it
> refers to any byte-oriented character encoding scheme. In C terms,
>
>     ASCII == char *

Is this really true? So by "taken new meanings" you are saying that it
has actually lost all meaning.

The 'S' stands for "Standard". It's an encoding (each byte value refers
to a particular character value according to that standard).

To say that any array of bytes, regardless of what each byte value
should be interpreted as, is "ASCII" makes no sense.

How "relaxed" are these 'coders' you're referring to, exactly? ;)

Or, have I fallen for your trap, and you're joking with me too?

E.
Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]
On Thursday, May 26, 2016 at 1:41:41 PM UTC+5:30, Erik wrote:
> On 26/05/16 02:28, Dennis Lee Bieber wrote:
> > On Wed, 25 May 2016 22:03:34 +0100, Erik
> > declaimed the following:
> >
> >> Indeed - at that time, I was working with COBOL on an IBM S/370. On that
> >> system, we used EBCDIC ASCII. That was the wierdest ASCII of all ;)
> >>
> > It would have to be... Extended Binary Coded Decimal Interchange Code,
> > as I recall, predates American Standard Code for Information Interchange.
> >
> > EBCDIC's 8-bit code is actually more closely linked to Hollerith card
> > encodings.
>
> I really didn't think it would be necessary to point this out (I thought
> the "" and emoji would be enough), but for the record, my
> previous message was clearly a joke.
>
> To break it down, Stephen was making the observation that people call
> all sorts of extended ASCII encodings (including proprietary things)
> "ASCII". So I took it to the extreme and called something that had
> nothing to do with ASCII a type of ASCII.
>
> As they say, if one has to explain one's jokes then they are probably
> not funny ...

JFTR I found the comment hilarious and even thought of incorporating it
into
http://blog.languager.org/2014/04/unicode-and-unix-assumption.html
but could not find a smooth place to do so.

[Mad run: Intensive course to run next week]
Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]
Erik:

> To break it down, Stephen was making the observation that people call
> all sorts of extended ASCII encodings (including proprietary things)
> "ASCII". So I took it to the extreme and called something that had
> nothing to do with ASCII a type of ASCII.

ASCII has taken new meanings. For most coders, in relaxed style, it
refers to any byte-oriented character encoding scheme. In C terms,

    ASCII == char *

Marko
Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]
On Thu, May 26, 2016 at 7:11 PM, Marko Rauhamaa wrote:
> Python didn't come out unscathed, either. Multithreading is being
> replaced with asyncio

Incorrect. Threading is still important - it's not being replaced.
Asynchronous code support is being added to an existing pool of
multiprocessing techniques, so you can now use preemptive processes or
threads, or cooperative asyncio, depending on what you need.

ChrisA
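As a rough illustration of the point that the two styles coexist, the sketch below runs the same trivial job once on preemptive threads and once as cooperative asyncio tasks. It assumes Python 3.7 or later for `asyncio.run`; the function names are made up for this example, and the squaring job stands in for real work.

```python
import asyncio
import threading

def work(n: int) -> int:
    """Stand-in for a real task."""
    return n * n

# Preemptive style: the OS may switch threads at any point.
thread_results = {}

def worker(n: int) -> None:
    thread_results[n] = work(n)

threads = [threading.Thread(target=worker, args=(n,)) for n in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Cooperative style: tasks yield control only at `await`.
async def async_work(n: int) -> int:
    await asyncio.sleep(0)  # explicit yield point
    return work(n)

async def main() -> list:
    return await asyncio.gather(*(async_work(n) for n in range(3)))

async_results = asyncio.run(main())

print(sorted(thread_results.values()))
print(async_results)
```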
Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]
Jussi Piitulainen:

> UTF-16 ASCII is weird. Wierd. Probably all right in an environment
> that is otherwise set to use UTF-16.
>
> Nothing is as weird as a mix of different encodings of a foreign
> script in the same "plain text" file, said to be "Unicode".

Some children are just born under unlucky stars. Windows and Java are
among them. If they had been designed a few years earlier or a few years
later, they could have evaded the UTF-16 embarrassment, maybe the
multithreading embarrassment as well.

Python didn't come out unscathed, either. Multithreading is being
replaced with asyncio, and Python 3 broke backward-compatibility to get
Unicode right.

Marko
Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]
On 26/05/16 02:28, Dennis Lee Bieber wrote:
> On Wed, 25 May 2016 22:03:34 +0100, Erik declaimed the following:
>
>> Indeed - at that time, I was working with COBOL on an IBM S/370. On that
>> system, we used EBCDIC ASCII. That was the wierdest ASCII of all ;)
>
> It would have to be... Extended Binary Coded Decimal Interchange Code,
> as I recall, predates American Standard Code for Information Interchange.
>
> EBCDIC's 8-bit code is actually more closely linked to Hollerith card
> encodings.

I really didn't think it would be necessary to point this out (I thought
the "" and emoji would be enough), but for the record, my previous
message was clearly a joke.

To break it down, Stephen was making the observation that people call
all sorts of extended ASCII encodings (including proprietary things)
"ASCII". So I took it to the extreme and called something that had
nothing to do with ASCII a type of ASCII.

As they say, if one has to explain one's jokes then they are probably
not funny ... :(

E.
Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]
On Thursday, May 26, 2016 at 12:52:09 PM UTC+5:30, Jussi Piitulainen wrote:
> UTF-16 ASCII is weird. Wierd. Probably all right in an environment that
> is otherwise set to use UTF-16.

In
http://blog.languager.org/2015/03/whimsical-unicode.html
are some examples of why UTF-16 is bug-inviting
[ section is "Wide is too narrow" ]
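The linked post is not reproduced here. As a general illustration of one well-known UTF-16 pitfall, the sketch below shows that characters outside the Basic Multilingual Plane need two 16-bit code units (a surrogate pair), so code that assumes "one unit equals one character" miscounts them. The helper name `utf16_code_units` is hypothetical.

```python
def utf16_code_units(text: str) -> int:
    """Count the 16-bit code units needed to store `text` in UTF-16."""
    return len(text.encode("utf-16-le")) // 2

for s in ["A", "€", "😀"]:
    print(f"{s!r}: {len(s)} code point(s), {utf16_code_units(s)} UTF-16 code unit(s)")
```

In Python 3, `len()` counts code points, which is why the two columns differ only for the last sample.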
Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]
Erik writes:

> On 25/05/16 11:19, Steven D'Aprano wrote:
>> On Wednesday 25 May 2016 19:10, Christopher Reimer wrote:
>>
>>> Back in the early 1980's, I grew up on 8-bit processors and latin-1
>>> was all we had for ASCII.
>>
>> It really, truly wasn't. But you can be forgiven for not knowing
>> that, since until the rise of the public Internet most people weren't
>> exposed to more than one code page or encoding, and it was incredibly
>> common for people to call *any* encoding "ASCII".
>
> Indeed - at that time, I was working with COBOL on an IBM S/370. On
> that system, we used EBCDIC ASCII. That was the wierdest ASCII of all
> ;)

UTF-8 ASCII is nice.

UTF-16 ASCII is weird. Wierd. Probably all right in an environment that
is otherwise set to use UTF-16.

Nothing is as weird as a mix of different encodings of a foreign script
in the same "plain text" file, said to be "Unicode".
Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]
On 25/05/16 11:19, Steven D'Aprano wrote:
> On Wednesday 25 May 2016 19:10, Christopher Reimer wrote:
>
>> Back in the early 1980's, I grew up on 8-bit processors and latin-1
>> was all we had for ASCII.
>
> It really, truly wasn't. But you can be forgiven for not knowing that,
> since until the rise of the public Internet most people weren't exposed
> to more than one code page or encoding, and it was incredibly common
> for people to call *any* encoding "ASCII".

Indeed - at that time, I was working with COBOL on an IBM S/370. On that
system, we used EBCDIC ASCII. That was the wierdest ASCII of all ;)

E.
Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]
On Wed, May 25, 2016 at 8:19 PM, Steven D'Aprano wrote:
> While the code page system was necessary at the time, the legacy of
> them today continues to plague computer users, causing mojibake,
> errors on file systems[1], and holding back the adoption of Unicode.
>
> [1] I'm speaking from experience there. Take files created on a
> Windows machine using some legacy code page, and try to copy them to
> another server using Unicode, and depending on the intelligence of the
> server, you may not be able to copy them. On the flip side, there are
> many file names I can easily create on Linux but cannot copy to a FAT
> file system.

And getting a .zip file from a Windows user that had a file in it
called "Café Sounds.something", extracting it on Linux, and finding it
called "Caf\xe9" or something. Very annoying. Fortunately it was only
the one file in a large directory.

ChrisA
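The "Caf\xe9" symptom can be reproduced in a few lines. This sketch assumes the file name was stored as Latin-1 bytes, which matches the `\xe9` in the message. Which code page the archiving tool really used is not known, and `"Café Sounds"` is just a stand-in for the full file name.

```python
name = "Café Sounds"
raw = name.encode("latin-1")  # the bytes a legacy single-byte tool might store

# Read back as UTF-8, the lone 0xE9 byte is not a valid sequence.
as_replaced = raw.decode("utf-8", errors="replace")
as_escaped = raw.decode("utf-8", errors="surrogateescape")

# Decoding with the code page that was actually used recovers the name.
recovered = raw.decode("latin-1")

print(raw)
print(ascii(as_replaced))
print(ascii(as_escaped))
print(recovered)
```

The `surrogateescape` handler is what Python itself uses for undecodable file names on POSIX systems, so the original bytes can still be round-tripped even though the name prints oddly.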