Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]

2016-05-28 Thread Steven D'Aprano
On Sat, 28 May 2016 01:53 am, Rustom Mody wrote:

> On Friday, May 27, 2016 at 7:21:41 PM UTC+5:30, Random832 wrote:
>> On Fri, May 27, 2016, at 05:56, Steven D'Aprano wrote:
>> > On Fri, 27 May 2016 05:04 pm, Marko Rauhamaa wrote:
>> > 
>> > > They are all ASCII derivatives. Those that aren't don't exist.
>> > 
>> > *plonk*
>> 
>> That's a bit harsh, considering that this argument started ...
> 
> Is it now?
> For some reason I am reminded that when I was in junior school and we
> wanted to fight, we said "I am not talking to you!" made a certain gesture
> and smartly marched off.
> 
> I guess the gesture is culture-dependent and in these parts of the world
> it sounds like "*plonk*"

https://en.wikipedia.org/wiki/Plonk_%28Usenet%29



-- 
Steven

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]

2016-05-27 Thread Rustom Mody
On Saturday, May 28, 2016 at 12:34:14 AM UTC+5:30, Marko Rauhamaa wrote:
> Random832 :
> 
> > On Fri, May 27, 2016, at 05:56, Steven D'Aprano wrote:
> >> On Fri, 27 May 2016 05:04 pm, Marko Rauhamaa wrote:
> >> > They are all ASCII derivatives. Those that aren't don't exist.
> >> *plonk*
> >
> > That's a bit harsh,
> 
> Everybody has a right to plonk anybody -- and even declare it
> ceremoniously.
> 
> Steven and I have recurring run-ins because Steven is an expert on
> numerous trees while I'm constantly trying to shift the discussion to
> the forest.

How disconnected...
Yours graph-theoretically,


Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]

2016-05-27 Thread Marko Rauhamaa
Random832 :

> On Fri, May 27, 2016, at 05:56, Steven D'Aprano wrote:
>> On Fri, 27 May 2016 05:04 pm, Marko Rauhamaa wrote:
>> > They are all ASCII derivatives. Those that aren't don't exist.
>> *plonk*
>
> That's a bit harsh,

Everybody has a right to plonk anybody -- and even declare it
ceremoniously.

Steven and I have recurring run-ins because Steven is an expert on
numerous trees while I'm constantly trying to shift the discussion to
the forest.


Marko


Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]

2016-05-27 Thread Chris Angelico
On Sat, May 28, 2016 at 2:09 AM, Random832  wrote:
> On Fri, May 27, 2016, at 11:53, Rustom Mody wrote:
>> And coding systems are VERY political.
>> Sure what characters are put in (and not) is political
>> But more invisible but equally political is the collating order.
>>
>> eg No one understands what jmf's gripes are... My guess is that a Euro
>> costs 3 times a Dollar.
>>
>> >>> "€".encode("UTF-8")
>> b'\xe2\x82\xac'
>> >>> "$".encode("UTF-8")
>> b'$'
>>
>> [It's another matter that this is not the evil deed of python but of
>> UTF-8!]
>
> AIUI jmf's issue is that python's string type (nothing to do with UTF-8)
> doesn't treat all strings equally. Strings that are only in Latin-1
> (including your dollar example) have only one byte per character,
> whereas strings with BMP characters have two bytes per character (he
> also has some more difficult to understand objections to the large fixed
> overhead and the cached UTF-8 version [which ASCII strings don't have])

The objection, thus, is "some strings perform faster than others do".
The only time that's ever been a serious consideration has been in
cryptography, where timing-based attacks can be used to leech info
about a private key. But this ain't that.

ChrisA


Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]

2016-05-27 Thread Random832
On Fri, May 27, 2016, at 11:53, Rustom Mody wrote:
> And coding systems are VERY political.
> Sure what characters are put in (and not) is political
> But more invisible but equally political is the collating order.
> 
> eg No one understands what jmf's gripes are... My guess is that a Euro
> costs 3 times a Dollar.
> 
> >>> "€".encode("UTF-8")
> b'\xe2\x82\xac'
> >>> "$".encode("UTF-8")
> b'$'
> 
> [It's another matter that this is not the evil deed of python but of
> UTF-8!]

AIUI jmf's issue is that python's string type (nothing to do with UTF-8)
doesn't treat all strings equally. Strings that are only in Latin-1
(including your dollar example) have only one byte per character,
whereas strings with BMP characters have two bytes per character (he
also has some more difficult to understand objections to the large fixed
overhead and the cached UTF-8 version [which ASCII strings don't have])
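
A rough way to see the per-character widths described above, assuming CPython 3.3 or later (the flexible string representation is an implementation detail, not a language guarantee). The helper name bytes_per_char is made up for this sketch. On such a CPython the three lines are expected to report 1, 2 and 4.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate per-character storage from two strings of one repeated character."""
    # The fixed per-object overhead is the same for both strings,
    # so it cancels out in the difference.
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

# Latin-1 range, BMP, and beyond the BMP respectively.
for ch in ("$", "\u20ac", "\U0001F600"):
    print(repr(ch), bytes_per_char(ch))
```

Note that this measures the in-memory string object only; it says nothing about UTF-8 or any other encoding on disk or on the wire.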


Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]

2016-05-27 Thread Rustom Mody
On Friday, May 27, 2016 at 7:21:41 PM UTC+5:30, Random832 wrote:
> On Fri, May 27, 2016, at 05:56, Steven D'Aprano wrote:
> > On Fri, 27 May 2016 05:04 pm, Marko Rauhamaa wrote:
> > 
> > > They are all ASCII derivatives. Those that aren't don't exist.
> > 
> > *plonk*
> 
> That's a bit harsh, considering that this argument started ...

Is it now?
For some reason I am reminded that when I was in junior school and we wanted 
to fight, we said "I am not talking to you!" made a certain gesture and smartly
marched off.

I guess the gesture is culture-dependent and in these parts of the world it
sounds like "*plonk*"

Back in the adult world when pique is out of proportion to irritant we may guess
there is some politics around

And coding systems are VERY political.
Sure what characters are put in (and not) is political
But more invisible but equally political is the collating order.

eg No one understands what jmf's gripes are... My guess is that a Euro
costs 3 times a Dollar.

>>> "€".encode("UTF-8")
b'\xe2\x82\xac'
>>> "$".encode("UTF-8")
b'$'

[It's another matter that this is not the evil deed of python but of UTF-8!]
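
For reference, the "price" difference falls out of how UTF-8 allots bytes by code point range. A small sketch; the helper name utf8_len is invented here and is not a standard library function.

```python
def utf8_len(ch):
    """Number of bytes UTF-8 needs for a single (non-surrogate) code point."""
    cp = ord(ch)
    if cp < 0x80:        # ASCII range: one byte, same value as in ASCII
        return 1
    if cp < 0x800:       # two bytes
        return 2
    if cp < 0x10000:     # rest of the Basic Multilingual Plane: three bytes
        return 3
    return 4             # supplementary planes

for ch in "$\u00a3\u20ac":
    # Cross-check the ranges above against Python's own encoder.
    assert utf8_len(ch) == len(ch.encode("utf-8"))
    print(ch, utf8_len(ch))
```

So the dollar sits in the one-byte range, the pound in the two-byte range, and the euro in the three-byte range.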


Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]

2016-05-27 Thread Random832
On Fri, May 27, 2016, at 05:56, Steven D'Aprano wrote:
> On Fri, 27 May 2016 05:04 pm, Marko Rauhamaa wrote:
> 
> > They are all ASCII derivatives. Those that aren't don't exist.
> 
> *plonk*

That's a bit harsh, considering that this argument started when you
invented your own definition of "ASCII derivative", which he never
accepted and has no obligation to accept, in order to prove that he's
wrong. That's called a straw-man argument.


Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]

2016-05-27 Thread Steven D'Aprano
On Fri, 27 May 2016 05:04 pm, Marko Rauhamaa wrote:

> They are all ASCII derivatives. Those that aren't don't exist.

*plonk*




-- 
Steven



Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]

2016-05-27 Thread Marko Rauhamaa
Steven D'Aprano :

> I don't mind being corrected if I make a genuine mistake, in fact I
> appreciate correction. But being corrected for something I already
> acknowledged? That's just arguing for the sake of arguing.
> [...]
>> ASCII derivatives are in wide use in the Americas and Antarctica as
>> well. They have been spotted in Australia, New Zealand, Oceania and
>> Africa. You shouldn't be surprised if you run into them in Asia, either.
>
> Of course.
>
> But they're not *all encodings*, and while they're important, there
> are plenty of non-ASCII encodings and encodings which violate the "one
> byte equals one character" invariant followed by ASCII and
> extended-ASCII encodings.

They are all ASCII derivatives. Those that aren't don't exist.

   The vast majority of code pages in current use are supersets of ASCII
   <https://en.wikipedia.org/wiki/Code_page#Relationship_to_ASCII>

Just like a byte is always 8 bits wide, and C's integers are all
two's-complement.


Marko


Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]

2016-05-26 Thread Steven D'Aprano
On Fri, 27 May 2016 04:10 pm, Marko Rauhamaa wrote:

> Steven D'Aprano :
>> This concept of ASCII = "all character sets", or "nearly all", or
>> "okay, maybe not nearly all of them, but just the important ones" is
>> terribly Euro-centric. The very idea would be laughable in Japan and
>> other East Asian countries, where Shift-JIS and Big5 still dominate.
> 
> Shift-JIS and Big5 are ASCII derivatives:

Gosh. Really?

If you looked at what I wrote, I said:

"Then there are the variable-width encodings which are backwards compatible
with ASCII *only* in the sense that text containing only ASCII characters
uses the same sequence of bytes as ASCII would."

and gave both Shift-JIS and Big5 as examples. But you cannot treat them
as "like ASCII" or "extended ASCII" because they are multibyte encodings.

Unlike UTF-8, if you mangle a Shift-JIS or Big5 multibyte sequence, you
don't just corrupt a single character, you corrupt a potentially unlimited
amount of subsequent text.
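
One concrete reason these encodings cannot be handled byte-by-byte like ASCII: in Shift-JIS the second byte of a two-byte character can fall in the ASCII range, so a byte taken on its own does not say whether it is a character or half of one. A minimal sketch using Python's shift_jis codec; the helper name ascii_bytes_inside is invented for illustration.

```python
def ascii_bytes_inside(text, codec):
    """Return the bytes below 0x80 that occur in the encoding of text."""
    return bytes(b for b in text.encode(codec) if b < 0x80)

katakana_so = "\u30bd"  # KATAKANA LETTER SO, a single non-ASCII character

# In Shift-JIS the trail byte of this character is 0x5C, the ASCII backslash.
print(katakana_so.encode("shift_jis"))
print(ascii_bytes_inside(katakana_so, "shift_jis"))

# In UTF-8 every byte of a multi-byte sequence is 0x80 or above.
print(ascii_bytes_inside(katakana_so, "utf-8"))
```

A byte-oriented tool that searches for a backslash would therefore match in the middle of a Shift-JIS character, which cannot happen with UTF-8.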

I don't mind being corrected if I make a genuine mistake, in fact I
appreciate correction. But being corrected for something I already
acknowledged? That's just arguing for the sake of arguing.



[...]
> ASCII derivatives are in wide use in the Americas and Antarctica as
> well. They have been spotted in Australia, New Zealand, Oceania and
> Africa. You shouldn't be surprised if you run into them in Asia, either.

Of course.

But they're not *all encodings*, and while they're important, there are
plenty of non-ASCII encodings and encodings which violate the "one byte
equals one character" invariant followed by ASCII and extended-ASCII
encodings.




-- 
Steven



Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]

2016-05-26 Thread Marko Rauhamaa
Steven D'Aprano :
> This concept of ASCII = "all character sets", or "nearly all", or
> "okay, maybe not nearly all of them, but just the important ones" is
> terribly Euro-centric. The very idea would be laughable in Japan and
> other East Asian countries, where Shift-JIS and Big5 still dominate.

Shift-JIS and Big5 are ASCII derivatives:

   >>> "hello".encode("shift-JIS")
   b'hello'
   >>> "hello".encode("big5")
   b'hello'

> So please, open your mind to the reality of computing outside of
> Europe.

ASCII derivatives are in wide use in the Americas and Antarctica as
well. They have been spotted in Australia, New Zealand, Oceania and
Africa. You shouldn't be surprised if you run into them in Asia, either.


Marko


Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]

2016-05-26 Thread Jussi Piitulainen
Erik writes:

> On 26/05/16 08:21, Jussi Piitulainen wrote:
>> UTF-8 ASCII is nice
>>
>> UTF-16 ASCII is weird.
>
> I am dumbstruck.

I'm joking, of course.

But those statements do make sense when one knows to distinguish a
character set from its encoding as bytes, and then the UTF-8 encoding of
ASCII really is nice.

Where I live, anyway :)
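
The distinction can be made concrete: the character set is the same in each case, only the byte encoding differs. A short sketch, using "utf-16-le" so that the byte order is explicit and no BOM is added.

```python
text = "hello"   # characters drawn from the ASCII character set

as_ascii = text.encode("ascii")
as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16-le")   # explicit byte order, so no BOM

print(as_ascii)
print(as_utf8)    # same bytes as the ASCII encoding
print(as_utf16)   # every character followed by a zero byte
```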


Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]

2016-05-26 Thread Steven D'Aprano
On Fri, 27 May 2016 07:12 am, Marko Rauhamaa wrote:

> However, I must correct myself slightly: ASCII refers to any
> byte-oriented character encoding scheme *largely coinciding with ASCII
> proper*. But since all of them *are* derivatives of ASCII proper,
> mentioning it is somewhat redundant.

"All" of them?


Here is a small selection of codecs provided by Python:

py> codecs = "cp037 cp273 cp500 cp875 cp1026 cp1140 utf_16be".split()
py> for cd in codecs:
...     print("ab.12".encode(cd))  # ASCII gives b'ab.12'
...
b'\x81\x82K\xf1\xf2'
b'\x81\x82K\xf1\xf2'
b'\x81\x82K\xf1\xf2'
b'\x81\x82K\xf1\xf2'
b'\x81\x82K\xf1\xf2'
b'\x81\x82K\xf1\xf2'
b'\x00a\x00b\x00.\x001\x002'
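
The same check can be extended from one sample string to the whole 7-bit range. This is only a sketch: the helper name encodes_ascii_unchanged is invented here, and the codec list is an arbitrary selection rather than a survey.

```python
ASCII_TEXT = "".join(chr(i) for i in range(128))
ASCII_BYTES = bytes(range(128))

def encodes_ascii_unchanged(codec):
    """True if every ASCII character encodes to its own ASCII byte value."""
    try:
        return ASCII_TEXT.encode(codec) == ASCII_BYTES
    except UnicodeEncodeError:
        # The codec cannot represent some ASCII character at all.
        return False

for codec in ("ascii", "latin-1", "cp1252", "utf-8", "cp037", "cp500", "utf_16be"):
    print(codec, encodes_ascii_unchanged(codec))
```

The EBCDIC code pages and UTF-16 fail this test; the others in the list pass it.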


There's also at least one other double-byte character set which, as far as I
can tell, isn't supported by Python: KS X 1001, used in Korea.

Then there are the variable-width encodings which are backwards compatible
with ASCII only in the sense that text containing *only* ASCII characters
uses the same sequence of bytes as ASCII would. But being variable-width,
they cannot be treated as a simple array of bytes with a fixed 1 byte = 1
character mapping. Examples include UTF-8, UTF-7, the various Shift-JIS
encodings, EUC-JP, EUC-KR, EUC-TW, GB18030, Big5, and others.

This concept of ASCII = "all character sets", or "nearly all", or "okay,
maybe not nearly all of them, but just the important ones" is terribly
Euro-centric. The very idea would be laughable in Japan and other East
Asian countries, where Shift-JIS and Big5 still dominate.

So please, open your mind to the reality of computing outside of Europe.
ASCII-based encodings no more encompass all of the world's natural
languages (not even the "important" ones) than "everyone is using Internet
Explorer and Windows XP, right?" describes the state of the Internet.




-- 
Steven



Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]

2016-05-26 Thread Marko Rauhamaa
Erik :

> On 26/05/16 10:20, Marko Rauhamaa wrote:
>> ASCII has taken new meanings. For most coders, in relaxed style, it
>> refers to any byte-oriented character encoding scheme. In C terms,
>>
>>  ASCII == char *
>
> Is this really true? So by "taken new meanings" you are saying that it
> has actually lost all meaning.

You are exaggerating.

> The 'S' stands for "Standard". It's an encoding (each byte value refers
> to a particular character value according to that standard).
>
> To say that any array of bytes, regardless of what each byte value
> should be interpreted as, is "ASCII" makes no sense.

Read what I wrote: "character encoding scheme". Even C's "char" type
strongly suggests textual characters.

However, I must correct myself slightly: ASCII refers to any
byte-oriented character encoding scheme *largely coinciding with ASCII
proper*. But since all of them *are* derivatives of ASCII proper,
mentioning it is somewhat redundant.


Marko


Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]

2016-05-26 Thread Erik

On 26/05/16 08:21, Jussi Piitulainen wrote:
> UTF-8 ASCII is nice
>
> UTF-16 ASCII is weird.

I am dumbstruck.

E.


Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]

2016-05-26 Thread Erik

On 26/05/16 10:20, Marko Rauhamaa wrote:
> ASCII has taken new meanings. For most coders, in relaxed style, it
> refers to any byte-oriented character encoding scheme. In C terms,
>
>  ASCII == char *

Is this really true? So by "taken new meanings" you are saying that it
has actually lost all meaning.

The 'S' stands for "Standard". It's an encoding (each byte value refers
to a particular character value according to that standard).

To say that any array of bytes, regardless of what each byte value
should be interpreted as, is "ASCII" makes no sense.

How "relaxed" are these 'coders' you're referring to, exactly? ;)

Or, have I fallen for your trap, and you're joking with me too?

E.


Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]

2016-05-26 Thread Rustom Mody
On Thursday, May 26, 2016 at 1:41:41 PM UTC+5:30, Erik wrote:
> On 26/05/16 02:28, Dennis Lee Bieber wrote:
> > On Wed, 25 May 2016 22:03:34 +0100, Erik
> > declaimed the following:
> >
> >> Indeed - at that time, I was working with COBOL on an IBM S/370. On that
> >> system, we used EBCDIC ASCII. That was the wierdest ASCII of all  ;)
> >>
> > It would have to be... Extended Binary Coded Decimal Interchange Code,
> > as I recall, predates American Standard Code for Information Interchange.
> >
> > EBCDIC's 8-bit code is actually more closely linked to Hollerith card
> > encodings.
> 
> I really didn't think it would be necessary to point this out (I thought 
> the "" and emoji would be enough), but for the record, my 
> previous message was clearly a joke.
> 
> To break it down, Stephen was making the observation that people call 
> all sorts of extended ASCII encodings (including proprietary things) 
> "ASCII". So I took it to the extreme and called something that had 
> nothing to do with ASCII a type of ASCII.
> 
> As they say, if one has to explain one's jokes then they are probably 
> not funny ...

JFTR I found the comment hilarious and even thought of incorporating it into
http://blog.languager.org/2014/04/unicode-and-unix-assumption.html
but could not find a smooth place to do so.
[Mad run: Intensive course to run next week]


Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]

2016-05-26 Thread Marko Rauhamaa
Erik :

> To break it down, Stephen was making the observation that people call
> all sorts of extended ASCII encodings (including proprietary things)
> "ASCII". So I took it to the extreme and called something that had
> nothing to do with ASCII a type of ASCII.

ASCII has taken new meanings. For most coders, in relaxed style, it
refers to any byte-oriented character encoding scheme. In C terms,

ASCII == char *


Marko


Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]

2016-05-26 Thread Chris Angelico
On Thu, May 26, 2016 at 7:11 PM, Marko Rauhamaa  wrote:
> Python didn't come out unscathed, either. Multithreading is being
> replaced with asyncio

Incorrect. Threading is still important - it's not being replaced.
Asynchronous code support is being added to an existing pool of
multiprocessing techniques, so you can now use preemptive processes or
threads, or cooperative asyncio, depending on what you need.
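
To illustrate that these coexist rather than replace one another, here is a minimal sketch running the same toy job with threads and with asyncio. The task functions are made up for the example, and asyncio.run assumes Python 3.7 or later.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

def blocking_task(n):
    """Ordinary blocking function; suited to a thread (or process) pool."""
    time.sleep(0.01)
    return n * n

async def cooperative_task(n):
    """Coroutine; hands control back to the event loop while it waits."""
    await asyncio.sleep(0.01)
    return n * n

async def run_cooperative(values):
    # gather() returns the results in the order the awaitables were given.
    return await asyncio.gather(*(cooperative_task(v) for v in values))

values = [1, 2, 3]

# Preemptive: the operating system schedules the worker threads.
with ThreadPoolExecutor(max_workers=3) as pool:
    threaded = list(pool.map(blocking_task, values))

# Cooperative: tasks switch only at await points.
cooperative = asyncio.run(run_cooperative(values))

print(threaded, cooperative)
```

concurrent.futures.ProcessPoolExecutor offers the same map interface for the process-based variant, with the usual caveat that the work function must be importable by the worker processes.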

ChrisA


Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]

2016-05-26 Thread Marko Rauhamaa
Jussi Piitulainen :

> UTF-16 ASCII is weird. Wierd. Probably all right in an environment
> that is otherwise set to use UTF-16.
>
> Nothing is as weird as a mix of different encodings of a foreign
> script in the same "plain text" file, said to be "Unicode". 

Some children are just born under unlucky stars. Windows and Java are
among them. If they had been designed a few years earlier or a few years
later, they could have evaded the UTF-16 embarrassment, maybe the
multithreading embarrassment as well.

Python didn't come out unscathed, either. Multithreading is being
replaced with asyncio, and Python 3 broke backward-compatibility to get
Unicode right.


Marko


Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]

2016-05-26 Thread Erik

On 26/05/16 02:28, Dennis Lee Bieber wrote:
> On Wed, 25 May 2016 22:03:34 +0100, Erik
> declaimed the following:
>
>> Indeed - at that time, I was working with COBOL on an IBM S/370. On that
>> system, we used EBCDIC ASCII. That was the wierdest ASCII of all  ;)
>
> It would have to be... Extended Binary Coded Decimal Interchange Code,
> as I recall, predates American Standard Code for Information Interchange.
>
> EBCDIC's 8-bit code is actually more closely linked to Hollerith card
> encodings.

I really didn't think it would be necessary to point this out (I thought
the "" and emoji would be enough), but for the record, my
previous message was clearly a joke.

To break it down, Stephen was making the observation that people call
all sorts of extended ASCII encodings (including proprietary things)
"ASCII". So I took it to the extreme and called something that had
nothing to do with ASCII a type of ASCII.

As they say, if one has to explain one's jokes then they are probably
not funny ...

 :(

E.



Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]

2016-05-26 Thread Rustom Mody
On Thursday, May 26, 2016 at 12:52:09 PM UTC+5:30, Jussi Piitulainen wrote:
> UTF-16 ASCII is weird. Wierd. Probably all right in an environment that
> is otherwise set to use UTF-16.

In http://blog.languager.org/2015/03/whimsical-unicode.html
are some examples of why UTF-16 is bug-inviting
[ section is "Wide is too narrow" ]


Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]

2016-05-26 Thread Jussi Piitulainen
Erik writes:

> On 25/05/16 11:19, Steven D'Aprano wrote:
>> On Wednesday 25 May 2016 19:10, Christopher Reimer wrote:
>>
>>> Back in the early 1980's, I grew up on 8-bit processors and latin-1
>>> was all we had for ASCII.
>>
>> It really, truly wasn't. But you can be forgiven for not knowing
>> that, since until the rise of the public Internet most people weren't
>> exposed to more than one code page or encoding, and it was incredibly
>> common for people to call *any* encoding "ASCII".
>
> Indeed - at that time, I was working with COBOL on an IBM S/370. On
> that system, we used EBCDIC ASCII. That was the wierdest ASCII of all
>  ;)

UTF-8 ASCII is nice.

UTF-16 ASCII is weird. Wierd. Probably all right in an environment that
is otherwise set to use UTF-16.

Nothing is as weird as a mix of different encodings of a foreign script
in the same "plain text" file, said to be "Unicode". 


Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]

2016-05-25 Thread Erik

On 25/05/16 11:19, Steven D'Aprano wrote:
> On Wednesday 25 May 2016 19:10, Christopher Reimer wrote:
>
>> Back in the early 1980's, I grew up on 8-bit processors and latin-1 was all
>> we had for ASCII.
>
> It really, truly wasn't. But you can be forgiven for not knowing that, since
> until the rise of the public Internet most people weren't exposed to more than
> one code page or encoding, and it was incredibly common for people to call
> *any* encoding "ASCII".

Indeed - at that time, I was working with COBOL on an IBM S/370. On that
system, we used EBCDIC ASCII. That was the wierdest ASCII of all  ;)


E.



Re: Exended ASCII and code pages [was Re: for / while else doesn't make sense]

2016-05-25 Thread Chris Angelico
On Wed, May 25, 2016 at 8:19 PM, Steven D'Aprano
 wrote:
> While the code page system was necessary at the time, the legacy of them
> today continues to plague computer users, causing moji-bake, errors on
> file systems[1], and holding back the adoption of Unicode.
>
> [1] I'm speaking from experience there. Take files created on a Windows
> machine using some legacy code page, and try to copy them to another
> server using Unicode, and depending on the intelligence of the server,
> you may not be able to copy them. On the flip side, there are many file
> names I can easily create on Linux but cannot copy to a FAT file system.

And getting a .zip file from a Windows user that had a file in it
called "Café Sounds.something", extracting it on Linux, and finding it
called "Caf\xe9" or something. Very annoying. Fortunately it was only
the one file in a large directory.
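
Something like the following reproduces the symptom. It is only a sketch: it assumes the name was stored in a single-byte Windows code page (cp1252 is a guess, the actual archive's encoding is unknown) and then read back as UTF-8. Decoding with errors="backslashreplace" needs Python 3.5 or later.

```python
name = "Caf\u00e9 Sounds"

# Assumption: the archive stored the name in a single-byte Windows code page.
raw = name.encode("cp1252")
print(raw)

# Reading those bytes as UTF-8 trips over 0xE9; backslashreplace shows the damage.
print(raw.decode("utf-8", errors="backslashreplace"))

# Decoding with the code page that was actually used recovers the name.
print(raw.decode("cp1252"))
```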

ChrisA