Re: Python 3.7+ cannot print unicode characters when output is redirected to file - is this a bug?

2022-11-13 Thread Eryk Sun
On 11/13/22, Jessica Smith <12jessicasmit...@gmail.com> wrote:
> Consider the following code run in Powershell or cmd.exe:
>
> $ python -c "print('└')"
> └
>
> $ python -c "print('└')" > test_file.txt
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
>   File "C:\Program Files\Python38\lib\encodings\cp1252.py", line 19, in encode
>     return codecs.charmap_encode(input,self.errors,encoding_table)[0]
> UnicodeEncodeError: 'charmap' codec can't encode character '\u2514' in
> position 0: character maps to <undefined>

If your applications and existing data files are compatible with using
UTF-8, then in Windows 10+ you can modify the administrative regional
settings in the control panel to force using UTF-8. In this case,
GetACP() and GetOEMCP() will return CP_UTF8 (65001), and the reserved
code page constants CP_ACP (0),  CP_OEMCP (1), CP_MACCP (2), and
CP_THREAD_ACP (3) will use CP_UTF8.

You can override this on a per-application basis via the
ActiveCodePage setting in the manifest:

https://learn.microsoft.com/en-us/windows/win32/sbscs/application-manifests#activecodepage

In Windows 10, this setting only supports "UTF-8". In Windows 11, it
also supports "legacy" to allow old applications to run on a system
that's configured to use UTF-8.  Setting an explicit locale is also
supported in Windows 11, such as "en-US", with fallback to UTF-8 if
the given locale has no legacy code page.

Note that setting the system to use UTF-8 also affects the host
process for console sessions (i.e. conhost.exe or openconsole.exe),
since it defaults to using the OEM code page (UTF-8 in this case).
Unfortunately, a legacy read from the console host does not support
reading non-ASCII text as UTF-8. For example:

>>> os.read(0, 6)
SPĀM
b'SP\x00M\r\n'

This is a trivial bug in the console host, which stems from the fact
that UTF-8 is a multibyte encoding (1-4 bytes per code point), but for some
reason the console team at Microsoft still hasn't fixed it. You can
use chcp.com to set the console's input and output code pages to
something other than UTF-8 if you have to read non-ASCII input in a
legacy console app. By default, this problem doesn't affect Python's
sys.stdin, which internally uses wide-character ReadConsoleW() with
the system's native text encoding, UTF-16LE.
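
To see which of these encodings a given Python process has actually
picked up, a small diagnostic along the following lines may help. It is
only a sketch: the helper name describe_encodings is made up for
illustration, and the Windows-only branch assumes the kernel32 functions
GetACP, GetOEMCP, GetConsoleCP and GetConsoleOutputCP are reached through
ctypes.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings that affect text I/O in this process."""
    info = {
        # Encoding chosen for each standard stream (None if the stream is missing).
        "stdin": getattr(sys.stdin, "encoding", None),
        "stdout": getattr(sys.stdout, "encoding", None),
        # Locale encoding, normally used for redirected streams and open().
        "preferred": locale.getpreferredencoding(False),
        # True when Python was started with -X utf8 or PYTHONUTF8=1.
        "utf8_mode": bool(sys.flags.utf8_mode),
    }
    if sys.platform == "win32":
        import ctypes

        kernel32 = ctypes.windll.kernel32
        # 65001 (CP_UTF8) if the system or the manifest forces UTF-8.
        info["GetACP"] = kernel32.GetACP()
        info["GetOEMCP"] = kernel32.GetOEMCP()
        # Both console calls return 0 when no console is attached.
        info["GetConsoleCP"] = kernel32.GetConsoleCP()
        info["GetConsoleOutputCP"] = kernel32.GetConsoleOutputCP()
    return info


if __name__ == "__main__":
    for key, value in describe_encodings().items():
        print(f"{key}: {value}")
```

Running it once at the console and once with output redirected to a
file should show whether the stdout encoding changes between the two.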
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Python 3.7+ cannot print unicode characters when output is redirected to file - is this a bug?

2022-11-13 Thread Thomas Passin

On 11/13/2022 9:49 AM, Jessica Smith wrote:

> Consider the following code run in Powershell or cmd.exe:
>
> $ python -c "print('└')"
> └
>
> $ python -c "print('└')" > test_file.txt
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
>   File "C:\Program Files\Python38\lib\encodings\cp1252.py", line 19, in encode
>     return codecs.charmap_encode(input,self.errors,encoding_table)[0]
> UnicodeEncodeError: 'charmap' codec can't encode character '\u2514' in
> position 0: character maps to <undefined>
>
> Is this a known limitation of Windows + Unicode? I understand that
> using -X utf8 would fix this, or modifying various environment
> variables. But is this expected for a standard Python installation on
> Windows?
>
> Jessica



This also fails with the same error:

$ python -c "print('└')" |clip


Re: Python 3.7+ cannot print unicode characters when output is redirected to file - is this a bug?

2022-11-13 Thread Barry


> On 13 Nov 2022, at 14:52, Jessica Smith <12jessicasmit...@gmail.com> wrote:
> 
> Consider the following code run in Powershell or cmd.exe:
>
> $ python -c "print('└')"
> └
>
> $ python -c "print('└')" > test_file.txt
> Traceback (most recent call last):
>   File "<string>", line 1, in <module>
>   File "C:\Program Files\Python38\lib\encodings\cp1252.py", line 19, in encode
>     return codecs.charmap_encode(input,self.errors,encoding_table)[0]
> UnicodeEncodeError: 'charmap' codec can't encode character '\u2514' in
> position 0: character maps to <undefined>
>
> Is this a known limitation of Windows + Unicode? I understand that
> using -X utf8 would fix this, or modifying various environment
> variables. But is this expected for a standard Python installation on
> Windows?

Your other thread has a reply that explained this.
It is a problem with windows and character sets.
You have to set things up to allow Unicode to work.

Barry
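
For what "setting things up" can look like, here is a minimal sketch
that reproduces the failure without touching the real stdout and then
shows the same text encoding cleanly as UTF-8. The helper name
encode_like_redirected_stdout is invented for this example; it only
imitates a redirected stream by wrapping an in-memory buffer in
io.TextIOWrapper.

```python
import io


def encode_like_redirected_stdout(text, encoding, errors="strict"):
    """Encode text the way a text stream writing to a file or pipe would."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, errors=errors)
    stream.write(text)
    stream.flush()
    return raw.getvalue()


# cp1252 has no mapping for U+2514, so strict encoding fails as in the traceback.
try:
    encode_like_redirected_stdout("\u2514", "cp1252")
    failed = False
except UnicodeEncodeError:
    failed = True

# UTF-8 can represent every code point, so the same write succeeds.
utf8_bytes = encode_like_redirected_stdout("\u2514", "utf-8")

# In a real script, one option (Python 3.7+) is to switch the stream itself:
#     sys.stdout.reconfigure(encoding="utf-8")
# or to set PYTHONIOENCODING=utf-8 or PYTHONUTF8=1 before starting Python.
```

Whichever switch is used, whatever reads test_file.txt afterwards has to
expect UTF-8 as well.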

> 
> Jessica


Python 3.7+ cannot print unicode characters when output is redirected to file - is this a bug?

2022-11-13 Thread Jessica Smith
Consider the following code run in Powershell or cmd.exe:

$ python -c "print('└')"
└

$ python -c "print('└')" > test_file.txt
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Program Files\Python38\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2514' in
position 0: character maps to <undefined>

Is this a known limitation of Windows + Unicode? I understand that
using -X utf8 would fix this, or modifying various environment
variables. But is this expected for a standard Python installation on
Windows?

Jessica


[Python-announce] ANN: unicode 2.9

2022-06-03 Thread garabik-news-2005-05

unicode is a simple python command line utility that displays
properties for a given unicode character, or searches
unicode database for a given name.

It was written with Linux in mind, but should work almost everywhere
(including MS Windows and MacOSX), UTF-8 console is recommended.

˙pɹɐpuɐʇs əpoɔı̣uՈ əɥʇ ɟo əsn pəɔuɐʌpɐ
puɐ səldı̣ɔuı̣ɹd əɥʇ ɓuı̣ʇɐɹʇsuoɯəp looʇ ɔı̣ʇɔɐpı̣p ʇuəlləɔxə uɐ sı̣ ʇI
˙sʇuı̣odəpoɔ ʇuəɹəɟɟı̣p ʎləʇəldɯoɔ ɓuı̣sn əlı̣ɥʍ 'sɥdʎlɓ ɟo ɯɐəɹʇs ɹɐlı̣ɯı̣s
ʎllɐnsı̣ʌ  oʇuı̣ ʇxəʇ əɥʇ ʇɹəʌuoɔ oʇ pɹɐpuɐʇs əpoɔı̣uՈ əɥʇ ɟo ɹəʍod llnɟ
əɥʇ sʇı̣oldxə ʇɐɥʇ 'ʎʇı̣lı̣ʇn ,əpoɔɐɹɐd, oslɐ suı̣ɐʇuoɔ əɓɐʞɔɐd əɥ⊥

Changes since previous versions:
 * better handling of changes in data files

URL: http://kassiopeia.juls.savba.sk/~garabik/software/unicode.html

License: GPL v3

Installation: pip install unicode

-- 
 ---
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__garabik @ kassiopeia.juls.savba.sk |
 ---
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!


Re: Printing Unicode strings in a list

2022-04-30 Thread Chris Angelico
On Sun, 1 May 2022 at 00:03, Vlastimil Brom  wrote:
> (Even the redundant u prefix from your python2 sample is apparently
> accepted, maybe for compatibility reasons.)

Yes, for compatibility reasons. It wasn't accepted in Python 3.0, but
3.3 re-added it to make porting easier. It doesn't do anything.
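
A quick check of that claim on Python 3.3 or later (the variable names
are just for illustration):

```python
plain = "\u2551"
prefixed = u"\u2551"  # the u prefix is accepted but changes nothing

same_value = plain == prefixed
same_type = type(plain) is type(prefixed) is str
```

Both same_value and same_type come out True.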

ChrisA


Re: Printing Unicode strings in a list

2022-04-30 Thread Vlastimil Brom
On Thu, 28 Apr 2022 at 13:33, Stephen Tucker wrote:
>
> Hi PythonList Members,
>
> Consider the following log from a run of IDLE:
>
> ==
>
> Python 2.7.10 (default, May 23 2015, 09:40:32) [MSC v.1500 32 bit (Intel)]
> on win32
> Type "copyright", "credits" or "license()" for more information.
> >>> print (u"\u2551")
> ║
> >>> print ([u"\u2551"])
> [u'\u2551']
> >>>
>
> ==
>
> Yes, I am still using Python 2.x - I have good reasons for doing so and
> will be moving to Python 3.x in due course.
>
> I have the following questions arising from the log:
>
> 1. Why does the second print statement not produce [ ║]  or ["║"] ?
>
> 2. Should the second print statement produce [ ║]  or ["║"] ?
>
> 3. Given that I want to print a list of Unicode strings so that their
> characters are displayed (instead of their Unicode codepoint definitions),
> is there a more Pythonic way of doing it than concatenating them into a
> single string and printing that?
>
> 4. Does Python 3.x exhibit the same behaviour as Python 2.x in this respect?
>
> Thanks in anticipation.
>
> Stephen Tucker.

Hi,
I'm not sure whether I am misunderstanding the 4th question or
the answers to it (it is not clear to me whether the focus is on
character printing or on the quotation marks);
either way, in Python 3 the character glyphs are printed in these
cases instead of the escaped code point notation, cf.:
==
Python 3.8.10 (tags/v3.8.10:3d8993a, May  3 2021, 11:48:03) [MSC v.1928 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> print ([u"\u2551"])
['║']
>>>
>>> print([u"\u2551"])
['║']
>>> print("\u2551")
║
>>> print("║")
║
>>> print(repr("\u2551"))
'║'
>>> print(ascii("\u2551"))
'\u2551'
>>>
==

(Even the redundant u prefix from your python2 sample is apparently
accepted, maybe for compatibility reasons.)

hth,
   vbr


Re: Printing Unicode strings in a list

2022-04-28 Thread Rob Cliffe via Python-list



On 28/04/2022 14:27, Stephen Tucker wrote:

> To Cameron Simpson,
>
> Thanks for your in-depth and helpful reply. I have noted it and will be
> giving it close attention when I can.
>
> The main reason why I am still using Python 2.x is that my colleagues are
> still using a GIS system that has a Python programmer's interface - and
> that interface uses Python 2.x.
>
> The team are moving to an updated version of the system whose Python
> interface is Python 3.x.
>
> However, I am expecting to retire over the next 8 months or so, so I do not
> need to be concerned with Python 3.x - my successor will be doing that.


Still, if you're feeling noble, you could start the work of making your 
code Python 3 compatible.😁

Best wishes
Rob Cliffe


Re: Printing Unicode strings in a list

2022-04-28 Thread Jon Ribbens via Python-list
On 2022-04-28, Stephen Tucker  wrote:
> Hi PythonList Members,
>
> Consider the following log from a run of IDLE:
>
>==
>
> Python 2.7.10 (default, May 23 2015, 09:40:32) [MSC v.1500 32 bit (Intel)]
> on win32
> Type "copyright", "credits" or "license()" for more information.
> >>> print (u"\u2551")
> ║
> >>> print ([u"\u2551"])
> [u'\u2551']
> >>>
>
>==
>
> Yes, I am still using Python 2.x - I have good reasons for doing so and
> will be moving to Python 3.x in due course.
>
> I have the following questions arising from the log:
>
> 1. Why does the second print statement not produce [ ║]  or ["║"] ?

print(x) implicitly calls str(x) to convert 'x' to a string for output.
lists don't have their own str converter, so fall back to repr instead,
which outputs '[', followed by the repr of each list item separated by
', ', followed by ']'.
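
A small sketch of that fallback under Python 3 (the variable names are
only illustrative; the second list item is the awkward string from
elsewhere in this thread):

```python
items = ["\u2551", "], [3,"]

as_str = str(items)    # what print(items) writes
as_repr = repr(items)  # lists have no str form of their own

# The same text built by hand: '[' + repr of each item joined by ', ' + ']'.
manual = "[" + ", ".join(repr(item) for item in items) + "]"
```

All three strings are identical, and the quotes around each item are
what keep the second element from being mistaken for list syntax.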

> 2. Should the second print statement produce [ ║]  or ["║"] ?

There's certainly no obvious reason why it *should*, and pretty decent
reasons why it shouldn't (it would be a hybrid mess of Python-syntax
repr output and raw string output).

> 3. Given that I want to print a list of Unicode strings so that their
> characters are displayed (instead of their Unicode codepoint definitions),
> is there a more Pythonic way of doing it than concatenating them into a
> single string and printing that?

print(' '.join(list_of_strings)) is probably most common. I suppose you
could do print(*list_of_strings) if you like, but I'm not sure I'd call
it "pythonic" as I've never seen anyone do that (that doesn't mean of
course that other people haven't seen it done!) Personally I only tend
to use print() for debugging output.

> 4. Does Python 3.x exhibit the same behaviour as Python 2.x in this respect?

Yes.


Re: Printing Unicode strings in a list

2022-04-28 Thread Stephen Tucker
To Cameron Simpson,

Thanks for your in-depth and helpful reply. I have noted it and will be
giving it close attention when I can.

The main reason why I am still using Python 2.x is that my colleagues are
still using a GIS system that has a Python programmer's interface - and
that interface uses Python 2.x.

The team are moving to an updated version of the system whose Python
interface is Python 3.x.

However, I am expecting to retire over the next 8 months or so, so I do not
need to be concerned with Python 3.x - my successor will be doing that.

Stephen.


On Thu, Apr 28, 2022 at 2:07 PM Cameron Simpson  wrote:

> On 28Apr2022 12:32, Stephen Tucker  wrote:
> >Consider the following log from a run of IDLE:
> >==
> >
> >Python 2.7.10 (default, May 23 2015, 09:40:32) [MSC v.1500 32 bit (Intel)]
> >on win32
> >Type "copyright", "credits" or "license()" for more information.
> >>>> print (u"\u2551")
> >║
> >>>> print ([u"\u2551"])
> >[u'\u2551']
> >>>>
> >==
> >
> >Yes, I am still using Python 2.x - I have good reasons for doing so and
> >will be moving to Python 3.x in due course.
>
> Love to hear those reasons. Not suggesting that they are invalid.
>
> >I have the following questions arising from the log:
> >1. Why does the second print statement not produce [ ║]  or ["║"] ?
>
> Because print() prints the str() of each of its arguments, and str() of
> a list is the same as its repr(), which is a list of the repr()s of
> every item in the list. Repr of a Unicode string looks like what you
> have in Python 2.
>
> >2. Should the second print statement produce [ ║]  or ["║"] ?
>
> Well, to me its behaviour is correct. Do you _want_ to get your Unicode
> glyph? in quotes? That is up to you. But consider: what would be sane
> output if the list contained the string "], [3," ?
>
> >3. Given that I want to print a list of Unicode strings so that their
> >characters are displayed (instead of their Unicode codepoint definitions),
> >is there a more Pythonic way of doing it than concatenating them into a
> >single string and printing that?
>
> You could print them with empty separators:
>
> print(s1, s2, .., sep='')
>
> To do that in Python 2 you need to:
>
> from __future__ import print_function
>
> at the top of your Python file. Then you'll have a Python 3 print
> function. In Python 2, print is normally a statement and you don't need
> the brackets:
>
> print u"\u2551"
>
> but print() is genuinely better as a function anyway.
>
> >4. Does Python 3.x exhibit the same behaviour as Python 2.x in this
> respect?
>
> Broadly yes, except that all strings are Unicode strings and we don't
> bother with the leading "u" prefix.
>
> Cheers,
> Cameron Simpson 


Re: Printing Unicode strings in a list

2022-04-28 Thread Cameron Simpson
On 28Apr2022 12:32, Stephen Tucker  wrote:
>Consider the following log from a run of IDLE:
>==
>
>Python 2.7.10 (default, May 23 2015, 09:40:32) [MSC v.1500 32 bit (Intel)]
>on win32
>Type "copyright", "credits" or "license()" for more information.
>>>> print (u"\u2551")
>║
>>>> print ([u"\u2551"])
>[u'\u2551']
>>>>
>==
>
>Yes, I am still using Python 2.x - I have good reasons for doing so and
>will be moving to Python 3.x in due course.

Love to hear those reasons. Not suggesting that they are invalid.

>I have the following questions arising from the log:
>1. Why does the second print statement not produce [ ║]  or ["║"] ?

Because print() prints the str() of each of its arguments, and str() of
a list is the same as its repr(), which is a list of the repr()s of
every item in the list. Repr of a Unicode string looks like what you
have in Python 2.

>2. Should the second print statement produce [ ║]  or ["║"] ?

Well, to me its behaviour is correct. Do you _want_ to get your Unicode 
glyph? in quotes? That is up to you. But consider: what would be sane 
output if the list contained the string "], [3," ?

>3. Given that I want to print a list of Unicode strings so that their
>characters are displayed (instead of their Unicode codepoint definitions),
>is there a more Pythonic way of doing it than concatenating them into a
>single string and printing that?

You could print them with empty separators:

print(s1, s2, .., sep='')

To do that in Python 2 you need to:

from __future__ import print_function

at the top of your Python file. Then you'll have a Python 3 print
function. In Python 2, print is normally a statement and you don't need
the brackets:

print u"\u2551"

but print() is genuinely better as a function anyway.
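
A minimal Python 3 sketch of the separator approach, with made-up
sample strings. It writes to an in-memory buffer rather than the
console so the result does not depend on the terminal's encoding; on
Python 2 the __future__ import mentioned above would have to be the
first statement in the file.

```python
import io

strings = ["\u2551", "\u2514", "\u2551"]

# print() writes str() of each argument, so the glyphs appear as-is.
out = io.StringIO()
print(*strings, sep="", file=out)
printed = out.getvalue()

# Equivalent single-string form, without the trailing newline print() adds.
joined = "".join(strings)
```

Here printed equals joined plus the newline that print() appends.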

>4. Does Python 3.x exhibit the same behaviour as Python 2.x in this respect?

Broadly yes, except that all strings are Unicode strings and we don't
bother with the leading "u" prefix.

Cheers,
Cameron Simpson 


Printing Unicode strings in a list

2022-04-28 Thread Stephen Tucker
Hi PythonList Members,

Consider the following log from a run of IDLE:

==

Python 2.7.10 (default, May 23 2015, 09:40:32) [MSC v.1500 32 bit (Intel)]
on win32
Type "copyright", "credits" or "license()" for more information.
>>> print (u"\u2551")
║
>>> print ([u"\u2551"])
[u'\u2551']
>>>

==

Yes, I am still using Python 2.x - I have good reasons for doing so and
will be moving to Python 3.x in due course.

I have the following questions arising from the log:

1. Why does the second print statement not produce [ ║]  or ["║"] ?

2. Should the second print statement produce [ ║]  or ["║"] ?

3. Given that I want to print a list of Unicode strings so that their
characters are displayed (instead of their Unicode codepoint definitions),
is there a more Pythonic way of doing it than concatenating them into a
single string and printing that?

4. Does Python 3.x exhibit the same behaviour as Python 2.x in this respect?

Thanks in anticipation.

Stephen Tucker.


Re: 'äÄöÖüÜ' in Unicode (utf-8)

2022-04-07 Thread Anssi Saari
Dennis Lee Bieber  writes:

> On Fri, 1 Apr 2022 03:59:32 +1100, Chris Angelico 
> declaimed the following:
>
>
>>That's jmf. Ignore him. He knows nothing about Unicode and is
>>determined to make everyone aware of that fact.
>>
>>He got blocked from the mailing list ages ago, and I don't think
>>anyone's regretted it.

>   Ah yes... Unfortunately, when gmane made the mirror read-only, I had to
> revert to comp.lang.python... and all the junk that gets in via that and
> Google Groups...

Hm. I just configured my news reader to send follow-ups to the mailing
list when that happened.


Re: 'äÄöÖüÜ' in Unicode (utf-8)

2022-04-01 Thread Chris Angelico
On Fri, 1 Apr 2022 at 11:16, Dennis Lee Bieber  wrote:
>
> On Fri, 1 Apr 2022 03:59:32 +1100, Chris Angelico 
> declaimed the following:
>
>
> >That's jmf. Ignore him. He knows nothing about Unicode and is
> >determined to make everyone aware of that fact.
> >
> >He got blocked from the mailing list ages ago, and I don't think
> >anyone's regretted it.
> >
> Ah yes... Unfortunately, when gmane made the mirror read-only, I had to
> revert to comp.lang.python... and all the junk that gets in via that and
> Google Groups...
>

Killfiles can help.

ChrisA


Re: 'äÄöÖüÜ' in Unicode (utf-8)

2022-03-31 Thread Dennis Lee Bieber
On Fri, 1 Apr 2022 03:59:32 +1100, Chris Angelico 
declaimed the following:


>That's jmf. Ignore him. He knows nothing about Unicode and is
>determined to make everyone aware of that fact.
>
>He got blocked from the mailing list ages ago, and I don't think
>anyone's regretted it.
>
Ah yes... Unfortunately, when gmane made the mirror read-only, I had to
revert to comp.lang.python... and all the junk that gets in via that and
Google Groups...


-- 
Wulfraed Dennis Lee Bieber AF6VN
wlfr...@ix.netcom.com    http://wlfraed.microdiversity.freeddns.org/


Re: 'äÄöÖüÜ' in Unicode (utf-8)

2022-03-31 Thread Chris Angelico
On Fri, 1 Apr 2022 at 03:45, Dennis Lee Bieber  wrote:
>
> On Thu, 31 Mar 2022 00:36:10 -0700 (PDT), moi 
> declaimed the following:
>
> >>>> 'äÄöÖüÜ'.encode('utf-8')
> >b'\xc3\xa4\xc3\x84\xc3\xb6\xc3\x96\xc3\xbc\xc3\x9c'
> >>>> len('äÄöÖüÜ'.encode('utf-8'))
> >12
> >>>>
> >>>> ?
>
> Is there a question in there somewhere?
>
> Crystal ball is hazy...
>
>     However... Note that once you encode the Unicode literal, you have a
> BYTE string. There are 12 bytes in that binary -- it is NOT considered
> Unicode at that point (only when you decode it with the same CODEC will it
> be Unicode).
>

That's jmf. Ignore him. He knows nothing about Unicode and is
determined to make everyone aware of that fact.

He got blocked from the mailing list ages ago, and I don't think
anyone's regretted it.

ChrisA


Re: 'äÄöÖüÜ' in Unicode (utf-8)

2022-03-31 Thread Dennis Lee Bieber
On Thu, 31 Mar 2022 00:36:10 -0700 (PDT), moi 
declaimed the following:

>>>> 'äÄöÖüÜ'.encode('utf-8')
>b'\xc3\xa4\xc3\x84\xc3\xb6\xc3\x96\xc3\xbc\xc3\x9c'
>>>> len('äÄöÖüÜ'.encode('utf-8'))
>12
>>>> 
>>>> ?

Is there a question in there somewhere?

Crystal ball is hazy...

However... Note that once you encode the Unicode literal, you have a
BYTE string. There are 12 bytes in that binary -- it is NOT considered
Unicode at that point (only when you decode it with the same CODEC will it
be Unicode).
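
The difference can be seen in a short sketch (the variable names are
illustrative; the string is written with escapes so it does not depend
on the source file's encoding):

```python
text = "\xe4\xc4\xf6\xd6\xfc\xdc"   # the six characters from the original post
data = text.encode("utf-8")         # a bytes object, two bytes per character here

char_count = len(text)
byte_count = len(data)

# Decoding with the same codec restores the original str.
round_trip = data.decode("utf-8")

# Decoding with a different codec raises no error here but gives wrong text.
mojibake = data.decode("latin-1")
```

char_count is 6 and byte_count is 12; round_trip equals text, while
mojibake is a 12-character string that does not.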


-- 
Wulfraed Dennis Lee Bieber AF6VN
wlfr...@ix.netcom.com    http://wlfraed.microdiversity.freeddns.org/


Re: ANN: unicode 2.8

2021-01-02 Thread Chris Angelico
On Sun, Jan 3, 2021 at 10:28 AM Terry Reedy  wrote:
> > And when implementing this, it was a no-brainer to include also the
> > brexit varian (verbatim).
>
> I assume you meant 'variation' and not Varian, the maker of scientific
> instruments.

I assumed simple typo for "variant"

ChrisA


Re: ANN: unicode 2.8

2021-01-02 Thread Terry Reedy

On 1/1/2021 3:48 PM, garabik-news-2005...@kassiopeia.juls.savba.sk wrote:

Terry Reedy  wrote:

On 12/31/2020 9:36 AM, garabik-news-2005...@kassiopeia.juls.savba.sk wrote:

unicode is a simple python command line utility that displays
properties for a given unicode character, or searches
unicode database for a given name.

...

Changes since previous versions:

   * display ASCII table (either traditional with --ascii or the new
 EU–UK Trade and Cooperation Agreement version with --brexit-ascii)


The latter option implied to me that the agreement defines an 
intentional variation on standard ASCII.  I immediately wondered whether 
they had changed the actual 7-bit ascii code, which would be egregiously 
bad, or made yet another variation of 8-bit 'extended ascii', perhaps to 
ensure inclusion of both the pound and euro signs.


So I googled 'brexit ascii'.  And was surprised to discover that there 
is no such thing as 'brexit ascii', just yet another cock-up in text 
preparation.  (I have seen worse when a digital text of mine was mangled 
during markup.  Fortunately, I was allowed to read the page proofs.  But 
I still don't understand how spelling errors were introduced within 
words I had spelled correctly.)



Are you reproducing it with bugs included?
How is that of any use to anyone?


I followed this with links to justify my claim and question:

A tweet linking the treaty annex page
https://twitter.com/thejsa_/status/1343291595899207681

A Stack Exchange question and discussion of the bugs and oddities.
https://politics.stackexchange.com/questions/61178/why-does-the-eu-uk-trade-deal-have-the-7-bit-ascii-table-as-an-appendix

In the latter are mentions of other text, perhaps copy-pasted from the 
1990s recommending the now deprecated SHA1 and referring to Netscape 
Navigator 4 as a modern browser.  Clearly, in the rush to finish, the 
annex was not properly reviewed by current technical experts.



Including the (correct) ASCII table has been a long term, low priority -
I am using ascii(1) utility reasonably often and it makes sense to
reproduce this functionality.

And when implementing this, it was a no-brainer to include also the
brexit varian (verbatim).


I assume you meant 'variation' and not Varian, the maker of scientific 
instruments.


But why do you consider it a no-brainer to include nonsense in your 
program and mislead people?  People already have enough trouble dealing 
with text coding.



After all, given the blood and sweat and tears
shed during the negotiations, I am sure each and every line of the
Agreement has been combed and (re)negotiated over and over by experienced
negotiators and verified by an army of experts in the fields


What are we supposed to make of this?  That you already knew that 
'brexit-ascii' is nonsense?



--
Terry Jan Reedy




Re: ANN: unicode 2.8

2021-01-01 Thread garabik-news-2005-05
Terry Reedy  wrote:
> On 12/31/2020 9:36 AM, garabik-news-2005...@kassiopeia.juls.savba.sk wrote:
>> unicode is a simple python command line utility that displays
>> properties for a given unicode character, or searches
>> unicode database for a given name.
> ...
>> Changes since previous versions:
>> 
>>   * display ASCII table (either traditional with --ascii or the new
>> EU–UK Trade and Cooperation Agreement version with --brexit-ascii)
> 
> Are you reproducing it with bugs included?
> How is that of any use to anyone?

Including the (correct) ASCII table has been a long term, low priority -
I am using ascii(1) utility reasonably often and it makes sense to
reproduce this functionality.

And when implementing this, it was a no-brainer to include also the
brexit varian (verbatim). After all, given the blood and sweat and tears
shed during the negotiations, I am sure each and every line of the
Agreement has been combed and (re)negotiated over and over by experienced
negotiators and verified by an army of experts in the fields

-- 
 ---
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__garabik @ kassiopeia.juls.savba.sk |
 ---
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!


Re: ANN: unicode 2.8

2020-12-31 Thread Terry Reedy

On 12/31/2020 9:36 AM, garabik-news-2005...@kassiopeia.juls.savba.sk wrote:

unicode is a simple python command line utility that displays
properties for a given unicode character, or searches
unicode database for a given name.

...

Changes since previous versions:

  * display ASCII table (either traditional with --ascii or the new
EU–UK Trade and Cooperation Agreement version with --brexit-ascii)


Are you reproducing it with bugs included?
How is that of any use to anyone?
A tweet linking the treaty annex page
https://twitter.com/thejsa_/status/1343291595899207681
A Stack Exchange question and discussion of the bugs and oddities.
https://politics.stackexchange.com/questions/61178/why-does-the-eu-uk-trade-deal-have-the-7-bit-ascii-table-as-an-appendix

The likely answer is that the treaty writers copy-pasted from 
decades-old docs and could not be bothered to link to the actual ISO 
standard.


--
Terry Jan Reedy




ANN: unicode 2.8

2020-12-31 Thread garabik-news-2005-05
unicode is a simple python command line utility that displays
properties for a given unicode character, or searches
unicode database for a given name.

It was written with Linux in mind, but should work almost everywhere
(including MS Windows and MacOSX), UTF-8 console is recommended.

˙pɹɐpuɐʇs əpoɔı̣uՈ əɥʇ ɟo əsn pəɔuɐʌpɐ
puɐ səldı̣ɔuı̣ɹd əɥʇ ɓuı̣ʇɐɹʇsuoɯəp looʇ ɔı̣ʇɔɐpı̣p ʇuəlləɔxə uɐ sı̣ ʇI
˙sʇuı̣odəpoɔ ʇuəɹəɟɟı̣p ʎləʇəldɯoɔ ɓuı̣sn əlı̣ɥʍ 'sɥdʎlɓ ɟo ɯɐəɹʇs ɹɐlı̣ɯı̣s
ʎllɐnsı̣ʌ  oʇuı̣ ʇxəʇ əɥʇ ʇɹəʌuoɔ oʇ pɹɐpuɐʇs əpoɔı̣uՈ əɥʇ ɟo ɹəʍod llnɟ
əɥʇ sʇı̣oldxə ʇɐɥʇ 'ʎʇı̣lı̣ʇn ,əpoɔɐɹɐd, oslɐ suı̣ɐʇuoɔ əɓɐʞɔɐd əɥ⊥

Changes since previous versions:

 * display ASCII table (either traditional with --ascii or the new
   EU–UK Trade and Cooperation Agreement version with --brexit-ascii)
 * minor bug fixes

URL: http://kassiopeia.juls.savba.sk/~garabik/software/unicode.html

License: GPL v3

Installation: pip install unicode

-- 
 ---
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__garabik @ kassiopeia.juls.savba.sk |
 ---
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!


Re: Friday Finking: Beyond implementing Unicode

2020-06-17 Thread Terry Reedy

On 6/16/2020 7:45 PM, DL Neil via Python-list wrote:

On 13/06/20 4:47 AM, Terry Reedy wrote:
There was a recent thread on python-ideas discussing this.  It started 
with arrow characters.  There have been others.


Am pleased to hear that it's neither 'new' nor 'way out there'...


The idea has been rejected multiple times, which puts you in good 
company (in a sense).


Am not subscribed to that list. Went looking for its archives, but 
failed - there's no "ideas" on 
(https://mail.python.org/mailman/listinfo). Please send a pointer...


Try mailman3.
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Archive link on page.

--
Terry Jan Reedy




Re: Friday Finking: Beyond implementing Unicode

2020-06-16 Thread DL Neil via Python-list
There was a recent thread on python-ideas discussing this.  It started 
with arrow characters.  There have been others.



Am pleased to hear that it's neither 'new' nor 'way out there'...

Am not subscribed to that list. Went looking for its archives, but 
failed - there's no "ideas" on 
(https://mail.python.org/mailman/listinfo). Please send a pointer...



Apologies!
Eventually remembered the second list of lists - the list of Python 
lists which are Python lists but not on the first list of Python 
lists... No wonder I'm dizzy!

--
Regards =dn


Re: Friday Finking: Beyond implementing Unicode

2020-06-16 Thread DL Neil via Python-list

On 13/06/20 5:11 AM, Dennis Lee Bieber wrote:

On Fri, 12 Jun 2020 18:03:55 +1200, DL Neil via Python-list
 declaimed the following:



There is/was a language called "APL" (and yes the acronym means "A
Programming Language", and yes it started the craze, through "B" (and
BCPL), and yes, that brought us "C" - which you are more likely to have
heard about - and yes then there were DataSci folk, presumably more
numerate than literate, who thought the next letter to be "R". So, sad!?).


R was preceded by S http://www.unige.ch/ses/sococ/cl/r/srdiff.e.html
https://cran.r-project.org/doc/FAQ/R-FAQ.html#What-are-the-differences-between-R-and-S_003f
(which, with some scrolling, produces...


Oh dear, my sarcasm about being literately-challenged stands!



APL was hopelessly keyboard-unfriendly, requiring multiple key-presses
or 'over-typing' to produce those arithmetic-operator symbols -


Not with a Tektronix APL terminal, and Xerox CP/V APL 



Specific design-for-purpose - hardware/software integration!



remember, much of this was on mainframe 3270-style terminals, although
later PC-implementations have existed (I can't comment on how 'active'
any community might be). The over-typing was necessary to encode/produce
the APL symbols which don't exist on a standard typewriter keyboard. Ugh!


Many implementations also allowed for a spelled out version for special
characters... $RHO for example, for the greek letter rho.


To which my first reaction was "ugh!". However, I often prefer to have a 
named constant in my Python code - instead of "magic numbers", eg


LINE_WIDTH = 79 # PEP-8 source-code characters per line



I'm glad to have limited my APL-exposure to only reading about it during
a 'Programming Languages' topic! (If you are 'into' functional
programming you may like to explore further)


I used it as a 3-credit independent study in my senior year (1980). All
I was after was a passing grade to complete the credits for graduation. I'm
slightly ashamed to admit that my fanciest program turned that Tektronix
storage display tube terminal into a glorified Etch-a-Sketch (terminal had
X/Y scroll wheels that the APL implementation could read).


Hey, at least you gained access. I think my uni (when I was an u/grad) 
only had one graphic terminal which was kept in the computer room and 
thus only staff had access.


Our introduction to graphics (using FORTRAN) had to be shown using 80x24 
character-based terminals (DEC VT-52s, from memory). Drawing shapes was 
bad-enough, but demonstrations of rotation and translation became the 
very definition of ugly!


I've been somewhat re-living those days, teaching myself how to play 
with Pygame (not a 'work' activity!), and learning how to move entities 
around on the screen (quite similar to HTML5, but sufficiently different 
to give pause). That said, the learning of such basic "building-blocks", 
four-plus decades ago, under-pins working in both/either/each, today!

--
Regards =dn
--
https://mail.python.org/mailman/listinfo/python-list


Re: Friday Finking: Beyond implementing Unicode

2020-06-16 Thread DL Neil via Python-list

On 13/06/20 4:47 AM, Terry Reedy wrote:

On 6/12/2020 2:03 AM, DL Neil via Python-list wrote:
Unicode has given us access to a wealth of mathematical and other 
symbols. Hardware and soft-/firm-ware flexibility enable us to move 
beyond and develop new 'standards'. Do we have opportunities to make 
computer programming more math-familiar and/or more 
logically-expressive, and thus easier to learn and practice? Could we 
develop Python to take advantage of these opportunities?


...

Could we then also 'update' Python, to accept the wider range of 
symbols instead/in-addition to those currently in-use?


Would such even constitute 'a good idea'?


There was a recent thread on python-ideas discussing this.  It started 
with arrow characters.  There have been others.



Am pleased to hear that it's neither 'new' nor 'way out there'...

Am not subscribed to that list. Went looking for its archives, but 
failed - there's no "ideas" on 
(https://mail.python.org/mailman/listinfo). Please send a pointer...

--
Regards =dn
--
https://mail.python.org/mailman/listinfo/python-list


Re: Friday Finking: Beyond implementing Unicode

2020-06-12 Thread Terry Reedy

On 6/12/2020 2:03 AM, DL Neil via Python-list wrote:
Unicode has given us access to a wealth of mathematical and other 
symbols. Hardware and soft-/firm-ware flexibility enable us to move 
beyond and develop new 'standards'. Do we have opportunities to make 
computer programming more math-familiar and/or more 
logically-expressive, and thus easier to learn and practice? Could we 
develop Python to take advantage of these opportunities?


...

Could we then also 'update' Python, to accept the wider range of symbols 
instead/in-addition to those currently in-use?


Would such even constitute 'a good idea'?


There was a recent thread on python-ideas discussing this.  It started 
with arrow characters.  There have been others.



--
Terry Jan Reedy

--
https://mail.python.org/mailman/listinfo/python-list


Re: Friday Finking: Beyond implementing Unicode

2020-06-12 Thread Chris Angelico
On Fri, Jun 12, 2020 at 9:11 PM Elliott Roper  wrote:
>
> On 12 Jun 2020 at 09:47:04 BST, "moi"  wrote:
> i) Who cares?

Don't bother responding to him. He's somehow gotten the idea that
Python's Unicode support is broken, and he spews his vomit out onto
the newsgroup periodically. He's blocked from the mailing list, and
for good reason. Ignore him and save yourself the hassle.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Friday Finking: Beyond implementing Unicode

2020-06-12 Thread Elliott Roper
On 12 Jun 2020 at 09:47:04 BST, "moi"  wrote:

> i) Today there people, who are still not understanding this:
> 
> >>> 'Å'.encode('utf-8')
> b'\xc3\x85'
> >>> 'Å'.encode('utf-16-le')
> b'\xc5\x00'
> >>> 'Å'.encode('utf-32-le')
> b'\xc5\x00\x00\x00'
> 
> ii) On a Western Europen Windows, Py 3 is not even working
> correctly with the *characters* of the Windows-1252 coding
> scheme. (As I understand this issue, you may have the same
> problem on let say an iso-8859-2 platform).
> 
> iii) When it works, I mean when it *by chance* works, the
> result is all by satisfying:
> 
> >>> import timeit
> >>> timeit.timeit("s.encode('utf-8')", "s = 'Universität Zürich' * 1000")
> 50.9616764429
> >>> timeit.timeit("s.encode('utf-8')", "s = 'Universitat Zurich' * 1000")
> 2.488587845973
> >>>
> 
> 
> iv) ...
> v) ...
> vi) ...

i) Who cares?
ii) Breaking News. Windows is mired in backward compatibility.
iii) My 3 year old Mac is 5 times faster than that. Get over it.

Maths always made its greatest advances after notation improved.
Terseness and unambiguity are king.

You are looking backward.
DL Neil is looking forward. A long way forward. It won't be our generation,
our brains are already mis-wired.

-- 
To de-mung my e-mail address:- fsnospam$elliott$$
PGP Fingerprint: 1A96 3CF7 637F 896B C810  E199 7E5C A9E4 8E59 E248


-- 
https://mail.python.org/mailman/listinfo/python-list


Friday Finking: Beyond implementing Unicode

2020-06-11 Thread DL Neil via Python-list
Unicode has given us access to a wealth of mathematical and other 
symbols. Hardware and soft-/firm-ware flexibility enable us to move 
beyond and develop new 'standards'. Do we have opportunities to make 
computer programming more math-familiar and/or more 
logically-expressive, and thus easier to learn and practice? Could we 
develop Python to take advantage of these opportunities?


TLDR;? Skip to the last paragraphs/block...


Back in the ?good, old days, small eight-bit computers advanced beyond 
many of their predecessors, because we could begin to encode characters 
and "string" them together - as well as computing with numbers.


Initially, we used 7-bit ASCII code (on smaller machines - whereas IBM 
mainframes used EBCDIC, etc). ASCII gave us both upper- and lower-case 
letters, digits, special characters, and control codes. Later this was 
extended to 8-bits as "Code Page 1252", whereby MSFT added more special 
characters, superscripts, fractions, currency symbols, and many ordinary 
and combinatorial letters used in other "Romance languages" (European).


Latterly, we have implemented Unicode, which seeks to include all of the 
world's scripts and languages and may employ multiple bytes per 
'character'. (simplification)
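
As a rough illustration of "multiple bytes per character", here is a minimal sketch. The sample characters and the helper name utf8_length are my own choices for illustration, not anything from the discussion; it only uses the built-in str.encode().

```python
# Illustrative sample characters: ASCII, Latin-1 range, BMP, and beyond the BMP.
samples = ["A", "ÿ", "€", "😀"]


def utf8_length(ch: str) -> int:
    """Return how many bytes one character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


for ch in samples:
    print(f"U+{ord(ch):04X} takes {utf8_length(ch)} byte(s) in UTF-8")

# Code Page 1252, by contrast, is a single-byte encoding: a character either
# maps to exactly one byte or cannot be encoded at all.
print("ÿ".encode("cp1252"))
```

The point is only that the byte count varies with the code point in UTF-8, whereas an 8-bit code page can never exceed 256 distinct characters.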


A massive effort went into Python (well done PyDevs!), and the adoption 
of Unicode in-particular, made Python 3 a less-than seamless upgrade 
from Python 2. However, 'standing upon the shoulders of giants', we can 
now take advantage of Unicode both as an encoding for data files, and 
within the code of our own Python applications. We don't often see 
examples of the latter, eg


>>> π = 3.14159
>>> r = 1
>>> circumference = 2 * π * r
>>> print( circumference )
6.28318

>>> Empfänger = "dn"              # Addressee/recipient
>>> Straßenname = "Lansstraße"    # Street name
>>> Immobilien_Hausnummer = "42"  # Building/house number

(whilst the above is valid German, I have 'cheated' in order to add 
suitable characters - for the purposes of illustration to 
EN-monolinguals - apologies for any upset to your sense of "ordnung" - 
please consider the meaning of "42" to restore yourself...)
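
For anyone wanting to check which characters Python 3 accepts in a name, a small sketch follows. The helper is_usable_name is hypothetical (my own name); it relies only on str.isidentifier() and the keyword module. Letters such as π and ß qualify, whereas mathematical symbols such as × and ← do not.

```python
import keyword


def is_usable_name(name: str) -> bool:
    """True if `name` could be used as a variable name in Python 3 source."""
    return name.isidentifier() and not keyword.iskeyword(name)


for candidate in ["π", "Empfänger", "Straßenname", "×", "←", "class"]:
    print(f"{candidate!r:16} usable as a name: {is_usable_name(candidate)}")
```

So Unicode widened the set of letters available for identifiers, but the operator symbols themselves remain the ASCII ones.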



However, we are still shackled to an history where an asterisk (*) is 
used as the multiplication symbol, because "x" was an ASCII letter. 
Similarly, we have the ** for an exponential operator, because we didn't 
have superscripts (per algebraic expression). Worse, we made "=" mean: 
'use the identifier to the left to represent the right-hand-side 
value-result', ie "Let" or "Set" - this despite left-to-right expression 
making it more logical to say: 'transfer this (left-side) value to the 
part on the right', ie 'give all of the chocolate cake to me', as well 
as 'robbing' us of the symbol's usual meaning of "equality" (in Python 
that had to become the "==" symbol). Don't let me get started on "!" 
(exclamation/surprise!) meaning "not"!



There is/was a language called "APL" (and yes the acronym means "A 
Programming Language", and yes it started the craze, through "B" (and 
BCPL), and yes, that brought us "C" - which you are more likely to have 
heard about - and yes then there were DataSci folk, presumably more 
numerate than literate, who thought the next letter to be "R". So, sad!?).


The point of mentioning APL? It allowed the likes of:

AREA←PI×RADIUS⋆2

APL was hopelessly keyboard-unfriendly, requiring multiple key-presses 
or 'over-typing' to produce those arithmetic-operator symbols - 
remember, much of this was on mainframe 3270-style terminals, although 
later PC-implementations have existed (I can't comment on how 'active' 
any community might be). The over-typing was necessary to encode/produce 
the APL symbols which don't exist on a standard typewriter keyboard. Ugh!
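
To make the comparison concrete, here is a toy sketch that rewrites the APL-style line above into Python's ASCII spelling. The SYMBOLS table and the to_python helper are assumptions of mine, purely for illustration; this is not how APL works, and it ignores APL's right-to-left evaluation (this particular expression happens to give the same result either way).

```python
import math

# Hypothetical symbol table: a few APL-style glyphs and their Python spellings.
SYMBOLS = str.maketrans({"←": "=", "×": "*", "÷": "/", "⋆": "**"})


def to_python(expression: str) -> str:
    """Rewrite APL-style symbols as Python's ASCII operators."""
    return expression.translate(SYMBOLS)


source = "AREA←PI×RADIUS⋆2"
python_source = to_python(source)
print(python_source)

# Execute the translated statement in a private namespace (toy use only).
namespace = {"PI": math.pi, "RADIUS": 3}
exec(python_source, namespace)
print(namespace["AREA"])
```

A real implementation would need to live in the tokenizer rather than in a text substitution, which is where the 'is it a good idea?' question starts.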


I'm glad to have limited my APL-exposure to only reading about it during 
a 'Programming Languages' topic! (If you are 'into' functional 
programming you may like to explore further)



Turning now to "hardware" and the subtle 'limitations' it imposes upon us.

PC-users (see also Apple, and glass-keyboard users) have become wedded 
to the 'standard' 101~105-key "QWERTY"/"AZERTY"/etc keyboards (again, 
restricting myself to European languages - with due apologies). Yet, 
there exists a variety of ways to implement the 'standard', as well as a 
range of other keyboard layouts. Plus we have folk experimenting with 
SBCs, eg Raspberry Pi; learning how to interpret low-level hardware, ie 
key-presses and keyboard "arr

Re: ÿ in Unicode

2020-03-07 Thread Grant Edwards
On 2020-03-07, Jon Ribbens via Python-list  wrote:
> On 2020-03-06, Jon Ribbens  wrote:
>> What's the bug, or source of amusement?
>
> Oh, that's fun. There's a Russian Fidonet gateway, that somehow
> still exists, that's re-injecting usenet posts back into the group.

Last time I think it was one in Australia.

--
Grant



-- 
https://mail.python.org/mailman/listinfo/python-list


Re: ÿ in Unicode

2020-03-07 Thread Richard Damon
On 3/7/20 12:52 PM, Ben Bacarisse wrote:
> moi  writes:
> 
>> Le samedi 7 mars 2020 16:41:10 UTC+1, R.Wieser a écrit :
>>> Moi,
>>>
>>> > Fortunately, UTF-8 has not been created the Python devs.
>>>
>>> And there we go again, making vague statements/accusations - without 
>>> /anything/ to back it up ofcourse
>>>
>>> Kiddo, you have posted a couple of messages now, but have said exactly 
>>> nothing.   Are you sure you do not want to go into politics ?
>>>
>> The day, when this language will stop to interpret a byte
>> as being a Latin-1 (ISO-8859-1) character, this language will
>> start to work properly.
> 
> >>> "ÿ".encode('iso-8859-1')
> b'\xff'
> 

or the reverse.

>>> b'\xff'.decode('iso-8859-1')
'ÿ'

iso-8859-1 just isn't the DEFAULT character encoding to use.
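
A short sketch of what "default" means here, assuming Python 3 (where str.encode() with no argument uses UTF-8). The variable names are mine. It also shows what happens when bytes are decoded with a different codec from the one that produced them.

```python
import sys

# Python 3 reports UTF-8 as the default for str.encode() / bytes.decode().
print(sys.getdefaultencoding())

default_bytes = "ÿ".encode()             # no codec given: UTF-8 is used
latin1_bytes = "ÿ".encode("iso-8859-1")  # explicit single-byte codec
print(default_bytes, latin1_bytes)

# Decoding with a different codec than the one used to encode does not fail
# here; it silently yields two unrelated characters ("mojibake").
mojibake = default_bytes.decode("iso-8859-1")
print(mojibake)
```

Latin-1 is available whenever it is asked for by name; it is simply never chosen implicitly.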
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: ÿ in Unicode

2020-03-07 Thread Ben Bacarisse
moi  writes:

> Le samedi 7 mars 2020 16:41:10 UTC+1, R.Wieser a écrit :
>> Moi,
>> 
>> > Fortunately, UTF-8 has not been created the Python devs.
>> 
>> And there we go again, making vague statements/accusations - without 
>> /anything/ to back it up ofcourse
>> 
>> Kiddo, you have posted a couple of messages now, but have said exactly 
>> nothing.   Are you sure you do not want to go into politics ?
>> 
> The day, when this language will stop to interpret a byte
> as being a Latin-1 (ISO-8859-1) character, this language will
> start to work properly.

>>> "ÿ".encode('iso-8859-1')
b'\xff'

-- 
Ben.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: ÿ in Unicode

2020-03-07 Thread R.Wieser
Moi,

> Fortunately, UTF-8 has not been created the Python devs.

And there we go again, making vague statements/accusations - without 
/anything/ to back it up, of course.

Kiddo, you have posted a couple of messages now, but have said exactly 
nothing.   Are you sure you do not want to go into politics ?

Regards,
Rudy Wieser


-- 
https://mail.python.org/mailman/listinfo/python-list


Re: ÿ in Unicode

2020-03-07 Thread R.Wieser
Moi,

> - Today, there are still people who do not understand a
> "ÿ' can not be *safely* encoded with a single byte.

It can (and has been done for ages), just not in the character encoding 
method you've chosen to use.

> - Python == Latin-1 mess (as somebody wrote on a mailing list).

Putting blanket, unsupported statements forward doesn't score you any 
points.  Feel free to come up with examples* though, as well as (of course) 
how you think it could have been done better.

*examples of how Python fails to use the character encoding according 
to its definition.   Any complaint about the encoding itself doesn't 
belong in this newsgroup.

But, seeing that you started this thread by posting stuff that actually 
works** as advertised, I won't hold my breath.

**instead of supporting your 'Python makes a mess of it' stance. Which 
of course suggests that that example is actually the worst thing you could 
come up with - but it only shows both Python and the UTF-x encodings working 
as expected.

> - This "Flexible string representation" succeded to reintroduced
> the mess of the coding of characters.

Nope.   That /you/ don't understand how the UTF-x character encoding works 
doesn't mean others do not either.

> Once you get this, it's a child play to produce failing Python code.
> Python approach.

Do you know how to take a car engine apart and rebuild it?  No?  Then 
you also suck at cooking food, am I right?  (You eat in the car, you 
transport food by it.  The connection is /obviously/ there :-) )

> Other possibility, take a "utf-NNN tool" (lib), C# (Powershell),
> golang and show these tools are correctly working where Python
> fails for the same task.

Kiddo, all I have seen you do is to suggest that UTF encoding is bad(1), and 
by association Python is bad(2), by making some reference to other programs 
that do it better(3) and where Python fails(4)

(1),(2),(3),(4) - None of which are substantiated, let alone proven.   In 
short, hollow and meaningless drivel.  Acceptable for a politician, but not 
for a programmer/scripter.

> A real funny mess. Very amusing.

Oh well, you at least get /some/ enjoyment out of knowing* the Python 
language.

* I'm just assuming you are not actually /using/ it, as it's so bad and you 
have a range of better languages at your disposal.

But that does make me wonder why you are posting here to start with.

Although, I think I can guess ...

Regards,
Rudy Wieser


-- 
https://mail.python.org/mailman/listinfo/python-list


Re: ÿ in Unicode

2020-03-06 Thread Jon Ribbens via Python-list
On 2020-03-06, Jon Ribbens  wrote:
> What's the bug, or source of amusement?

Oh, that's fun. There's a Russian Fidonet gateway, that somehow
still exists, that's re-injecting usenet posts back into the group.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: ÿ in Unicode

2020-03-06 Thread Jon Ribbens via Python-list
On 2020-03-06, Pieter van Oostrum  wrote:
> Jon Ribbens  writes:
>> On 2020-03-06, moi  wrote:
>>> Le jeudi 5 mars 2020 13:20:38 UTC+1, Ben Bacarisse a écrit :
>>>> moi  writes:
>>>> > >>> 'ÿ'.encode('utf-8')
>>>> > b'\xc3\xbf'
>>>> > >>> 'ÿ'.encode('utf-16-le')
>>>> > b'\xff\x00'
>>>> > >>> 'ÿ'.encode('utf-32-le')
>>>> > b'\xff\x00\x00\x00'
>>>
>>>> That all looks as expected.
>>> Yes
>>>
>>>>Is there something about the output that puzzles you?
>>> No
>>>
>>>>Did you have a question?
>>> No, only a comment
>>>
>>> This buggy language is very amusing.
>>
>> What's the bug, or source of amusement?
>
> The bug is in the mental world of the OP.

Quite possibly. I must admit I was just interested to learn what
they thought was wrong or amusing in the above. There's plenty of
room to have reasonable differing opinions on Unicode strings and
how they're implemented in languages, but it's not at all obvious
what could be different in those specific expressions.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: ÿ in Unicode

2020-03-06 Thread Pieter van Oostrum
Jon Ribbens  writes:

> On 2020-03-06, moi  wrote:
>> Le jeudi 5 mars 2020 13:20:38 UTC+1, Ben Bacarisse a écrit :
>>> moi  writes:
>>> > >>> 'ÿ'.encode('utf-8')
>>> > b'\xc3\xbf'
>>> > >>> 'ÿ'.encode('utf-16-le')
>>> > b'\xff\x00'
>>> > >>> 'ÿ'.encode('utf-32-le')
>>> > b'\xff\x00\x00\x00'
>>
>>> That all looks as expected.
>> Yes
>>
>>>Is there something about the output that puzzles you?
>> No
>>
>>>Did you have a question?
>> No, only a comment
>>
>> This buggy language is very amusing.
>
> What's the bug, or source of amusement?

The bug is in the mental world of the OP.
-- 
Pieter van Oostrum
www: http://pieter.vanoostrum.org/
PGP key: [8DAE142BE17999C4]
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: ÿ in Unicode

2020-03-06 Thread Jon Ribbens via Python-list
On 2020-03-06, moi  wrote:
> Le jeudi 5 mars 2020 13:20:38 UTC+1, Ben Bacarisse a écrit :
>> moi  writes:
>> > >>> 'ÿ'.encode('utf-8')
>> > b'\xc3\xbf'
>> > >>> 'ÿ'.encode('utf-16-le')
>> > b'\xff\x00'
>> > >>> 'ÿ'.encode('utf-32-le')
>> > b'\xff\x00\x00\x00'
>
>> That all looks as expected.
> Yes
>
>>Is there something about the output that puzzles you?
> No
>
>>Did you have a question?
> No, only a comment
>
> This buggy language is very amusing.

What's the bug, or source of amusement?
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: ÿ in Unicode

2020-03-06 Thread Chris Angelico
On Fri, Mar 6, 2020 at 9:31 PM Ben Bacarisse  wrote:
>
> moi  writes:
>
> > Le jeudi 5 mars 2020 13:20:38 UTC+1, Ben Bacarisse a écrit :
> >> moi  writes:
> >>
> >> > >>> 'ÿ'.encode('utf-8')
> >> > b'\xc3\xbf'
> >> > >>> 'ÿ'.encode('utf-16-le')
> >> > b'\xff\x00'
> >> > >>> 'ÿ'.encode('utf-32-le')
> >> > b'\xff\x00\x00\x00'
> >>
> >
> >> That all looks as expected.
> > Yes
> >
> >>Is there something about the output that puzzles you?
> > No
> >
> >>Did you have a question?
> > No, only a comment
> >
> > This buggy language is very amusing.
>
> Whilst I am happy that you are entertained by Python, the ability to
> encode strings in various transfer formats does not strike me as being
> particularly amusing.  But there's little enough happiness in the world,
> so take it where you can!
>

FYI he's blocked from the mailing list and is in most people's
killfiles. Ignore him - he never has anything useful to say, and his
idea of "buggy" disagrees with, well, the whole rest of the world.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: ÿ in Unicode

2020-03-06 Thread Ben Bacarisse
moi  writes:

> Le jeudi 5 mars 2020 13:20:38 UTC+1, Ben Bacarisse a écrit :
>> moi  writes:
>> 
>> > >>> 'ÿ'.encode('utf-8')
>> > b'\xc3\xbf'
>> > >>> 'ÿ'.encode('utf-16-le')
>> > b'\xff\x00'
>> > >>> 'ÿ'.encode('utf-32-le')
>> > b'\xff\x00\x00\x00'
>> 
>
>> That all looks as expected.
> Yes
>
>>Is there something about the output that puzzles you?
> No
>
>>Did you have a question?
> No, only a comment
>
> This buggy language is very amusing.

Whilst I am happy that you are entertained by Python, the ability to
encode strings in various transfer formats does not strike me as being
particularly amusing.  But there's little enough happiness in the world,
so take it where you can!

-- 
Ben.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: ÿ in Unicode

2020-03-05 Thread Ben Bacarisse
moi  writes:

> >>> 'ÿ'.encode('utf-8')
> b'\xc3\xbf'
> >>> 'ÿ'.encode('utf-16-le')
> b'\xff\x00'
> >>> 'ÿ'.encode('utf-32-le')
> b'\xff\x00\x00\x00'

That all looks as expected.  Is there something about the output that
puzzles you?  Did you have a question?
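
For the record, a minimal sketch of why those three results are the expected ones. The variable names are mine; the bit arithmetic follows the standard two-byte UTF-8 layout (110xxxxx 10xxxxxx) used for code points U+0080 to U+07FF.

```python
ch = "ÿ"
code_point = ord(ch)  # U+00FF, i.e. 255

# UTF-8, two-byte form: the top 5 payload bits go in the lead byte,
# the low 6 bits go in the continuation byte.
lead = 0xC0 | (code_point >> 6)
continuation = 0x80 | (code_point & 0x3F)
by_hand = bytes([lead, continuation])
print(by_hand)

# UTF-16-LE and UTF-32-LE store the code point as a 2- or 4-byte
# little-endian integer (for characters in the Basic Multilingual Plane).
utf16 = code_point.to_bytes(2, "little")
utf32 = code_point.to_bytes(4, "little")
print(utf16, utf32)

# Each hand-built value matches what the codec produces, and round-trips.
for codec, expected in [("utf-8", by_hand), ("utf-16-le", utf16), ("utf-32-le", utf32)]:
    assert ch.encode(codec) == expected
    assert expected.decode(codec) == ch
```

Same character, same code point, three different byte serialisations: nothing in the output suggests a bug.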

-- 
Ben.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode filenames

2019-12-07 Thread Chris Angelico
On Sun, Dec 8, 2019 at 8:33 AM Bob van der Poel  wrote:
> Yeah, heard all that before :) But, seriously, I wonder how many short
> (less than 100 lines) programs there are out there written in py2 that will
> not run in py3. Good thing py2 will still be available to be installed for
> many, many years!

If they're that short and people are depending on them, it won't be
too much work to port them. And you gain a huge measure of
reliability: you no longer have to worry about "Unicode filenames" -
or, to be more precise, "non-ASCII filenames" - because everything
will just work.
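
A minimal sketch of what "just work" looks like in Python 3, assuming a filesystem and locale that accept non-ASCII names. The file name is made up for illustration; os.fsencode() and os.fsdecode() are the standard helpers for converting between str and bytes file names.

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical non-ASCII file name, used only for this demonstration.
    path = Path(tmp) / "Zürich_ÿ.txt"
    path.write_text("hello", encoding="utf-8")

    # Directory listings come back as str, not bytes, so no manual decoding.
    names = os.listdir(tmp)
    print(names)

    # Converting to the filesystem's byte form and back is lossless.
    raw = os.fsencode(names[0])
    round_tripped = os.fsdecode(raw)
    content = (Path(tmp) / round_tripped).read_text(encoding="utf-8")
    print(raw, content)
```

In Python 2 the same steps mix byte strings and unicode objects, which is typically where the implicit ASCII decode errors come from.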

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode filenames

2019-12-07 Thread Bob van der Poel
On Sat, Dec 7, 2019 at 12:47 PM DL Neil via Python-list <
python-list@python.org> wrote:

> On 8/12/19 5:50 AM, Bob van der Poel wrote:
> > On Sat, Dec 7, 2019 at 4:00 AM Barry Scott 
> wrote:
> >>> On 6 Dec 2019, at 18:17, Bob van der Poel  wrote:
> >>>
> >>> I have some files which came off the net with, I'm assuming, unicode
> >>> characters in the names. I have a very short program which takes the
> >>> filename and puts into an emacs buffer, and then lets me add
> information
> >> to
> >>> that new file (it's a poor man's DB).
> >>>
> >>> Next, I can look up text in the file and open the saved filename.
> >>> Everything works great until I hit those darn unicode filenames.
>
> ...
>
> >> Do you get the error with python 3?
> > I'm running this program on Linux (Ubuntu 19.10) and Python2.
>
> ...
>
> > I've taking the coward's way out and renamed the 1/2 dozen files. Seems
> > that it is when I grab a filename from the DB it is in unicode and the
> the
> > textAtCursor() and then I am trying to open that file using a fork to a
> > pdf-display program. This is all Q&D stuff so I'm going to file it under
> > "mysteries of life" and live with it :)
>
>
> Fair enough, for such small number no other solution could be as
> efficient! My quick-and-dirty 'solution' would only work for (very few)
> 'old data files' being recognised/name-updated using Python3.
>
>
> Insert here: obligatory warning about the deprecation of Python2 at the
> end of this month/year...
>
>
Yeah, heard all that before :) But, seriously, I wonder how many short
(less than 100 lines) programs there are out there written in py2 that will
not run in py3. Good thing py2 will still be available to be installed for
many, many years!


-- 

 Listen to my FREE CD at http://www.mellowood.ca/music/cedars 
Bob van der Poel ** Wynndel, British Columbia, CANADA **
EMAIL: b...@mellowood.ca
WWW:   http://www.mellowood.ca
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Unicode filenames

2019-12-07 Thread DL Neil via Python-list

On 8/12/19 5:50 AM, Bob van der Poel wrote:

On Sat, Dec 7, 2019 at 4:00 AM Barry Scott  wrote:

On 6 Dec 2019, at 18:17, Bob van der Poel  wrote:

I have some files which came off the net with, I'm assuming, unicode
characters in the names. I have a very short program which takes the
filename and puts into an emacs buffer, and then lets me add information
to that new file (it's a poor man's DB).

Next, I can look up text in the file and open the saved filename.
Everything works great until I hit those darn unicode filenames.


...


Do you get the error with python 3?

I'm running this program on Linux (Ubuntu 19.10) and Python2.


...


I've taken the coward's way out and renamed the 1/2 dozen files. Seems
that it is when I grab a filename from the DB it is in unicode and the
textAtCursor() and then I am trying to open that file using a fork to a
pdf-display program. This is all Q&D stuff so I'm going to file it under
"mysteries of life" and live with it :)



Fair enough, for such a small number no other solution could be as 
efficient! My quick-and-dirty 'solution' would only work for (very few) 
'old data files' being recognised/name-updated using Python3.



Insert here: obligatory warning about the deprecation of Python2 at the 
end of this month/year...

--
Regards =dn


Re: Unicode filenames

2019-12-07 Thread Bob van der Poel
On Sat, Dec 7, 2019 at 4:00 AM Barry Scott  wrote:

>
>
> > On 6 Dec 2019, at 18:17, Bob van der Poel  wrote:
> >
> > I have some files which came off the net with, I'm assuming, unicode
> > characters in the names. I have a very short program which takes the
> > filename and puts into an emacs buffer, and then lets me add information
> > to that new file (it's a poor man's DB).
> >
> > Next, I can look up text in the file and open the saved filename.
> > Everything works great until I hit those darn unicode filenames.
>
> Yes the names you download are unicode.
> All OS can save that filename to disk these days.
> Can you see the file on disk with the name you expect?
>
> What OS are you using?
>
> >
> > Just to confuse me even more, the error seems to be coming from a bit of
> > tkinter code:
> > if sresults.has_key(textAtCursor):
> >     bookname = os.path.expanduser(sresults[textAtCursor].strip())
> >
> > which generates
> >
> >  UnicodeWarning: Unicode equal comparison failed to convert both arguments
> > to Unicode - interpreting them as being unequal  if
> > sresults.has_key(textAtCursor):
>
> What version of python are you using? Peter only managed to get the error
> with python 2.
>
> Do you get the error with python 3?
>
>
I'm running this program on Linux (Ubuntu 19.10) and Python2.

>
>
> > I really don't understand the business about "both arguments". Not sure how
> > to proceed here. Hoping for a guideline!
>
>
I've taken the coward's way out and renamed the 1/2 dozen files. Seems
that it is when I grab a filename from the DB it is in unicode and the
textAtCursor() and then I am trying to open that file using a fork to a
pdf-display program. This is all Q&D stuff so I'm going to file it under
"mysteries of life" and live with it :)

Thanks all!

-- 

 Listen to my FREE CD at http://www.mellowood.ca/music/cedars 
Bob van der Poel ** Wynndel, British Columbia, CANADA **
EMAIL: b...@mellowood.ca
WWW:   http://www.mellowood.ca


Re: Unicode filenames

2019-12-07 Thread Barry Scott



> On 6 Dec 2019, at 18:17, Bob van der Poel  wrote:
> 
> I have some files which came off the net with, I'm assuming, unicode
> characters in the names. I have a very short program which takes the
> filename and puts into an emacs buffer, and then lets me add information to
> that new file (it's a poor man's DB).
> 
> Next, I can look up text in the file and open the saved filename.
> Everything works great until I hit those darn unicode filenames.

Yes the names you download are unicode.
All OS can save that filename to disk these days.
Can you see the file on disk with the name you expect?

What OS are you using? 

> 
> Just to confuse me even more, the error seems to be coming from a bit of
> tkinter code:
> if sresults.has_key(textAtCursor):
>     bookname = os.path.expanduser(sresults[textAtCursor].strip())
> 
> which generates
> 
>  UnicodeWarning: Unicode equal comparison failed to convert both arguments
> to Unicode - interpreting them as being unequal  if
> sresults.has_key(textAtCursor):

What version of python are you using? Peter only managed to get the error
with python 2. 

Do you get the error with python 3?

Barry


> I really don't understand the business about "both arguments". Not sure how
> to proceed here. Hoping for a guideline!
> 
> Thanks.
> 
> 
> -- 
> 
>  Listen to my FREE CD at http://www.mellowood.ca/music/cedars 
> Bob van der Poel ** Wynndel, British Columbia, CANADA **
> EMAIL: b...@mellowood.ca
> WWW:   http://www.mellowood.ca
> -- 
> https://mail.python.org/mailman/listinfo/python-list
> 



Re: Unicode filenames

2019-12-07 Thread Peter Otten
Bob van der Poel wrote:

> I have some files which came off the net with, I'm assuming, unicode
> characters in the names. I have a very short program which takes the
> filename and puts into an emacs buffer, and then lets me add information
> to that new file (it's a poor man's DB).
> 
> Next, I can look up text in the file and open the saved filename.
> Everything works great until I hit those darn unicode filenames.
> 
> Just to confuse me even more, the error seems to be coming from a bit of
> tkinter code:
>  if sresults.has_key(textAtCursor):
>      bookname = os.path.expanduser(sresults[textAtCursor].strip())
> 
> which generates
> 
>   UnicodeWarning: Unicode equal comparison failed to convert both
>   arguments to Unicode - interpreting them as being unequal  if
>   sresults.has_key(textAtCursor):
> 
> I really don't understand the business about "both arguments". Not sure
> how to proceed here. Hoping for a guideline!

I cannot provoke the error with dict.has_key() over here, only with direct 
comparisons:

>>> u"a" == u"ä"
False
>>> u"a" == "ä"
__main__:1: UnicodeWarning: Unicode equal comparison failed to convert both 
arguments to Unicode - interpreting them as being unequal
False

The problem is that you are mixing strings of type str and type unicode, and 
generally speaking the remedy is to use unicode throughout. In your case
this means opening files with io.open() or codecs.open() instead of the 
builtin, and invoking os.listdir() with a unicode argument.

I don't remember about Tkinter, I think it provides ascii-only strings as 
str and everything else as unicode. If that's correct you could play it safe 
with a conversion function:

def ensure_unicode(s):
    # In Python 2, bytes is an alias for str; decode such values to unicode.
    if isinstance(s, bytes):
        return s.decode("ascii")
    return s

Your other option is to live with the *warning* -- it's not an error, just a 
reminder that you have to rethink your types once you switch to Python 3.

You can also switch off the message with

python -W ignore::UnicodeWarning yourscript

or by setting the PYTHONWARNINGS environment variable.
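
For anyone porting such a script, here is a minimal Python 3 sketch of the "use one text type throughout" advice above. The names to_text, sresults and the sample file names are made up for illustration, and UTF-8 is only an assumed encoding. In Python 3 a bytes key never compares equal to a str key and no warning is shown by default, so the lookup just fails quietly unless both sides are normalized first.

```python
def to_text(value, encoding="utf-8"):
    """Return value as str, decoding bytes with the given (assumed) encoding."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Hypothetical stand-in for the poster's 'sresults' dictionary.
sresults = {to_text(b"caf\xc3\xa9.pdf"): "~/books/caf\u00e9.pdf"}

key_from_gui = "caf\u00e9.pdf"        # text, e.g. from a GUI widget
key_from_disk = b"caf\xc3\xa9.pdf"    # bytes, e.g. from os.listdir(b".")

found_text = to_text(key_from_gui) in sresults
found_bytes = to_text(key_from_disk) in sresults
raw_bytes_found = key_from_disk in sresults   # not normalized: lookup misses
```

Normalizing at the boundary (where names enter the program) keeps the rest of the code free of type checks.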




Re: Unicode filenames

2019-12-06 Thread Terry Reedy

On 12/6/2019 1:17 PM, Bob van der Poel wrote:

I have some files which came off the net with, I'm assuming, unicode
characters in the names. I have a very short program which takes the
filename and puts into an emacs buffer, and then lets me add information to
that new file (it's a poor man's DB).

Next, I can look up text in the file and open the saved filename.
Everything works great until I hit those darn unicode filenames.

Just to confuse me even more, the error seems to be coming from a bit of
tkinter code:
  if sresults.has_key(textAtCursor):
      bookname = os.path.expanduser(sresults[textAtCursor].strip())


'textAtCursor' does not appear in any 3.9 tkinter/*.py file


which generates

   UnicodeWarning: Unicode equal comparison failed to convert both arguments
to Unicode - interpreting them as being unequal  if
sresults.has_key(textAtCursor):

I really don't understand the business about "both arguments".


'sresults.has_key(textAtCursor)' will see if the hash value of 
textAtCursor matches the hash value of any key and then compare the 
strings.  'failed to convert' suggests to me that you are running 2.x 
and that one of the strings is bytes and the other unicode.
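
A small sketch of the lookup order described above, written for Python 3 (where has_key() is spelled "in"). LoggedKey is a made-up class that only records which special methods the dict calls; it is not part of any library.

```python
class LoggedKey:
    """Hypothetical key type that records hashing and comparison calls."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def __hash__(self):
        self.log.append("hash")
        # Case-insensitive hash, so "spam" and "SPAM" hash alike.
        return hash(self.name.lower())

    def __eq__(self, other):
        self.log.append("eq")
        return self.name == other.name


log = []
table = {LoggedKey("spam", log): 1}

log.clear()
same_hash_found = LoggedKey("SPAM", log) in table   # hash matches, == says no
same_hash_calls = list(log)

log.clear()
other_found = LoggedKey("eggs", log) in table       # hash differs, == not needed
other_calls = list(log)
```

The equality test only runs once the hashes agree, which is why the Python 2 warning shows up on the has_key() line: that is where the str and unicode keys finally get compared.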



 Not sure how

to proceed here. Hoping for a guideline!

Thanks.





--
Terry Jan Reedy



Re: Unicode filenames

2019-12-06 Thread DL Neil via Python-list

On 7/12/19 7:17 AM, Bob van der Poel wrote:

I have some files which came off the net with, I'm assuming, unicode
characters in the names. I have a very short program which takes the
filename and puts into an emacs buffer, and then lets me add information to
that new file (it's a poor man's DB).

Next, I can look up text in the file and open the saved filename.
Everything works great until I hit those darn unicode filenames.

Just to confuse me even more, the error seems to be coming from a bit of
tkinter code:
  if sresults.has_key(textAtCursor):
      bookname = os.path.expanduser(sresults[textAtCursor].strip())

which generates

   UnicodeWarning: Unicode equal comparison failed to convert both arguments
to Unicode - interpreting them as being unequal  if
sresults.has_key(textAtCursor):

I really don't understand the business about "both arguments". Not sure how
to proceed here. Hoping for a guideline!



(I'm guessing that) the "both arguments" relates to expanduser() because 
this is the first time that the fileNM has been identified to Python as 
anything more than a string of characters.


[a fileNM will be a string of characters, but a string of characters is 
not necessarily a (legal) fileNM!]


Further suggesting that, if you are using Python3 (cf 2), your analysis 
may be the wrong way around. Python3 treats strings as Unicode. However, 
there is, and certainly in the past was, no requirement for the OpSys and 
IOCS to encode file names in Unicode.


The problem (for me) came from MSFT's (for example) many variations of 
ISO-8859-n: there are no clues as to which of these was used in 
naming the file, and thus many possible 'translations' into Unicode.


You can start to address the issue by using Python's bytes (instead of 
strings), however that cold reality still intrudes.


Do you know the provenance of these files, eg they are in French and 
from an MS-Win machine? If so, you may be able to use decode() and 
encode(), but...
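
To make the guessing game concrete, here is a rough Python 3 sketch. The byte string raw_name is an invented example of a legacy file name, and the list of codecs is just a plausible set of suspects for a West European Windows or DOS machine, not a recommendation.

```python
raw_name = b"r\xe9sum\xe9_\x80.txt"   # hypothetical legacy file name, as bytes

candidates = {}
for codec in ("utf-8", "cp1252", "latin-1", "cp850"):
    try:
        candidates[codec] = raw_name.decode(codec)
    except UnicodeDecodeError:
        candidates[codec] = None      # these bytes are not valid in this codec

for codec, text in candidates.items():
    print(codec, ascii(text))

# Once the provenance is known (assumed here: cp1252), re-encode as UTF-8.
fixed = raw_name.decode("cp1252").encode("utf-8")
```

Several codecs decode the same bytes without complaint but give different text, so only knowledge of where the files came from (or a human eye) can pick the right one.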


Still looking for trouble? Knowing a fileNM was in Spanish/Portuguese I 
was able to take the fileNM's individual Unicode characters/surrogates 
and subtract an applicable constant, so that accented letters fell 
'back' into the correct Unicode range. (this is extremely risky, and 
could quite easily make matters worse/more confusing).


I warn you that pursuing this matter involves disappearing down into a 
very deep 'rabbit hole', but YMMV!


WebRefs:
https://docs.python.org/3/howto/unicode.html
https://www.dictionary.com/e/slang/rabbit-hole/
--
Regards =dn


Unicode filenames

2019-12-06 Thread Bob van der Poel
I have some files which came off the net with, I'm assuming, unicode
characters in the names. I have a very short program which takes the
filename and puts into an emacs buffer, and then lets me add information to
that new file (it's a poor man's DB).

Next, I can look up text in the file and open the saved filename.
Everything works great until I hit those darn unicode filenames.

Just to confuse me even more, the error seems to be coming from a bit of
tkinter code:
 if sresults.has_key(textAtCursor):
     bookname = os.path.expanduser(sresults[textAtCursor].strip())

which generates

  UnicodeWarning: Unicode equal comparison failed to convert both arguments
to Unicode - interpreting them as being unequal  if
sresults.has_key(textAtCursor):

I really don't understand the business about "both arguments". Not sure how
to proceed here. Hoping for a guideline!

Thanks.


-- 

 Listen to my FREE CD at http://www.mellowood.ca/music/cedars 
Bob van der Poel ** Wynndel, British Columbia, CANADA **
EMAIL: b...@mellowood.ca
WWW:   http://www.mellowood.ca


Re: Unicode UCS2, UCS4 and ... UCS1

2019-09-19 Thread MRAB

On 2019-09-19 09:55, Gregory Ewing wrote:

Eli the Bearded wrote:

There isn't anything called UCS1.


Apparently there is, but it's not a character set, it's a loudspeaker.

https://www.bhphotovideo.com/c/product/1205978-REG/yorkville_sound_ucs1_1200w_15_horn_loaded.html

The OP might mean Py_UCS1, which is an implementation detail of the 
Flexible String Representation.
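
As an illustration only: on CPython 3.3 and later the effect of the Flexible String Representation can be observed from Python with sys.getsizeof(). The helper bytes_per_char is made up for this sketch, and the measured widths are a CPython implementation detail that other interpreters need not share.

```python
import sys


def bytes_per_char(ch):
    """Estimate storage per character by growing a string of ch by one."""
    return sys.getsizeof(ch * 101) - sys.getsizeof(ch * 100)


widths = {
    "ASCII 'a'": bytes_per_char("a"),
    "Latin-1 U+00E4": bytes_per_char("\u00e4"),
    "BMP U+20AC": bytes_per_char("\u20ac"),
    "astral U+1F600": bytes_per_char("\U0001F600"),
}

for label, width in widths.items():
    print(label, width)
```

The one-byte storage used for strings whose code points all fit in 8 bits is what the C-level type Py_UCS1 refers to; the two- and four-byte forms correspond to Py_UCS2 and Py_UCS4.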



Re: Unicode UCS2, UCS4 and ... UCS1

2019-09-19 Thread Gregory Ewing

Eli the Bearded wrote:

There isn't anything called UCS1.


Apparently there is, but it's not a character set, it's a loudspeaker.

https://www.bhphotovideo.com/c/product/1205978-REG/yorkville_sound_ucs1_1200w_15_horn_loaded.html

--
Greg


Re: Unicode UCS2, UCS4 and ... UCS1

2019-09-17 Thread Chris Angelico
On Wed, Sep 18, 2019 at 6:51 AM Eli the Bearded <*@eli.users.panix.com> wrote:
>
> In comp.lang.python, moi   wrote:
> > I hope, one day, for those who are interested in Unicode,
> > they find a book, publication, ... which will explain
> > what is UCS1.
>
> There isn't anything called UCS1. There is a UTF-1, but don't use it.
> UTF-8 is better in every way.
>
> https://en.wikipedia.org/wiki/Universal_Coded_Character_Set
>

Don't waste your time talking to jmf. He doesn't listen. Most of us
don't see his posts, as they're blocked by the news-mailinglist
gateway, and a lot of newsgroup readers have killfiled him. I
recommend it.

ChrisA


Re: Unicode UCS2, UCS4 and ... UCS1

2019-09-17 Thread Eli the Bearded
In comp.lang.python, moi   wrote:
> I hope, one day, for those who are interested in Unicode,
> they find a book, publication, ... which will explain
> what is UCS1.

There isn't anything called UCS1. There is a UTF-1, but don't use it.
UTF-8 is better in every way.

https://en.wikipedia.org/wiki/Universal_Coded_Character_Set

If you want it in book form, look for the "Create a book" link in the
side bar. I'd suggest 

https://en.wikipedia.org/wiki/Unicode
https://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings
https://en.wikipedia.org/wiki/UTF-8
https://en.wikipedia.org/wiki/UTF-16
https://en.wikipedia.org/wiki/UTF-32

As other things to include in your book.

Elijah
--
doesn't think there is a character encoding newsgroup


Re: unicode mail list archeology

2019-04-20 Thread Luuk

On 20-4-2019 12:47, Luuk wrote:

On 20-4-2019 11:26, wxjmfa...@gmail.com wrote:

http://unicode.org/mail-arch/unicode-ml/Archives-Old/UML018/0594.html



[quote]
 > It is simple to make a compacter version of UTF-8 using the base
 > 256 character codes where possible (compacter for many languages).

No. If you think otherwise, you have completely misunderstood what UTF-8
is all about. Please read the section "What is UTF-8?" in
   http://www.cl.cam.ac.uk/~mgk25/unicode.html
carefully then you will see, why a base256 transfer encoding lacks
essential properties that make UTF-8 so damn useful.
[/quote]

I must be one of the persons who do not understand what base256 transfer 
encoding means.


UTF-8 is, in bytes, just a sequence of 8 bit things, why can it not be 
transferred using a base256 transfer encoding?


$ echo "just my € 0.02 cents" | hexdump -C
6a 75 73 74 20 6d 79 20  e2 82 ac 20 30 2e 30 32 20 63 65 6e 74 73 0a



This is about python...

luuk@computer:$ python
Python 2.7.15rc1 (default, Nov 12 2018, 14:31:15)
[GCC 7.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> a="just my € 0.02 cents"
>>> a
'just my \xe2\x82\xac 0.02 cents'
>>>
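
The same experiment in Python 3, as a small sketch (the classify helper is made up for illustration). In Python 3 the literal is text, so the bytes only appear after an explicit encode(). The high bits of each byte say whether it is ASCII, the start of a multi-byte sequence, or a continuation byte, which is one of the properties the quoted reply refers to.

```python
text = "just my \u20ac 0.02 cents"
data = text.encode("utf-8")


def classify(byte):
    """Classify one UTF-8 byte by its high bits."""
    if byte < 0x80:
        return "ascii"          # 0xxxxxxx
    if byte & 0xC0 == 0x80:
        return "continuation"   # 10xxxxxx
    return "lead"               # 11xxxxxx starts a multi-byte sequence


euro_bytes = data[8:11]                     # the three bytes of the euro sign
kinds = [classify(b) for b in euro_bytes]
print(euro_bytes, kinds)
```

Because lead and continuation bytes are distinguishable, a reader that starts in the middle of a stream can find the next character boundary without decoding from the beginning.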

--
Luuk


Re: unicode mail list archeology

2019-04-20 Thread Luuk

On 20-4-2019 11:26, wxjmfa...@gmail.com wrote:

http://unicode.org/mail-arch/unicode-ml/Archives-Old/UML018/0594.html



[quote]
> It is simple to make a compacter version of UTF-8 using the base
> 256 character codes where possible (compacter for many languages).

No. If you think otherwise, you have completely misunderstood what UTF-8
is all about. Please read the section "What is UTF-8?" in
  http://www.cl.cam.ac.uk/~mgk25/unicode.html
carefully then you will see, why a base256 transfer encoding lacks
essential properties that make UTF-8 so damn useful.
[/quote]

I must be one of the persons who do not understand what base256 transfer 
encoding means.


UTF-8 is, in bytes, just a sequence of 8 bit things, why can it not be 
transferred using a base256 transfer encoding?


$ echo "just my € 0.02 cents" | hexdump -C
6a 75 73 74 20 6d 79 20  e2 82 ac 20 30 2e 30 32 20 63 65 6e 74 73 0a

--
Luuk


Re: Python2.7 unicode conundrum

2018-11-26 Thread Robert Latest via Python-list
Richard Damon wrote:
> Why do you say it has been converted to 'Latin'? The string prints as
> being Unicode. Internally Python doesn't store strings as UTF-8, but as
> plain Unicode (UCS-2 or UCS-4 as needed), and code-point E4 is the
> character you want.

You're right, this wasn't the minimal example for my problem after all.
Turns out that the actual issue is somewhere between SQLAlchemy and
MySQL. I took a more specific question over to stackoverflow.com

Thanks
robert


Re: Python2.7 unicode conundrum

2018-11-25 Thread Richard Damon
On 11/25/18 12:51 PM, Robert Latest via Python-list wrote:
> Hi folks,
> what seemingly started out as a weird database character encoding mix-up
> could be boiled down to a few lines of pure Python. The source-code
> below is real utf8 (as evidenced by the UTF code point 'c3 a4' in the
> third line of the hexdump). When just printed, the string "s" is
> displayed correctly as 'ä' (a umlaut), but the string representation
> shows that it seems to have been converted to latin-1 'e4' somewhere on
> the way.
> How can this be avoided?
>
> dh@jenna:~/python$ cat unicode.py
> # -*- encoding: utf8 -*-
>
> s = u'ä'
>
> print(s)
> print((s, ))
>
> dh@jenna:~/python$ hd unicode.py 
>   23 20 2d 2a 2d 20 65 6e  63 6f 64 69 6e 67 3a 20  |# -*- encoding: |
> 0010  75 74 66 38 20 2d 2a 2d  0a 0a 73 20 3d 20 75 27  |utf8 -*-..s = u'|
> 0020  c3 a4 27 0a 0a 70 72 69  6e 74 28 73 29 0a 70 72  |..'..print(s).pr|
> 0030  69 6e 74 28 28 73 2c 20  29 29 0a 0a  |int((s,))..|
> 003c
> dh@jenna:~/python$ python unicode.py
> ä
> (u'\xe4',)
> dh@jenna:~/python$
>
>
>
Why do you say it has been converted to 'Latin'? The string prints as
being Unicode. Internally Python doesn't store strings as UTF-8, but as
plain Unicode (UCS-2 or UCS-4 as needed), and code-point E4 is the
character you want.

The encoding statement tells python how your source file is encoded.
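
A short Python 3 sketch of the distinction, for illustration (variable names are arbitrary). The number in the repr is the code point; the bytes seen in the hexdump are one particular encoding of it.

```python
s = "\u00e4"                       # one character, a-umlaut

code_point = ord(s)                # 0xE4, independent of any encoding
as_utf8 = s.encode("utf-8")        # two bytes, as seen in the source file
as_latin1 = s.encode("latin-1")    # one byte that happens to equal the code point
shown = ascii((s,))                # escaped display, like the Python 2 repr

print(hex(code_point), as_utf8, as_latin1, shown)
```

The "\xe4" in the repr is an escape for code point U+00E4; it coincides with the Latin-1 byte only because Latin-1 maps its 256 bytes to the first 256 code points.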

-- 
Richard Damon



Re: Python2.7 unicode conundrum

2018-11-25 Thread Thomas Jollans
On 25/11/2018 18:51, Robert Latest via Python-list wrote:
> Hi folks,
> what seemingly started out as a weird database character encoding mix-up
> could be boiled down to a few lines of pure Python. The source-code
> below is real utf8 (as evidenced by the UTF code point 'c3 a4' in the
> third line of the hexdump). When just printed, the string "s" is
> displayed correctly as 'ä' (a umlaut), but the string representation
> shows that it seems to have been converted to latin-1 'e4' somewhere on
> the way.

It's not being converted to latin-1. It's a unicode string, as evidenced
by the 'u'.

u'\xe4' is a unicode string with one character, U+00E4 (ä)

> How can this be avoided?
> 
> dh@jenna:~/python$ cat unicode.py
> # -*- encoding: utf8 -*-
> 
> s = u'ä'
> 
> print(s)
> print((s, ))
> 
> dh@jenna:~/python$ hd unicode.py 
>   23 20 2d 2a 2d 20 65 6e  63 6f 64 69 6e 67 3a 20  |# -*- encoding: |
> 0010  75 74 66 38 20 2d 2a 2d  0a 0a 73 20 3d 20 75 27  |utf8 -*-..s = u'|
> 0020  c3 a4 27 0a 0a 70 72 69  6e 74 28 73 29 0a 70 72  |..'..print(s).pr|
> 0030  69 6e 74 28 28 73 2c 20  29 29 0a 0a  |int((s,))..|
> 003c
> dh@jenna:~/python$ python unicode.py
> ä
> (u'\xe4',)
> dh@jenna:~/python$
> 
> 
> 



Python2.7 unicode conundrum

2018-11-25 Thread Robert Latest via Python-list
Hi folks,
what seemingly started out as a weird database character encoding mix-up
could be boiled down to a few lines of pure Python. The source-code
below is real utf8 (as evidenced by the UTF code point 'c3 a4' in the
third line of the hexdump). When just printed, the string "s" is
displayed correctly as 'ä' (a umlaut), but the string representation
shows that it seems to have been converted to latin-1 'e4' somewhere on
the way.
How can this be avoided?

dh@jenna:~/python$ cat unicode.py
# -*- encoding: utf8 -*-

s = u'ä'

print(s)
print((s, ))

dh@jenna:~/python$ hd unicode.py 
0000  23 20 2d 2a 2d 20 65 6e  63 6f 64 69 6e 67 3a 20  |# -*- encoding: |
0010  75 74 66 38 20 2d 2a 2d  0a 0a 73 20 3d 20 75 27  |utf8 -*-..s = u'|
0020  c3 a4 27 0a 0a 70 72 69  6e 74 28 73 29 0a 70 72  |..'..print(s).pr|
0030  69 6e 74 28 28 73 2c 20  29 29 0a 0a              |int((s, ))..|
003c
dh@jenna:~/python$ python unicode.py
ä
(u'\xe4',)
dh@jenna:~/python$





Re: Email parsing and unicode/utf8

2018-10-15 Thread dieter
Thomas Jollans  writes:
> I just stumbled over some curious behaviour of the stdlib email parsing
> APIs which accept strings rather than bytes. It appears that you can't
> parse an 8-bit UTF-8 message you have as a str without first encoding it.

The primary purpose of an email parser is likely the parsing
of RFC 822/2045 messages which are a sequence of bytes,
encoded as dictated by RFC 822.
Therefore, I would expect some peculiarities when you feed such
a parser with general text.



Email parsing and unicode/utf8

2018-10-15 Thread Thomas Jollans
Hi,

I just stumbled over some curious behaviour of the stdlib email parsing
APIs which accept strings rather than bytes. It appears that you can't
parse an 8-bit UTF-8 message you have as a str without first encoding it.

The docs do mention some problems (which I saw after the fact):

> class email.parser.FeedParser(_factory=None, *, policy=policy.compat32)
> 
> Works like BytesFeedParser except that the input to the feed() method 
> must be a string. This is of limited utility, since the only way for such a 
> message to be valid is for it to contain only ASCII text or, if utf8 is True, 
> no binary attachments.
> 
> Changed in version 3.3: Added the policy keyword.

Okay, cool - let's try parsing a message with text only (no attachments,
no BINARYMIME), with a UTF-8 Content-Type, and a policy with utf8=True.

Python 3.7.1rc2 (default, Oct 14 2018, 15:27:05)
[GCC 8.2.1 20180831 [gcc-8-branch revision 264010]] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import email.parser, email.policy
>>> pol = email.policy.SMTPUTF8
>>> pol.utf8
True
>>> pol.cte_type
'8bit'
>>> msg = '''MIME-Version: 1.0
... Content-Type: text/plain; charset="utf-8"
... Content-Transfer-Encoding: 8bit
... Subject: ¿Will it parse? Нет.
...
... ¡This message contains two (٢) non-ASCII characters!
... '''
>>> fp = email.parser.FeedParser(policy=pol)
>>> fp.feed(msg)
>>> msg_obj = fp.close()
>>> msg_obj

>>> print(msg_obj.get_content())
�This message contains two (\u0662) non-ASCII characters!

>>> print(msg_obj['Subject'])
¿Will it parse? Нет.

I don't know WHAT it's doing with the body there... It doesn't look like
utf8 mode actually did anything. Interesting that the subject header
survived! Maybe this is what the utf8=True does?

>>> email.policy.default.utf8
False
>>> fp2 = email.parser.FeedParser(policy=email.policy.default)
>>> fp2.feed(msg)
>>> msg_obj2 = fp2.close()
>>> print(msg_obj2['Subject'])
¿Will it parse? Нет.

Nope. Apparently, contrary to what my reading of the docs suggests, the
utf8 flag does nothing at all when parsing.

Just to check that this was in fact a perfectly valid email:

>>> bfp = email.parser.BytesFeedParser(policy=pol)
>>> bfp.feed(msg.encode('utf-8'))
>>> msg_objb = bfp.close()
>>> print(msg_objb.get_content())
¡This message contains two (٢) non-ASCII characters!

>>> print(msg_objb['Subject'])
¿Will it parse? Нет.

BytesFeedParser is happy.
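
Based on the session above, a workaround sketch for code that holds the message as a str: encode it first and use the bytes-based entry point. The helper name parse_text_message and the sample message are made up for illustration, and UTF-8 is assumed as the wire encoding.

```python
import email
import email.policy

raw = (
    'MIME-Version: 1.0\n'
    'Content-Type: text/plain; charset="utf-8"\n'
    'Content-Transfer-Encoding: 8bit\n'
    'Subject: test\n'
    '\n'
    '\u00a1This message contains two (\u0662) non-ASCII characters!\n'
)


def parse_text_message(text, policy=email.policy.SMTPUTF8):
    """Encode a str message as UTF-8 and parse it with the bytes parser."""
    return email.message_from_bytes(text.encode("utf-8"), policy=policy)


body = parse_text_message(raw).get_content()
print(ascii(body))
```

This only sidesteps the str parser; it does not answer whether the str behaviour is intended.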

Question: Is this a bug? Am I missing something? Does the clause in the
docs about utf8 mean anything?

Cheers
Thomas


Re: Non-unicode file names

2018-08-09 Thread Thomas Jollans
On 09/08/18 05:13, INADA Naoki wrote:
> Please use Python 3.7.
> 
> Python 3.7 has several improvements in this area.

Thanks! Dimly remembering something about UTF-8 mode, I suspected it
might...

> 
> * When PEP 538 or 540 is used, default error handler for stdio is
> surrogateescape
> * You can sys.stdout.reconfigure(errors='surrogateescape')
> 
> For Python 3.6, I think the best way to allow arbitrary bytes on stdout is
> using the `PYTHONIOENCODING=utf-8:surrogateescape` environment variable.





Re: Non-unicode file names

2018-08-08 Thread Marko Rauhamaa
INADA Naoki :

> For Python 3.6, I think the best way to allow arbitrary bytes on stdout is
> using the `PYTHONIOENCODING=utf-8:surrogateescape` environment variable.

Good info!


Marko


Re: Non-unicode file names

2018-08-08 Thread INADA Naoki
Please use Python 3.7.

Python 3.7 has several improvements in this area.

* When PEP 538 or 540 is used, the default error handler for stdio is
surrogateescape
* You can call sys.stdout.reconfigure(errors='surrogateescape')

For Python 3.6, I think the best way to allow arbitrary bytes on stdout is
using the `PYTHONIOENCODING=utf-8:surrogateescape` environment variable.
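
To show what the error handler changes, here is a self-contained sketch. It writes to an in-memory stream instead of the real stdout so that it runs anywhere; raw_name is an invented non-UTF-8 file name. On 3.7+ the same effect on the real stream would come from the sys.stdout.reconfigure() call mentioned above.

```python
import io

raw_name = b"caf\xe9.txt"                            # hypothetical Latin-1 name
name = raw_name.decode("utf-8", "surrogateescape")   # how such a name reaches str

try:
    name.encode("utf-8")                             # what a strict stream does
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

buffer = io.BytesIO()
out = io.TextIOWrapper(buffer, encoding="utf-8",
                       errors="surrogateescape", newline="\n")
print(name, file=out)                                # no exception this time
out.flush()
written = buffer.getvalue()
```

With surrogateescape on both sides the original bytes round-trip unchanged, so a name that could be read from the file system can also be written back out.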

Regards,


Re: Non-unicode file names

2018-08-08 Thread Cameron Simpson

On 09Aug2018 03:14, MRAB  wrote:
[...]

Is it true that Unix filenames can contain control characters, e.g. \x07?


Yep. They're just byte strings. You can't have \0 (NUL) because the API uses 
NUL terminated strings, and you can't use slash '/' in the filename components 
because that is the component separator. But otherwise you can basically use 
anything - the OS itself doesn't care.


There are some (platform dependent) length limits, and the underlying mounted 
filesystem you're accessing may itself have special rules (eg nonUNIX 
filesystems like FAT32, etc).
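
A quick way to see this from Python 3, sketched under the assumption of a POSIX system and an ordinary local file system (the file name is the one from this thread). Passing bytes paths makes the os functions return bytes, so no decoding is involved at all.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    directory = os.fsencode(tmp)                 # work with bytes paths throughout
    path = os.path.join(directory, b"\x07.py")   # BEL is a legal name byte on POSIX
    with open(path, "wb") as fp:
        fp.write(b"print(__file__)\n")
    names = os.listdir(directory)                # bytes in, bytes out

try:
    os.stat(b"bad\x00name")                      # NUL is refused before the OS sees it
    nul_rejected = False
except ValueError:
    nul_rejected = True

print(names, nul_rejected)
```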



What happens when you print them out?


They get written out? If you're printing to a terminal of some kind then it 
will do whatever the bytes from the filename tell it to, as that's what 
terminals do.



>>> import sys; import subprocess
>>> subprocess.call([sys.executable, '\x07.py'])
.py
0




As you might expect, it beeped when printing '\x07.py' (and showed .py)


And that's OK, is it? :-)


Of course it is :-) \07 is the ASCII BEL character, so it rings the terminal's 
bell.  Modern software terminals emulate that to a better or worse degree.


Suppose you're verbally reciting a filename (or, of course, printing the 
filename to a voder). Only Victor Borge will provide a full verbal 
pronunciation of things [1]


[1] 
https://www.youtube.com/results?search_query=victor+gorge+phonetic+punctuation

Cheers,
Cameron Simpson 


Re: Non-unicode file names

2018-08-08 Thread MRAB

On 2018-08-09 01:14, Thomas Jollans wrote:

On 09/08/18 01:48, MRAB wrote:

On 2018-08-08 23:16, Thomas Jollans wrote:

On *nix, file names are bytes. In real life, we prefer to think of file
names as strings. How non-ASCII file names are created is determined by
the locale, and on most systems these days, every locale uses UTF-8 and
everybody's happy. Of course this doesn't mean you'll never run into an
old directory tree from the pre-UTF8 age using some other encoding, and
it doesn't prevent people from doing silly things in file names.

Python deals with this tolerably well: by convention, file names are
strings, but you can use bytes for file names if you wish. The docs [1]
warn you about the situation.

[1] https://docs.python.org/3/library/os.path.html

If Python runs into a non-UTF8 (better: non-decodable) file name and has
to return a str, it uses surrogate escape codes. So far so good. Right?

This leads to the unfortunate situation that you can't always print()
file names, as print() is strict and refuses to toy with surrogates.

To be more explicit, the script

 print(__file__)

will fail depending on the file name. This feels wrong... (though every
bit of behaviour is correct)

(The situation can't arise on Windows, and Python 2 will pretend nothing
happened in true UNIX style)

Demo script to try at home below.


[snip]

Is it true that Unix filenames can contain control characters, e.g. \x07?

What happens when you print them out?

I think it's not just a problem with surrogate escapes.


Not a problem (or: not an exception), as those are ASCII and thus UTF-8.

Python 3.6.5 (default, Apr  1 2018, 05:46:30)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> with open('\x07.py', 'w') as fp:
...     fp.write('print(__file__)\n')
...
16
>>> import sys; import subprocess
>>> subprocess.call([sys.executable, '\x07.py'])
.py
0




As you might expect, it beeped when printing '\x07.py' (and showed .py)


And that's OK, is it? :-)


Re: Non-unicode file names

2018-08-08 Thread Thomas Jollans
On 09/08/18 01:48, MRAB wrote:
> On 2018-08-08 23:16, Thomas Jollans wrote:
>> On *nix, file names are bytes. In real life, we prefer to think of file
>> names as strings. How non-ASCII file names are created is determined by
>> the locale, and on most systems these days, every locale uses UTF-8 and
>> everybody's happy. Of course this doesn't mean you'll never run into an
>> old directory tree from the pre-UTF8 age using some other encoding, and
>> it doesn't prevent people from doing silly things in file names.
>>
>> Python deals with this tolerably well: by convention, file names are
>> strings, but you can use bytes for file names if you wish. The docs [1]
>> warn you about the situation.
>>
>> [1] https://docs.python.org/3/library/os.path.html
>>
>> If Python runs into a non-UTF8 (better: non-decodable) file name and has
>> to return a str, it uses surrogate escape codes. So far so good. Right?
>>
>> This leads to the unfortunate situation that you can't always print()
>> file names, as print() is strict and refuses to toy with surrogates.
>>
>> To be more explicit, the script
>>
>>  print(__file__)
>>
>> will fail depending on the file name. This feels wrong... (though every
>> bit of behaviour is correct)
>>
>> (The situation can't arise on Windows, and Python 2 will pretend nothing
>> happened in true UNIX style)
>>
>> Demo script to try at home below.
>>
> [snip]
> 
> Is it true that Unix filenames can contain control characters, e.g. \x07?
> 
> What happens when you print them out?
> 
> I think it's not just a problem with surrogate escapes.

Not a problem (or: not an exception), as those are ASCII and thus UTF-8.

Python 3.6.5 (default, Apr  1 2018, 05:46:30)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> with open('\x07.py', 'w') as fp:
...     fp.write('print(__file__)\n')
...
16
>>> import sys; import subprocess
>>> subprocess.call([sys.executable, '\x07.py'])
.py
0
>>>

As you might expect, it beeped when printing '\x07.py' (and showed .py)

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Non-unicode file names

2018-08-08 Thread MRAB

On 2018-08-08 23:16, Thomas Jollans wrote:
> On *nix, file names are bytes. In real life, we prefer to think of file
> names as strings. How non-ASCII file names are created is determined by
> the locale, and on most systems these days, every locale uses UTF-8 and
> everybody's happy. Of course this doesn't mean you'll never run into an
> old directory tree from the pre-UTF8 age using some other encoding, and
> it doesn't prevent people from doing silly things in file names.
>
> Python deals with this tolerably well: by convention, file names are
> strings, but you can use bytes for file names if you wish. The docs [1]
> warn you about the situation.
>
> [1] https://docs.python.org/3/library/os.path.html
>
> If Python runs into a non-UTF8 (better: non-decodable) file name and has
> to return a str, it uses surrogate escape codes. So far so good. Right?
>
> This leads to the unfortunate situation that you can't always print()
> file names, as print() is strict and refuses to toy with surrogates.
>
> To be more explicit, the script
>
>  print(__file__)
>
> will fail depending on the file name. This feels wrong... (though every
> bit of behaviour is correct)
>
> (The situation can't arise on Windows, and Python 2 will pretend nothing
> happened in true UNIX style)
>
> Demo script to try at home below.
>
[snip]

Is it true that Unix filenames can contain control characters, e.g. \x07?

What happens when you print them out?

I think it's not just a problem with surrogate escapes.


Non-unicode file names

2018-08-08 Thread Thomas Jollans
On *nix, file names are bytes. In real life, we prefer to think of file
names as strings. How non-ASCII file names are created is determined by
the locale, and on most systems these days, every locale uses UTF-8 and
everybody's happy. Of course this doesn't mean you'll never run into an
old directory tree from the pre-UTF8 age using some other encoding, and
it doesn't prevent people from doing silly things in file names.

Python deals with this tolerably well: by convention, file names are
strings, but you can use bytes for file names if you wish. The docs [1]
warn you about the situation.

[1] https://docs.python.org/3/library/os.path.html

If Python runs into a non-UTF8 (better: non-decodable) file name and has
to return a str, it uses surrogate escape codes. So far so good. Right?

This leads to the unfortunate situation that you can't always print()
file names, as print() is strict and refuses to toy with surrogates.

To be more explicit, the script

print(__file__)

will fail depending on the file name. This feels wrong... (though every
bit of behaviour is correct)

(The situation can't arise on Windows, and Python 2 will pretend nothing
happened in true UNIX style)

Demo script to try at home below.

-- Thomas


# -*- coding: UTF-8 -*-
from __future__ import unicode_literals, print_function

import sys
import os.path
import subprocess
import tempfile
import shutil

script = 'print(__file__)\n'

file_names = ['🐪.py', '€.py', '€.py'.encode('latin9')]

PY = sys.executable

tmpdir = tempfile.mkdtemp()

for fn in file_names:
    if isinstance(fn, bytes):
        path = os.path.join(tmpdir.encode('ascii'), fn)
    else:
        path = os.path.join(tmpdir, fn)

    print('► creating', path)
    with open(path, 'w') as fp:
        fp.write(script)

    print('► running', PY, path)
    status = subprocess.call([PY, path])
    print('► exited with status', status)

print('► cleaning up')
shutil.rmtree(tmpdir)

# End of script
###
# Output from Python 3.6.5 on Linux (Ubuntu 18.04)::
#
# ► creating /tmp/tmp_a4h5n22/🐪.py
# ► running /usr/bin/python3 /tmp/tmp_a4h5n22/🐪.py
# /tmp/tmp_a4h5n22/🐪.py
# ► exited with status 0
# ► creating /tmp/tmp_a4h5n22/€.py
# ► running /usr/bin/python3 /tmp/tmp_a4h5n22/€.py
# /tmp/tmp_a4h5n22/€.py
# ► exited with status 0
# ► creating b'/tmp/tmp_a4h5n22/\xa4.py'
# ► running /usr/bin/python3 b'/tmp/tmp_a4h5n22/\xa4.py'
# Traceback (most recent call last):
#   File "/tmp/tmp_a4h5n22/\udca4.py", line 1, in <module>
#     print(__file__)
# UnicodeEncodeError: 'utf-8' codec can't encode character '\udca4' in position 17: surrogates not allowed
# ► exited with status 1
# ► cleaning up
#
# Python 2.7.15rc1 on Linux (Ubuntu):
#
# ► creating /tmp/tmp_U_LPp/🐪.py
# ► running /usr/bin/python2 /tmp/tmp_U_LPp/🐪.py
# /tmp/tmp_U_LPp/🐪.py
# ► exited with status 0
# ► creating /tmp/tmp_U_LPp/€.py
# ► running /usr/bin/python2 /tmp/tmp_U_LPp/€.py
# /tmp/tmp_U_LPp/€.py
# ► exited with status 0
# ► creating /tmp/tmp_U_LPp/�.py
# ► running /usr/bin/python2 /tmp/tmp_U_LPp/�.py
# /tmp/tmp_U_LPp/�.py
# ► exited with status 0
# ► cleaning up
#
# Python 3.7.0 on Windows 10::
#
# ► creating C:\Users\tjol\AppData\Local\Temp\tmpzprwnyc2\🐪.py
# ► running C:\Users\tjol\AppData\Local\Programs\Python\Python37\python.exe C:\Users\tjol\AppData\Local\Temp\tmpzprwnyc2\🐪.py
# C:\Users\tjol\AppData\Local\Temp\tmpzprwnyc2\🐪.py
# ► exited with status 0
# ► creating C:\Users\tjol\AppData\Local\Temp\tmpzprwnyc2\€.py
# ► running C:\Users\tjol\AppData\Local\Programs\Python\Python37\python.exe C:\Users\tjol\AppData\Local\Temp\tmpzprwnyc2\€.py
# C:\Users\tjol\AppData\Local\Temp\tmpzprwnyc2\€.py
# ► exited with status 0
# ► creating b'C:\\Users\\tjol\\AppData\\Local\\Temp\\tmpzprwnyc2\\\xa4.py'
# Traceback (most recent call last):
#   File ".\bytes_file_names2.py", line 25, in <module>
#     with open(path, 'w') as fp:
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa4 in position 45: invalid start byte
#
# Python 2.7.15 on Windows 10:
#
# Traceback (most recent call last):
#   File ".\bytes_file_names2.py", line 24, in <module>
#     print('Ôû║ creating', path)
#   File "C:\Python27\lib\encodings\cp850.py", line 12, in encode
#     return codecs.charmap_encode(input,errors,encoding_map)
# UnicodeEncodeError: 'charmap' codec can't encode character u'\u25ba' in position 0: character maps to <undefined>
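
One possible way to print such names without the exception, sketched under
the assumption of a POSIX system where undecodable bytes arrive as
surrogate escapes (U+DC80 to U+DCFF). The helper name printable_name is
hypothetical and not part of the demo script above:

```python
def printable_name(name):
    """Return a str that a strict UTF-8 stream can always encode.

    Surrogate escapes are turned back into their original bytes, and
    those bytes are then shown as \\xNN escapes instead of raising.
    """
    if isinstance(name, bytes):
        raw = name
    else:
        # Undo the surrogateescape decoding to recover the raw bytes.
        raw = name.encode('utf-8', errors='surrogateescape')
    # backslashreplace on decoding is available from Python 3.5 on.
    return raw.decode('utf-8', errors='backslashreplace')


# The problematic name from the Python 3.6.5 run above.
print(printable_name('/tmp/tmp_a4h5n22/\udca4.py'))
```

The result is for display only. To open the file, keep using the original
str or bytes object, since the escaped form no longer names the file.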





Re: Unicode [was Re: Cult-like behaviour]

2018-07-17 Thread Tim Chase
On 2018-07-17 08:37, Marko Rauhamaa wrote:
> Tim Chase :
> > Wait, but now you're talking about vendors. Much of the crux of
> > this discussion has been about personal scripts that don't need to
> > marshal Unicode strings in and out of various functions/objects.  
> 
> In both personal and professional settings, you face the same
> issues. But you don't want to build on something that will
> disappear from the Linux distros.

Right.  Distros are moving away from ASCII-only to proper Unicode
(however it is encoded) support.  Certainly wouldn't want to build on
something that's disappearing from distros, so best to build on
Py3 and Unicode strings.  ;-)

-tkc




Re: Unicode [was Re: Cult-like behaviour]

2018-07-16 Thread Marko Rauhamaa
Tim Chase :

> On 2018-07-16 23:59, Marko Rauhamaa wrote:
>> Tim Chase :
>> > While the python world has moved its efforts into improving
>> > Python3, Python2 hasn't suddenly stopped working.  
>> 
>> The sword of Damocles is hanging on its head. Unless a consortium is
>> erected to support Python2, no vendor will be able to use it in the
>> medium term.
>
> Wait, but now you're talking about vendors. Much of the crux of this
> discussion has been about personal scripts that don't need to
> marshal Unicode strings in and out of various functions/objects.

In both personal and professional settings, you face the same issues.
But you don't want to build on something that will disappear from the
Linux distros.


Marko


Re: Unicode [was Re: Cult-like behaviour]

2018-07-16 Thread Tim Chase
On 2018-07-16 23:59, Marko Rauhamaa wrote:
> Tim Chase :
> > While the python world has moved its efforts into improving
> > Python3, Python2 hasn't suddenly stopped working.  
> 
> The sword of Damocles is hanging on its head. Unless a consortium is
> erected to support Python2, no vendor will be able to use it in the
> medium term.

Wait, but now you're talking about vendors. Much of the crux of this
discussion has been about personal scripts that don't need to
marshal Unicode strings in and out of various functions/objects.

If you have a py2 script that works with py2 and breaks with py3, and
you don't want to update to py3 unicode-strings-by-default, then
stick with py2.  They even coexist nicely on the same machine.

It doesn't have a self-destruct clause.  As long as py2 continues to
build, it will continue to run, which is a long lifetime.  Case in point,
I still have the "joy" of maintaining some py2.4 code that's in
production.  Would I rather upgrade it to 3.x?  You bet.  But the
powers in place are willing to forego python updates in order to not
rock the boat.

-tkc




Unicode is not UTF-32 [was Re: Cult-like behaviour]

2018-07-16 Thread Steven D'Aprano
On Mon, 16 Jul 2018 22:40:13 +0300, Marko Rauhamaa wrote:

> Terry Reedy :
> 
>> On 7/15/2018 5:28 PM, Marko Rauhamaa wrote:
>>> if your new system used Python3's UTF-32 strings as a foundation,
>>
>> Since 3.3, Python's strings are not (always) UTF-32 strings.
> 
> You are right. Python's strings are a superset of UTF-32. More
> accurately, Python's strings are UTF-32 plus surrogate characters.

The first thing you are doing wrong is conflating the semantics of the 
data type with one possible implementation of that data type. UTF-32 is 
implementation, not semantics: it specifies how to represent Unicode code 
points as bytes in memory, not what Unicode code points are.

Python 3 strings are sequences of abstract characters ("code points") 
with no mandatory implementation. In CPython, some string objects are 
encoded in Latin-1. Some are encoded in UTF-16. Some are encoded in 
UTF-32. Some implementations (MicroPython) use UTF-8.
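
A rough way to observe this, assuming CPython 3.3 or later (other
implementations may store strings differently). The helper bytes_per_char
is made up for illustration. It compares two string sizes so that the
fixed object header cancels out:

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate the storage used per character for strings made of ch."""
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) / n


ascii_cost = bytes_per_char('a')            # ASCII only
latin1_cost = bytes_per_char('\xe9')        # fits in Latin-1
bmp_cost = bytes_per_char('\u20ac')         # needs 16 bits per code point
astral_cost = bytes_per_char('\U0001F42A')  # needs 32 bits per code point

print(ascii_cost, latin1_cost, bmp_cost, astral_cost)
```

None of this changes what the string means. len() and indexing still
count code points whichever width is used internally.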

Your second error is a more minor point: it isn't clear (at least not to 
me) that "Unicode plus surrogates" is a superset of Unicode. Surrogates 
are part of Unicode. The only extension here is that Python strings are 
not necessarily well-formed surrogate-free Unicode strings, but they're 
still Unicode strings.


>> Nor are they always UCS-2 (or partly UTF-16) strings. Nor are they
>> always Latin-1 or Ascii strings. Python's Flexible String
>> Representation uses the narrowest possible internal code for any
>> particular string. This is all transparent to the user except for
>> memory size.
> 
> How CPython chooses to represent its strings internally is not what I'm
> talking about.

Then why do you repeatedly talk about the internal storage representation?

UTF-32 is not a character set, it is an encoding. It specifies how to 
implement a sequence of Unicode abstract characters.


>>> UTF-32, after all, is a variable-width encoding.
>>
>> Nope.  It is a fixed-width (32 bits, 4 bytes) encoding.
>>
>> Perhaps you should ask more questions before pontificating.
> 
> You mean each code point is one code point wide. But that's rather an
> irrelevant thing to state.

No, he means that each code point is one code unit wide.
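
A short sketch of what "code unit" means here. The helper unit_counts is
illustrative only. It counts how many code units each encoding needs for a
single code point:

```python
def unit_counts(ch):
    """Return the number of code units used for one code point."""
    return {
        'utf-8': len(ch.encode('utf-8')),            # 1-byte code units
        'utf-16': len(ch.encode('utf-16-le')) // 2,  # 2-byte code units
        'utf-32': len(ch.encode('utf-32-le')) // 4,  # 4-byte code units
    }


for ch in ['a', '\xe9', '\u20ac', '\U0001F42A']:
    print('U+%04X' % ord(ch), unit_counts(ch))
```

Only the UTF-32 column stays at one unit for every code point, which is
what fixed-width refers to. UTF-8 and UTF-16 vary.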


> The main point is that UTF-32 (aka Unicode)

UTF-32 is not a synonym for Unicode. Many legacy encodings don't 
distinguish between the character set and the mapping between bytes and 
characters, but Unicode is not one of those.


> uses one or more code points to represent what people would consider an
> individual character.

That's a reasonable observation to make. But that's not what fixed- and 
variable-width refers to.

So does ASCII, and in both cases, it is irrelevant since the term of art 
is to define fixed- and variable-width in terms of *code points* not 
human meaningful characters. "Character" is context- and language-
dependent and frequently ambiguous. "LL" or "CH" (for example) could be a 
single character or a double character, depending on context and language.
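
To make the difference between code points and perceived characters
concrete, here is a small sketch using the standard unicodedata module.
The sample strings are my own choice:

```python
import unicodedata

composed = '\xe9'        # e with acute as one code point (U+00E9)
decomposed = 'e\u0301'   # e followed by COMBINING ACUTE ACCENT

# Both render as one perceived character but differ in code point count.
print(len(composed), len(decomposed))

# They compare unequal until normalised to the same form.
print(composed == decomposed)
print(unicodedata.normalize('NFC', decomposed) == composed)
```

Either way, every code point involved is still exactly one code unit in
UTF-32.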

Even in ASCII English, something as large as "ough" might be considered 
to be a single unit of language, which some people might choose to call a 
character. (But not a single letter, naturally.) If you don't like that 
example, "qu" is probably a better one: aside from acronyms and loan 
words, no modern English word can fail to follow a Q with a U.


> Code points are about as interesting as individual bytes in UTF-8.

That's your opinion. I see no justification for it.



-- 
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson



Re: Unicode [was Re: Cult-like behaviour]

2018-07-16 Thread Mark Lawrence

On 16/07/18 21:16, Rhodri James wrote:
> On 16/07/18 20:58, Terry Reedy wrote:
>> On 7/16/2018 1:27 PM, Jim Lee wrote:
>>> 90% of the world *is* "beneath my notice" when it comes to
>>> programming for myself.   I really don't care if that's not PC enough
>>> for you.
>>>
>>> Had you actually read my words with *intent* rather than *reaction*,
>>> you would notice that I suggested the *option* of turning off
>>> Unicode.  I didn't say get *rid* of Unicode.  I didn't say make it
>>> *harder* to use Unicode.  Once again - reaction rather than reading.
>>>
>>> Obviously, the most vocal representatives of the Python community are
>>> too sensitive about their language to enable rational discussion.
>>
>> My empirical observation is that the more abrasive posters get
>> rewarded with more response, while my attempts to engage in rational
>> discussion, without ad hominems, gets less.
>
> I wouldn't disagree with you.  Fortunately Jim has pulled the "storming
> off in a huff rather than answer a question anyone actually asked"
> defence, so we can go back to debating about important things like how
> to spell assignment expressions.
>
> Oh wait... :-)

Cheeky :)

--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence



Re: Unicode [was Re: Cult-like behaviour]

2018-07-16 Thread MRAB

On 2018-07-16 21:59, Marko Rauhamaa wrote:
> Tim Chase :
>> While the python world has moved its efforts into improving Python3,
>> Python2 hasn't suddenly stopped working.
>
> The sword of Damocles is hanging on its head. Unless a consortium is
> erected to support Python2, no vendor will be able to use it in the
> medium term.
>
> Given the recent events, maybe that's exactly what's going to happen. A
> business consortium will take it on themselves to support and enhance
> Python2 ad infinitum. I wouldn't be surprised.
>
> (Although it might make me regret my knee-jerk porting effort.)


In open source, it's up to those with the itch to scratch it.

Someone finally did, and it's called Tauthon.


Re: Unicode [was Re: Cult-like behaviour]

2018-07-16 Thread Chris Angelico
On Tue, Jul 17, 2018 at 6:32 AM, Tim Chase
 wrote:
> On 2018-07-16 18:31, Steven D'Aprano wrote:
>> You say that all you want is a switch to turn off Unicode (and
>> replace it with what? Kanji strings? Cyrillic? Shift_JIS? no of
>> course not, I'm being absurd -- replace it with ASCII, what else
>> could any right-thinking person want, right?).
>
> But we already have this.  If I want to turn off Unicode strings, I
> type "python2", and if I want to enable Unicode strings, I type
> "python3".
>
> While the python world has moved its efforts into improving Python3,
> Python2 hasn't suddenly stopped working.  It just stopped receiving
> improvements.  If the "old-man shakes-fist at progress" crowd
> doesn't like unicode strings in Py3, just keep on using Py2.  You
> (generic) won't get arrested.  There are no church^WPython police.

Except that Python 2 still supports Unicode, and Python 3 still
supports bytes. Py3 just makes a stronger distinction between text and
bytes.

>>> b"Hello, %s!" % b"world"
b'Hello, world!'
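
A slightly longer sketch of that stronger distinction in Python 3 (the
helper name try_concat is made up for illustration). Bytes and text each
work on their own, but mixing them needs an explicit decode or encode:

```python
# Bytes formatting with % works in Python 3.5 and later.
greeting = b"Hello, %s!" % b"world"

# Crossing from bytes to text is always an explicit step.
text = greeting.decode('ascii')


def try_concat(a, b):
    """Return a + b, or the TypeError raised when the types don't mix."""
    try:
        return a + b
    except TypeError as exc:
        return exc


print(type(try_concat('text', b'bytes')).__name__)
print(try_concat('text', text))
```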

ChrisA


Re: Unicode [was Re: Cult-like behaviour]

2018-07-16 Thread Marko Rauhamaa
Tim Chase :
> While the python world has moved its efforts into improving Python3,
> Python2 hasn't suddenly stopped working.

The sword of Damocles is hanging on its head. Unless a consortium is
erected to support Python2, no vendor will be able to use it in the
medium term.

Given the recent events, maybe that's exactly what's going to happen. A
business consortium will take it on themselves to support and enhance
Python2 ad infinitum. I wouldn't be surprised.

(Although it might make me regret my knee-jerk porting effort.)


Marko


Re: Unicode [was Re: Cult-like behaviour]

2018-07-16 Thread Tim Chase
On 2018-07-16 18:31, Steven D'Aprano wrote:
> You say that all you want is a switch to turn off Unicode (and
> replace it with what? Kanji strings? Cyrillic? Shift_JIS? no of
> course not, I'm being absurd -- replace it with ASCII, what else
> could any right-thinking person want, right?).

But we already have this.  If I want to turn off Unicode strings, I
type "python2", and if I want to enable Unicode strings, I type
"python3".

While the python world has moved its efforts into improving Python3,
Python2 hasn't suddenly stopped working.  It just stopped receiving
improvements.  If the "old-man shakes-fist at progress" crowd
doesn't like unicode strings in Py3, just keep on using Py2.  You
(generic) won't get arrested.  There are no church^WPython police.

-tkc




Re: Unicode [was Re: Cult-like behaviour]

2018-07-16 Thread Chris Angelico
On Tue, Jul 17, 2018 at 6:16 AM, Rhodri James  wrote:
> On 16/07/18 20:58, Terry Reedy wrote:
>>
>> On 7/16/2018 1:27 PM, Jim Lee wrote:
>>
>>> 90% of the world *is* "beneath my notice" when it comes to programming
>>> for myself.   I really don't care if that's not PC enough for you.
>>>
>>> Had you actually read my words with *intent* rather than *reaction*, you
>>> would notice that I suggested the *option* of turning off Unicode.  I didn't
>>> say get *rid* of Unicode.  I didn't say make it *harder* to use Unicode.
>>> Once again - reaction rather than reading.
>>>
>>> Obviously, the most vocal representatives of the Python community are too
>>> sensitive about their language to enable rational discussion.
>>
>>
>> My empirical observation is that the more abrasive posters get rewarded
>> with more response, while my attempts to engage in rational discussion,
>> without ad hominems, gets less.
>
>
> I wouldn't disagree with you.  Fortunately Jim has pulled the "storming off
> in a huff rather than answer a question anyone actually asked" defence, so
> we can go back to debating about important things like how to spell
> assignment expressions.
>
> Oh wait... :-)
>

+1 QOTD.

ChrisA


Re: Unicode [was Re: Cult-like behaviour]

2018-07-16 Thread Rhodri James

On 16/07/18 20:58, Terry Reedy wrote:
> On 7/16/2018 1:27 PM, Jim Lee wrote:
>> 90% of the world *is* "beneath my notice" when it comes to programming
>> for myself.   I really don't care if that's not PC enough for you.
>>
>> Had you actually read my words with *intent* rather than *reaction*,
>> you would notice that I suggested the *option* of turning off
>> Unicode.  I didn't say get *rid* of Unicode.  I didn't say make it
>> *harder* to use Unicode.  Once again - reaction rather than reading.
>>
>> Obviously, the most vocal representatives of the Python community are
>> too sensitive about their language to enable rational discussion.
>
> My empirical observation is that the more abrasive posters get rewarded
> with more response, while my attempts to engage in rational discussion,
> without ad hominems, gets less.

I wouldn't disagree with you.  Fortunately Jim has pulled the "storming
off in a huff rather than answer a question anyone actually asked"
defence, so we can go back to debating about important things like how
to spell assignment expressions.

Oh wait... :-)

--
Rhodri James *-* Kynesim Ltd


Re: Unicode [was Re: Cult-like behaviour]

2018-07-16 Thread Anders Wegge Keller
På Mon, 16 Jul 2018 11:33:46 -0700
Jim Lee  skrev:

> Go right ahead.  I find it surprising that Stephen isn't banned, 
> considering the fact that he ridicules anyone he doesn't agree with.  
> But I guess he's one of the 'good 'ol boys', and so exempt from the code 
> of conduct.

Well said!

-- 
//Wegge


Re: Unicode [was Re: Cult-like behaviour]

2018-07-16 Thread Terry Reedy

On 7/16/2018 1:27 PM, Jim Lee wrote:
> 90% of the world *is* "beneath my notice" when it comes to programming
> for myself.   I really don't care if that's not PC enough for you.
>
> Had you actually read my words with *intent* rather than *reaction*, you
> would notice that I suggested the *option* of turning off Unicode.  I
> didn't say get *rid* of Unicode.  I didn't say make it *harder* to use
> Unicode.  Once again - reaction rather than reading.
>
> Obviously, the most vocal representatives of the Python community are
> too sensitive about their language to enable rational discussion.

My empirical observation is that the more abrasive posters get rewarded
with more response, while my attempts to engage in rational discussion,
without ad hominems, gets less.


--
Terry Jan Reedy




Re: Unicode [was Re: Cult-like behaviour]

2018-07-16 Thread Terry Reedy

On 7/16/2018 1:13 PM, Jim Lee wrote:
> I just think that a language should allow one to bypass Unicode handling
> easily *when it's not needed*.


Both for patching IDLE and for my currently private work, I usually only 
use Ascii, and no unicode escapes.  When I do, it does not matter 
whether editor and python internally use ascii unicode or ascii bytes. 
So I don't understand 'bypass Unicode handling'.


When I do want to use other characters, whether to test IDLE or just for
fun, Python 3 is much nicer.  Since I have not bothered to learn any
non-English Windows Input Methods, I just use \u or, for non-BMP
chars, \U000n escapes.  I don't need a 'u' prefix or unicode(s,
encoding=???) conversion.  Thus, I was able to expand IDLE's font sample 
of the font selection dialog tab from 40 ascii chars to this.



AaBbCcDdEeFfGgHhIiJj
1234567890#:+=(){}[]
¢£¥§©«®¶½ĞÀÁÂÃÄÅÇÐØß


ɐɕɘɞɟɤɫɮɰɷɻʁʃʆʎʞʢʫʭʯ
ΑαΒβΓγΔδΕεΖζΗηΘθΙιΚκ
БбДдЖжПпФфЧчЪъЭэѠѤѬӜ


אבגדהוזחטיךכלםמןנסעף
ابجدهوزحطي٠١٢٣٤٥٦٧٨٩


०१२३४५६७८९अआइईउऊएऐओऔ
௦௧௨௩௪௫௬௭௮௯அஇஉஎ


〇一二三四五六七八九
汉字漢字人木火土金水
가냐더려모뵤수유즈치
あいうえおアイウエオ

*You* may not care about the non-Ascii parts, but people who use other 
scripts do.



So I don't understand why you are bothered by having the option of 
easily using other characters if you want to, or if external 
circumstances were to compel you.  I love it.
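
A brief sketch of the escape forms mentioned above, with sample characters
of my own choosing. The output uses ascii() so that it does not depend on
the console encoding:

```python
import unicodedata

euro = '\u20ac'          # \u takes four hex digits (BMP code points)
camel = '\U0001F42A'     # \U takes eight hex digits (any code point)
named = '\N{EURO SIGN}'  # lookup by Unicode character name

# No 'u' prefix or unicode(...) call is needed in Python 3.
print(ascii(euro), ascii(camel), named == euro)
print(unicodedata.name(euro), '/', unicodedata.name(camel))
```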


--
Terry Jan Reedy




Re: Unicode [was Re: Cult-like behaviour]

2018-07-16 Thread Rhodri James

On 16/07/18 18:38, Rhodri James wrote:
> Actually having an option of turning off Unicode *does* make it harder
> to use, because you end up coming across programs that have Unicode and
> surprise you when they misbehave.  And yes I saw that 90% of your
> programs aren't intended to get out into the world.  90% is never meant
> to leave the office.  90% of that does anyway.


I meant to say "90% *of my Python code* is never meant to leave the 
office."  Never post when in a hurry :-(


--
Rhodri James *-* Kynesim Ltd


Re: Unicode [was Re: Cult-like behaviour]

2018-07-16 Thread Jim Lee




On 07/16/18 11:31, Steven D'Aprano wrote:

On Mon, 16 Jul 2018 10:27:18 -0700, Jim Lee wrote:


Had you actually read my words with *intent* rather than *reaction*, you
would notice that I suggested the *option* of turning off Unicode.

Yes, I know what you wrote, and I read it with intent.

Jim, you seem to be labouring under the misapprehension that anytime
somebody spots a flaw in your argument, or an unpleasant implication of
your words, it can only be because they must not have read your words
carefully. Believe me, that is not the case.

YOU are the one who raised the specter of politically correct groupthink,
not me. That's dog-whistle politics. But okay, let's move on from that.

You say that all you want is a switch to turn off Unicode (and replace it
with what? Kanji strings? Cyrillic? Shift_JIS? no of course not, I'm being
absurd -- replace it with ASCII, what else could any right-thinking
person want, right?). Let's look at this from a purely technical
perspective:

Python already has two string data types, bytes and text. You want
something that is almost functionally identical to bytes, but to call it
text, presumably because you don't want to have to prefix your strings
with a b"" (that was also Marko's objection to byte strings).

Let's say we do it. Now we have three string implementations that need to
be added, documented, tested, maintained, instead of two.

(Are you volunteering to do this work?)

Now we need to double the testing: every library needs to be tested
twice, once with the "Unicode text" switch on, once with it off, to
ensure that features behave as expected in the appropriate mode.

Is this switch a build-time option, so that we have interpreters built
with support for Unicode and interpreters built without it? We've been
there: it's a horribly bad idea. We used to have Python builds with
threading support, and others without threading support. We used to have
Python builds with "wide Unicode" and others with "narrow Unicode".
Nothing good comes of this design.

Or perhaps the switch is a runtime global option?

Surely you can imagine the opportunities for bugs, both obvious crashing
bugs and non-obvious silent failure bugs, that will occur when users run
libraries intended for one mode under the other mode. Not every library
is going to be fully tested under both modes.

Perhaps it is a compile-time option that only affects the current module,
like the __future__ imports. That's a bit more promising, it might even
use the __future__ infrastructure -- but then you have the problem of
interaction between modules that have this switch enabled and those that
have it disabled.

More complexity, more cruft, more bugs.

It's not clear that your switch gives us *any* advantage at all, except
the warm fuzzy feelings that no dirty foreign characters might creep into
our pure ASCII strings. Hmm, okay, but frankly apart from when I copy and
paste code from the internet and it ends up bringing in en-dashes and
curly quotes instead of hyphens and type-writer quotes, that never
happens to me by accident, and I'm having a lot of trouble seeing how it
could.

If you want ASCII byte strings, you have them right now -- you just have
to use the b"" string syntax.

If you want ASCII strings without the b prefix, you have them right now.
Just use only ASCII characters in your strings.

I'm simply not seeing the advantage of:

 from __future__ import no_unicode
 print("Hello World!")  # stand in for any string handling on ASCII

over

 print("Hello World!")

which works just as well if you control the data you are working with and
know that it is pure ASCII.




Had you spoken this way from the start instead of ridiculing and name 
calling, perhaps we could have reached an agreement.


However, the point is moot, as I have unsubscribed from the list. The 
conversations here (especially yours) are too condescending to waste 
more time with.





Re: Unicode [was Re: Cult-like behaviour]

2018-07-16 Thread Jim Lee



On 07/16/18 10:40, Mark Lawrence wrote:
> On 16/07/18 18:27, Jim Lee wrote:
>> Obviously, the most vocal representatives of the Python community are
>> too sensitive about their language to enable rational discussion.
>
> Please moderators ban this person as he's going down the same line as
> bartc and similar, it is completely unacceptable, he's just the latest
> in a long line of trolls.




That was completely predictable (though I expected it from a different 
person).


Go right ahead.  I find it surprising that Stephen isn't banned, 
considering the fact that he ridicules anyone he doesn't agree with.  
But I guess he's one of the 'good 'ol boys', and so exempt from the code 
of conduct.


Bye guys.



Re: Unicode [was Re: Cult-like behaviour]

2018-07-16 Thread Rhodri James

On 16/07/18 19:31, Steven D'Aprano wrote:
> I'm simply not seeing the advantage of:
>
>  from __future__ import no_unicode
>  print("Hello World!")  # stand in for any string handling on ASCII

Surely this should be "from __past__ import no_unicode"?

gd&r

--
Rhodri James *-* Kynesim Ltd


Re: Unicode [was Re: Cult-like behaviour]

2018-07-16 Thread Steven D'Aprano
On Mon, 16 Jul 2018 10:27:18 -0700, Jim Lee wrote:

> Had you actually read my words with *intent* rather than *reaction*, you
> would notice that I suggested the *option* of turning off Unicode.

Yes, I know what you wrote, and I read it with intent.

Jim, you seem to be labouring under the misapprehension that anytime 
somebody spots a flaw in your argument, or an unpleasant implication of 
your words, it can only be because they must not have read your words 
carefully. Believe me, that is not the case.

YOU are the one who raised the specter of politically correct groupthink, 
not me. That's dog-whistle politics. But okay, let's move on from that.

You say that all you want is a switch to turn off Unicode (and replace it 
with what? Kanji strings? Cyrillic? Shift_JIS? no of course not, I'm being
absurd -- replace it with ASCII, what else could any right-thinking 
person want, right?). Let's look at this from a purely technical 
perspective:

Python already has two string data types, bytes and text. You want 
something that is almost functionally identical to bytes, but to call it 
text, presumably because you don't want to have to prefix your strings 
with a b"" (that was also Marko's objection to byte strings).

Let's say we do it. Now we have three string implementations that need to 
be added, documented, tested, maintained, instead of two.

(Are you volunteering to do this work?)

Now we need to double the testing: every library needs to be tested 
twice, once with the "Unicode text" switch on, once with it off, to 
ensure that features behave as expected in the appropriate mode.

Is this switch a build-time option, so that we have interpreters built 
with support for Unicode and interpreters built without it? We've been 
there: it's a horribly bad idea. We used to have Python builds with 
threading support, and others without threading support. We used to have 
Python builds with "wide Unicode" and others with "narrow Unicode". 
Nothing good comes of this design.

Or perhaps the switch is a runtime global option?

Surely you can imagine the opportunities for bugs, both obvious crashing 
bugs and non-obvious silent failure bugs, that will occur when users run 
libraries intended for one mode under the other mode. Not every library 
is going to be fully tested under both modes.

Perhaps it is a compile-time option that only affects the current module, 
like the __future__ imports. That's a bit more promising, it might even 
use the __future__ infrastructure -- but then you have the problem of 
interaction between modules that have this switch enabled and those that 
have it disabled.

More complexity, more cruft, more bugs.

It's not clear that your switch gives us *any* advantage at all, except 
the warm fuzzy feelings that no dirty foreign characters might creep into 
our pure ASCII strings. Hmm, okay, but frankly apart from when I copy and 
paste code from the internet and it ends up bringing in en-dashes and 
curly quotes instead of hyphens and type-writer quotes, that never 
happens to me by accident, and I'm having a lot of trouble seeing how it 
could.

If you want ASCII byte strings, you have them right now -- you just have 
to use the b"" string syntax.

If you want ASCII strings without the b prefix, you have them right now. 
Just use only ASCII characters in your strings.

I'm simply not seeing the advantage of:

from __future__ import no_unicode
print("Hello World!")  # stand in for any string handling on ASCII

over 

print("Hello World!")

which works just as well if you control the data you are working with and 
know that it is pure ASCII.
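
If a check is wanted anyway, something along these lines would do. This is
only a sketch and the helper name is_pure_ascii is hypothetical. Python 3.7
and later also provide str.isascii() for the same purpose:

```python
def is_pure_ascii(text):
    """Return True if every character in text is ASCII."""
    try:
        text.encode('ascii')
    except UnicodeEncodeError:
        return False
    return True


print(is_pure_ascii("Hello World!"))
# Curly quotes pasted from a web page are the typical accident.
print(is_pure_ascii('\u201cHello World!\u201d'))
```

A check like this belongs at the boundary where data comes in, and the rest
of the program can keep using ordinary str objects.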



-- 
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson


