Re: [Python-Dev] Bytes path support

2014-08-27 Thread Stephen J. Turnbull
Glenn Linderman writes:
 > On 8/27/2014 5:16 AM, Nick Coghlan wrote:

 > > Choosing UTF-8 aims to treat formatting text for communication with 
 > > the user as "just a display issue". It's a low impact design that will 
 > > "just work" for a lot of software, but it comes at a price:
 > >
 > >   * because encoding consistency checks are mostly avoided, data in
 > > different encodings may be freely concatenated and passed on to
 > > other applications. Such data is typically not usable by the
 > > receiving application.
 > 
 > I don't believe this is a necessary result of using UTF-8.

No, it's not, but if you're going to do the same kind of checks that
are necessary for transcoding UTF-8 to abstract Unicode, there's no
benefit to using UTF-8 internally, and you lose a lot.  The only
operations that you can do efficiently are concatenation and
iteration.  I've worked with a UTF-8-like internal encoding for 20
years now -- it's a huge cost.
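
As a rough illustration of the cost described above, here is a minimal sketch of what "give me code point number N" looks like when the text is held as UTF-8 bytes. The helper name utf8_code_point_at is hypothetical and is not taken from Emacs or CPython; it only shows that the lookup has to scan from the start of the buffer.

```python
def utf8_code_point_at(data: bytes, index: int) -> str:
    """Return code point number `index` from UTF-8 bytes by linear scan."""
    if index < 0:
        raise IndexError("negative indexes are not handled in this sketch")
    seen = -1      # index of the most recently started code point
    start = None   # byte offset where the requested code point begins
    for pos, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; anything else starts
        # a new code point.
        if byte & 0xC0 != 0x80:
            seen += 1
            if seen == index:
                start = pos
            elif seen == index + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("code point index out of range")
    return data[start:].decode("utf-8")


text = "a\u00e9\u20ac\U0001F600"   # 1-, 2-, 3- and 4-byte UTF-8 sequences
data = text.encode("utf-8")
scanned = [utf8_code_point_at(data, i) for i in range(len(text))]
```

With a fixed-width array the same lookup is just text[i], with no scan.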

 > Python3 could have evolved to using UTF-8 as its underlying data
 > format, and obtained equal encoding consistency as it has today.

Thank heaven it didn't!

 > One of the choices of Python3, was to retain character indexing as an 
 > underlying arithmetic implementation citing algorithmic speed, but that 
 > is a seldom needed operation,

That simply isn't true.  The negative effects of algorithmic slowness
in Emacsen are visible both as annoying user delays, and as excessive
developer concentration on optimizing a fundamentally insufficient
data structure.

 > and of limited general applicability when considering grapheme
 > clusters.  An iterator based approach can solve both problems,

On the contrary, grapheme clusters are the relatively rare use case in
textual computing, at least currently, that can be optimized for when
necessary.  There's no problem with creating iterators from arrays,
but making an iterator behave like an array ... well, that involves
creating the array.
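
A small sketch of the two points above, using only the standard library. The crude_clusters helper is hypothetical and is a deliberate simplification (base character plus following combining marks); it is not the full grapheme cluster algorithm from the Unicode standard.

```python
import unicodedata

decomposed = "e\u0301"                       # 'e' + COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)


def crude_clusters(text):
    """Yield a base character together with its trailing combining marks."""
    cluster = ""
    for ch in text:
        # A character with combining class 0 starts a new cluster.
        if cluster and unicodedata.combining(ch) == 0:
            yield cluster
            cluster = ""
        cluster += ch
    if cluster:
        yield cluster


# Going from an array (str) to an iterator is trivial ...
clusters_iter = crude_clusters(decomposed + "x")
# ... but getting array behaviour back means materialising the array.
clusters = list(clusters_iter)
```

Indexing clusters[0] only works after list() has built the whole array, which is the asymmetry the paragraph points at.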

 > Such solutions could still be implemented as options.

Sure, but the problems to be solved in that implementation are not due
to Python 3's internal representation.  A lot of painstaking (and
possibly hard?) work remains to be done.

 > A high-performance implementation would likely need to be
 > implemented at least partly in C rather than CPython,

That's how Emacs did it, and (a) over the decades it has involved an
inordinate amount of effort compared to rewriting the text-handling
functions for an array, (b) is fragile, and (c) performance sucks in
practice.

Unicode, not UTF-8, is the central component of the solution.  The
various UTFs are application-specific implementations of Unicode.
UTF-8 is an excellent solution for text streams, such as disk files
and network communication.  Fixed-width representations (ISO-8859-1,
UCS-2, UTF-32, PEP-393) are useful for applications of large buffers
that need O(1) "random" access, and can trivially be iterated for
stream applications.
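
For the PEP 393 case, the fixed width per string can be observed indirectly. This is a sketch that assumes CPython 3.3 or later; sys.getsizeof reports implementation details, so other interpreters may give different numbers. The helper name bytes_per_code_point is made up for the example.

```python
import sys


def bytes_per_code_point(sample: str, count: int = 1000) -> int:
    """Estimate storage per code point from the growth in sys.getsizeof."""
    small = sys.getsizeof(sample)
    large = sys.getsizeof(sample * (count + 1))
    return (large - small) // count


width_ascii = bytes_per_code_point("a")            # ASCII range
width_latin1 = bytes_per_code_point("\u00e9")      # Latin-1 range
width_bmp = bytes_per_code_point("\u20ac")         # rest of the BMP
width_astral = bytes_per_code_point("\U0001F600")  # beyond the BMP
```

On CPython the four values come out as 1, 1, 2 and 4: each string is an array whose element width is chosen from its widest code point, so indexing stays O(1).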

Steve
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Bytes path support

2014-08-27 Thread Nick Coghlan
On 28 Aug 2014 04:20, "Glenn Linderman"  wrote:
>
> On 8/27/2014 5:16 AM, Nick Coghlan wrote:
>>
>> On 27 August 2014 08:52, Nick Coghlan  wrote:
>>>
>>> On 27 Aug 2014 02:52, "Terry Reedy"  wrote:

>>>> Nick, I think the first half of your post is one of the clearest
>>>> expositions yet of 'why Python 3' (in particular, the str to unicode
>>>> change).  It is worthy of wider distribution and without much change, it
>>>> would be a great blog post.
>>>
>>> Indeed, I had the same idea - I had been assuming users already understood
>>> this context, which is almost certainly an invalid assumption.
>>>
>>> The blog post version is already mostly written, but I ran out of weekend.
>>> Will hopefully finish it up and post it some time in the next few days :)
>>
>> Aaand, it's up:
>> http://www.curiousefficiency.org/posts/2014/08/multilingual-programming.html
>>
>> Cheers,
>> Nick.
>>
>
> Indeed, I also enjoyed and found enlightening your response to this
> issue, including the broader historical context. I remember when Unicode
> was first published back in 1991, and it sounded interesting, but far
> removed from the reality of implementations of the day. I was intrigued by
> UTF-8 at the time, and even wrote an encoder and decoder for it for a
> software package that eventually never reached any real customers.
>
> Your blog post says:
>>
>> Choosing UTF-8 aims to treat formatting text for communication with the
>> user as "just a display issue". It's a low impact design that will "just
>> work" for a lot of software, but it comes at a price:
>>
>>   * because encoding consistency checks are mostly avoided, data in
>>     different encodings may be freely concatenated and passed on to other
>>     applications. Such data is typically not usable by the receiving
>>     application.
>
> I don't believe this is a necessary result of using UTF-8. It is a
> possible result, and I guess some implementations are using it this way,
> but a proper language could still provide and/or require proper usage of
> UTF-8 data through its type system just as Python3 is doing with PEP 393.

Yes, Go works that way, for example. I doubt it actually checks for valid
UTF-8 at OS boundaries though - that would be a potentially expensive
check, and as a network service centric language, Go can afford to place
more constraints on the operating environment than we can.
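
The failure mode in the quoted blog passage, where unchecked byte strings in different encodings are concatenated and handed on, can be sketched as follows. The sample words and the choice of Latin-1 and UTF-8 are assumptions made for the illustration.

```python
latin1_part = "caf\u00e9".encode("latin-1")   # b'caf\xe9'
utf8_part = "na\u00efve".encode("utf-8")      # b'na\xc3\xafve'

# Concatenating bytes performs no encoding consistency check at all.
mixed = latin1_part + b" " + utf8_part

try:
    mixed.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Latin-1 accepts any byte sequence, so this "works" but yields mojibake
# for the part that was really UTF-8.
as_latin1 = mixed.decode("latin-1")
```

Neither decoding recovers both words, which is why the receiving application typically cannot use such data.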

> In fact, if it were not for the requirement to support passing character
> strings in other formats (UTF-16, UTF-32) to historical APIs (in CPython
> add-on packages) and the resulting practical performance considerations of
> converting to/from UTF-8 repeatedly when calling those APIs, Python3 could
> have evolved to using UTF-8 as its underlying data format, and obtained
> equal encoding consistency as it has today.

We already have string processing algorithms that work for fixed width
encodings (and are known not to work for variable width encodings, hence
the bugs in Unicode handling on the old narrow builds).

It isn't that variable width encodings aren't a viable choice for
programming language text modelling, it's that the assumption of a fixed
width model is more deeply entrenched in CPython (and especially the C API)
than the exact number of bits used per code point.
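
The narrow-build problem mentioned above comes from counting UTF-16 code units as if they were code points. A minimal sketch, with the narrow-build view simulated by encoding to UTF-16 on a current Python 3 (the helper name is hypothetical):

```python
def utf16_code_units(text: str) -> int:
    """Number of 16-bit code units needed to store `text` as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


bmp_char = "\u20ac"           # EURO SIGN, inside the BMP
astral_char = "\U0001F600"    # outside the BMP, needs a surrogate pair

units_bmp = utf16_code_units(bmp_char)
units_astral = utf16_code_units(astral_char)
code_points_astral = len(astral_char)   # what Python 3.3+ reports
```

An algorithm that assumes one unit per character silently miscounts or splits astral_char, which is the class of bug a fixed-width model avoids.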

> Of course, nothing can be "required" if the user chooses to continue
> operating in the encoded domain, and manipulate data using the necessary
> byte-oriented features of whatever language is in use.
>
> One of the choices of Python3, was to retain character indexing as an
> underlying arithmetic implementation citing algorithmic speed, but that is
> a seldom needed operation, and of limited general applicability when
> considering grapheme clusters.

The choice that was made was to say no to the question "Do we rewrite a
Unicode type that we already know works from scratch?". The decisions about
how to handle *text* were made way back before the PEP process even
existed, and later captured as PEP 100.

What changed in Python 3 was dropping the hybrid 8-bit str type with its
locale dependent behaviour, and parcelling its responsibilities out to
either the existing unicode type (renamed as str, as it was the default
choice), or the new locale independent bytes type.
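
A short sketch of what that split means in practice on Python 3: text and binary data no longer mix implicitly, and crossing between them requires naming an encoding. The function name join_name is made up for the example.

```python
def join_name(prefix: bytes, name: str, encoding: str = "utf-8") -> bytes:
    """Combine binary and text data by encoding the text explicitly."""
    return prefix + name.encode(encoding)


try:
    b"abc" + "def"            # implicit mixing is rejected in Python 3
    mixing_error = None
except TypeError as exc:
    mixing_error = exc

joined = join_name(b"user:", "Z\u00fcrich")
decoded = joined.decode("utf-8")   # decoding is equally explicit
```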

> An iterator based approach can solve both problems, but would have been
> best introduced as part of Python3.0, although it may have made 2to3
> harder, and may have made it less practical to implement six and other "run
> on both Py2 and Py3" type solutions, without introducing those same
> iterative solutions into Python 2.6 or 2.7.

The option of fundamentally changing the text handling design was never on
the table. The Python 2 unicode type works fine, it is the Python 2 str
type that needed changing.

> Such solutions could still be implemented as options. Even PEP 393
> grudgingly supports some use of UTF-8 when requested by the user, as I
> understand it.

Not quite. PEP 393 heavily favours and optimises UTF-8, trading memory for
speed by implicitly caching the UTF-8 representation

Re: [Python-Dev] Bytes path support

2014-08-27 Thread Glenn Linderman

On 8/27/2014 5:16 AM, Nick Coghlan wrote:
> On 27 August 2014 08:52, Nick Coghlan  wrote:
>> On 27 Aug 2014 02:52, "Terry Reedy"  wrote:
>>> Nick, I think the first half of your post is one of the clearest
>>> expositions yet of 'why Python 3' (in particular, the str to unicode
>>> change).  It is worthy of wider distribution and without much change, it
>>> would be a great blog post.
>>
>> Indeed, I had the same idea - I had been assuming users already understood
>> this context, which is almost certainly an invalid assumption.
>>
>> The blog post version is already mostly written, but I ran out of weekend.
>> Will hopefully finish it up and post it some time in the next few days :)
>
> Aaand, it's up:
> http://www.curiousefficiency.org/posts/2014/08/multilingual-programming.html
>
> Cheers,
> Nick.



Indeed, I also enjoyed and found enlightening your response to this 
issue, including the broader historical context. I remember when Unicode 
was first published back in 1991, and it sounded interesting, but far 
removed from the reality of implementations of the day. I was intrigued 
by UTF-8 at the time, and even wrote an encoder and decoder for it for a 
software package that eventually never reached any real customers.


Your blog post says:


> Choosing UTF-8 aims to treat formatting text for communication with
> the user as "just a display issue". It's a low impact design that will
> "just work" for a lot of software, but it comes at a price:
>
>   * because encoding consistency checks are mostly avoided, data in
>     different encodings may be freely concatenated and passed on to
>     other applications. Such data is typically not usable by the
>     receiving application.



I don't believe this is a necessary result of using UTF-8. It is a 
possible result, and I guess some implementations are using it this way, 
but a proper language could still provide and/or require proper usage of 
UTF-8 data through its type system just as Python3 is doing with PEP 
393.  In fact, if it were not for the requirement to support passing 
character strings in other formats (UTF-16, UTF-32) to historical APIs 
(in CPython add-on packages) and the resulting practical performance 
considerations of converting to/from UTF-8 repeatedly when calling those 
APIs, Python3 could have evolved to using UTF-8 as its underlying data 
format, and obtained equal encoding consistency as it has today.


Of course, nothing can be "required" if the user chooses to continue 
operating in the encoded domain, and manipulate data using the necessary 
byte-oriented features of whatever language is in use.


One of the choices of Python3, was to retain character indexing as an 
underlying arithmetic implementation citing algorithmic speed, but that 
is a seldom needed operation, and of limited general applicability when 
considering grapheme clusters. An iterator based approach can solve both 
problems, but would have been best introduced as part of Python3.0, 
although it may have made 2to3 harder, and may have made it less 
practical to implement six and other "run on both Py2 and Py3" type 
solutions, without introducing those same iterative solutions 
into Python 2.6 or 2.7.


Such solutions could still be implemented as options. Even PEP 393 
grudgingly supports some use of UTF-8 when requested by the user, as I 
understand it. Whether such an implementation would be better based on 
bytes or str is uncertain without further analysis, although type 
checking would probably be easier if based on str. A high-performance 
implementation would likely need to be implemented at least partly in C 
rather than CPython, although it could be prototyped in Python for proof 
of functionality. The iterators could obviously be implemented to work 
based on top of solutions such as PEP 393, by simply using indexing 
underneath, when fixed-width characters are available, and other 
techniques when UTF-8 is the only available format (rather than 
converting from UTF-8 to fixed-width characters because of calling the 
iterator).


Re: [Python-Dev] Bytes path support

2014-08-27 Thread Nick Coghlan
On 27 August 2014 08:52, Nick Coghlan  wrote:
> On 27 Aug 2014 02:52, "Terry Reedy"  wrote:
>> Nick, I think the first half of your post is one of the clearest
>> expositions yet of 'why Python 3' (in particular, the str to unicode
>> change).  It is worthy of wider distribution and without much change, it
>> would be a great blog post.
>
> Indeed, I had the same idea - I had been assuming users already understood
> this context, which is almost certainly an invalid assumption.
>
> The blog post version is already mostly written, but I ran out of weekend.
> Will hopefully finish it up and post it some time in the next few days :)

Aaand, it's up:
http://www.curiousefficiency.org/posts/2014/08/multilingual-programming.html

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] Bytes path support

2014-08-26 Thread Stephen J. Turnbull
Nikolaus Rath writes:

 > In that case, maybe it'd be nice to also explain why you use the
 > term "bilingual" for codepage based encoding.

Modern computing systems are written in languages which are invariably
based on syntax expressed using ASCII, and provide by default
functionality for expressing dates etc suitable for rendering American
English.  Thus ASCII (ie, American English) is always an available
language.  Code pages provide facilities for rendering one or more
languages sharing a common coded character set, but are
unsuitable for rendering most of the rest of the world's dozens of
language groups (grouping languages by common character set).

Multilingual has come to mean "able to express (almost) any set of
languages in a single text" (see, for example, Emacs's "HELLO" file),
not just "more than two".  So code pages are closer in spirit to
"bilingual" (two of many) than to "multilingual" (all of many).
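
The "two of many" point can be checked with the codecs that ship with Python. In this sketch cp1252 and cp1251 stand in for "a Western European code page" and "a Cyrillic code page"; those particular code pages and the sample strings are assumptions chosen for the illustration.

```python
def encodable(text: str, encoding: str) -> bool:
    """Return True if `text` can be represented in `encoding`."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


french = "Hello, caf\u00e9"
russian = "Hello, \u041f\u0440\u0438\u0432\u0435\u0442"
both = french + " / " + russian

results = {
    ("french", "cp1252"): encodable(french, "cp1252"),
    ("russian", "cp1251"): encodable(russian, "cp1251"),
    ("both", "cp1252"): encodable(both, "cp1252"),
    ("both", "cp1251"): encodable(both, "cp1251"),
    ("both", "utf-8"): encodable(both, "utf-8"),
}
```

Each code page handles ASCII plus its own language group; only the Unicode encoding handles the single text that mixes them.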

It's messy, analogical terminology.  But then, natural language is
messy and analogical.




Re: [Python-Dev] Bytes path support

2014-08-26 Thread Nikolaus Rath
Nick Coghlan  writes:
>>> As some examples of where bilingual computing breaks down:
>>>
>>> * My NFS client and server may have different locale settings
>>> * My FTP client and server may have different locale settings
>>> * My SSH client and server may have different locale settings
>>> * I save a file locally and send it to someone with a different locale
>>>   setting
>>> * I attempt to access a Windows share from a Linux client (or
>>>   vice-versa)
>>> * I clone my POSIX hosted git or Mercurial repository on a Windows
>>>   client
>>> * I have to connect my Linux client to a Windows Active Directory
>>>   domain (or vice-versa)
>>> * I have to interoperate between native code and JVM code
>>>
>>> The entire computing industry is currently struggling with this
>>> monolingual (ASCII/Extended ASCII/EBCDIC/etc) -> bilingual (locale
>>> encoding/code pages) -> multilingual (Unicode) transition. It's been
>>> going on for decades, and it's still going to be quite some time
>>> before we're done.
>>>
>>> The POSIX world is slowly clawing its way towards a multilingual model
>>> that actually works: UTF-8
>>> Windows (including the CLR) and the JVM adopted a different
>>> multilingual model, but still one that actually works: UTF-16-LE
>>
>> Nick, I think the first half of your post is one of the clearest
>> expositions yet of 'why Python 3' (in particular, the str to unicode
>> change).  It is worthy of wider distribution and without much change, it
>> would be a great blog post.
>
> Indeed, I had the same idea - I had been assuming users already understood
> this context, which is almost certainly an invalid assumption.
>
> The blog post version is already mostly written, but I ran out of weekend.
> Will hopefully finish it up and post it some time in the next few days
> :)

In that case, maybe it'd be nice to also explain why you use the term
"bilingual" for codepage based encoding. At least to me, a
codepage/locale is pretty monolingual, or alternatively covering a whole
region (e.g. western europe). I figure with bilingual you mean ascii +
something, but that's mostly a guess from my side.


Best,
-Nikolaus

-- 
GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F
Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

 »Time flies like an arrow, fruit flies like a Banana.«


Re: [Python-Dev] Bytes path support

2014-08-26 Thread Nick Coghlan
On 27 Aug 2014 02:52, "Terry Reedy"  wrote:
>
> On 8/26/2014 9:11 AM, R. David Murray wrote:
>>
>> On Sun, 24 Aug 2014 13:27:55 +1000, Nick Coghlan  wrote:
>>>
>>> As some examples of where bilingual computing breaks down:
>>>
>>> * My NFS client and server may have different locale settings
>>> * My FTP client and server may have different locale settings
>>> * My SSH client and server may have different locale settings
>>> * I save a file locally and send it to someone with a different locale
>>>   setting
>>> * I attempt to access a Windows share from a Linux client (or vice-versa)
>>> * I clone my POSIX hosted git or Mercurial repository on a Windows client
>>> * I have to connect my Linux client to a Windows Active Directory
>>> domain (or vice-versa)
>>> * I have to interoperate between native code and JVM code
>>>
>>> The entire computing industry is currently struggling with this
>>> monolingual (ASCII/Extended ASCII/EBCDIC/etc) -> bilingual (locale
>>> encoding/code pages) -> multilingual (Unicode) transition. It's been
>>> going on for decades, and it's still going to be quite some time
>>> before we're done.
>>>
>>> The POSIX world is slowly clawing its way towards a multilingual model
>>> that actually works: UTF-8
>>> Windows (including the CLR) and the JVM adopted a different
>>> multilingual model, but still one that actually works: UTF-16-LE
>
>
> Nick, I think the first half of your post is one of the clearest
> expositions yet of 'why Python 3' (in particular, the str to unicode
> change).  It is worthy of wider distribution and without much change, it
> would be a great blog post.

Indeed, I had the same idea - I had been assuming users already understood
this context, which is almost certainly an invalid assumption.

The blog post version is already mostly written, but I ran out of weekend.
Will hopefully finish it up and post it some time in the next few days :)

>> This kind of puts the "length" of the python2->python3 transition
>> period in perspective, doesn't it?

I realised in writing the post that ASCII is over 50 years old at this
point, while Unicode as an official standard is more than 20. By the time
this is done, we'll likely be talking 30+ years for Unicode to displace the
confusing mess that is code pages and locale encodings :)

Cheers,
Nick.



Re: [Python-Dev] Bytes path support

2014-08-26 Thread Terry Reedy

On 8/26/2014 9:11 AM, R. David Murray wrote:

> On Sun, 24 Aug 2014 13:27:55 +1000, Nick Coghlan  wrote:
>> As some examples of where bilingual computing breaks down:
>>
>> * My NFS client and server may have different locale settings
>> * My FTP client and server may have different locale settings
>> * My SSH client and server may have different locale settings
>> * I save a file locally and send it to someone with a different locale setting
>> * I attempt to access a Windows share from a Linux client (or vice-versa)
>> * I clone my POSIX hosted git or Mercurial repository on a Windows client
>> * I have to connect my Linux client to a Windows Active Directory
>>   domain (or vice-versa)
>> * I have to interoperate between native code and JVM code
>>
>> The entire computing industry is currently struggling with this
>> monolingual (ASCII/Extended ASCII/EBCDIC/etc) -> bilingual (locale
>> encoding/code pages) -> multilingual (Unicode) transition. It's been
>> going on for decades, and it's still going to be quite some time
>> before we're done.
>>
>> The POSIX world is slowly clawing its way towards a multilingual model
>> that actually works: UTF-8
>> Windows (including the CLR) and the JVM adopted a different
>> multilingual model, but still one that actually works: UTF-16-LE

Nick, I think the first half of your post is one of the clearest
expositions yet of 'why Python 3' (in particular, the str to unicode
change).  It is worthy of wider distribution and without much change, it
would be a great blog post.

> This kind of puts the "length" of the python2->python3 transition
> period in perspective, doesn't it?


--
Terry Jan Reedy



Re: [Python-Dev] Bytes path support

2014-08-26 Thread R. David Murray
On Sun, 24 Aug 2014 13:27:55 +1000, Nick Coghlan  wrote:
> As some examples of where bilingual computing breaks down:
> 
> * My NFS client and server may have different locale settings
> * My FTP client and server may have different locale settings
> * My SSH client and server may have different locale settings
> * I save a file locally and send it to someone with a different locale setting
> * I attempt to access a Windows share from a Linux client (or vice-versa)
> * I clone my POSIX hosted git or Mercurial repository on a Windows client
> * I have to connect my Linux client to a Windows Active Directory
> domain (or vice-versa)
> * I have to interoperate between native code and JVM code
> 
> The entire computing industry is currently struggling with this
> monolingual (ASCII/Extended ASCII/EBCDIC/etc) -> bilingual (locale
> encoding/code pages) -> multilingual (Unicode) transition. It's been
> going on for decades, and it's still going to be quite some time
> before we're done.
> 
> The POSIX world is slowly clawing its way towards a multilingual model
> that actually works: UTF-8
> Windows (including the CLR) and the JVM adopted a different
> multilingual model, but still one that actually works: UTF-16-LE

This kind of puts the "length" of the python2->python3 transition
period in perspective, doesn't it?
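
The locale mismatch cases in the quoted list (client and server, or writer and reader, configured differently) all reduce to bytes written under one encoding being read under another. A minimal sketch, with the encodings passed explicitly instead of read from real locale settings; the helper name and file name are assumptions for the example.

```python
def reinterpret(text: str, written_as: str, read_as: str) -> str:
    """Encode under the writer's encoding, decode under the reader's."""
    return text.encode(written_as).decode(read_as, errors="replace")


filename = "r\u00e9sum\u00e9.txt"

# Writer uses UTF-8, reader assumes Latin-1: every byte decodes, wrongly.
seen_by_latin1_reader = reinterpret(filename, "utf-8", "latin-1")

# Writer uses Latin-1, reader assumes UTF-8: the bytes are invalid.
seen_by_utf8_reader = reinterpret(filename, "latin-1", "utf-8")

# Both sides agree: the name survives.
seen_when_agreed = reinterpret(filename, "utf-8", "utf-8")
```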

--David


Re: [Python-Dev] Bytes path support

2014-08-26 Thread Martin v. Löwis
Am 24.08.14 03:11, schrieb Greg Ewing:
> Isaac Morland wrote:
>> In HTML 5 it allows non-ASCII-compatible encodings as long as U+FEFF
>> (byte order mark) is used:
>>
>> http://www.w3.org/TR/html-markup/syntax.html#encoding-declaration
>>
>> Not sure about XML.
> 
> According to Appendix F here:
> 
> http://www.w3.org/TR/xml/#sec-guessing
> 
> an XML parser needs to be prepared to try all the encodings it
> supports until it finds one that works well enough to decode
> the XML declaration, then it can find out the exact encoding
> used.

That's not what this section says. Instead, it says that
you need to auto-detect UCS-4, UTF-16, UTF-8 from the BOM,
or guess them or EBCDIC from the encoding of '<?xml'.
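
The detection described in that appendix can be sketched roughly as below. The byte patterns are the ones listed in Appendix F of the XML recommendation; the function name and the returned labels are made up for the example, and a real parser would go on to read the encoding declaration and refine the guess.

```python
import codecs

# Byte order marks. UTF-32 must be tested before UTF-16, because the
# UTF-32-LE mark starts with the same two bytes as the UTF-16-LE mark.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# No mark: look at how the first characters of '<?xml' are encoded.
_DECLARATION_STARTS = [
    (b"\x00\x00\x00\x3c", "utf-32-be"),
    (b"\x3c\x00\x00\x00", "utf-32-le"),
    (b"\x00\x3c\x00\x3f", "utf-16-be"),
    (b"\x3c\x00\x3f\x00", "utf-16-le"),
    (b"\x3c\x3f\x78\x6d", "ascii-compatible"),
    (b"\x4c\x6f\xa7\x94", "ebcdic"),
]


def guess_xml_family(data: bytes) -> str:
    """Guess the encoding family of an XML document from its first bytes."""
    for mark, name in _BOMS:
        if data.startswith(mark):
            return name
    for prefix, name in _DECLARATION_STARTS:
        if data.startswith(prefix):
            return name
    return "utf-8"   # default when there is no mark and no declaration


declaration = '<?xml version="1.0"?>'
guess_utf16 = guess_xml_family(declaration.encode("utf-16-le"))
guess_ebcdic = guess_xml_family(declaration.encode("cp037"))
guess_bom = guess_xml_family(codecs.BOM_UTF32_LE + declaration.encode("utf-32-le"))
```

The point is that the set of candidates is small and fixed; the parser does not try every encoding it supports.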


Re: [Python-Dev] Bytes path support

2014-08-25 Thread Stephen J. Turnbull
Isaac Morland writes:

 > I like your way of putting this - "straight face" indeed.  The third 
 > option really is a hack to allow working around nonsensical situations 
 > (and even the META tag is pretty questionable).  All this complexity 
 > because people can't be bothered to do things properly.

At least in Japan and Russia, doing things "properly" in your sense in
heterogeneous distributed systems is really hard, requiring use of
rather fragile encoding detection heuristics that break at the
slightest whiff of encodings that are unusual in the particular
locale, and in Japan requiring equally fragile transcoding programs
that break on vendor charset variations.  The META "charset" attribute
is useful in those contexts, and the "charset" attribute for external
elements may have been useful in the past as well, although I've never
needed it.

I agree that an environment where "charset" attributes on META and
other elements are needed kinda sucks, but the prerequisite for "doing
things properly" is basically Unicode[1], and that just wasn't going
to happen until at least the 1990s.  To make the transition in less
than several decades would have required a degree of monopoly in
software production that I shudder to contemplate.  Even today there
are programmers around the world grumbling about having to deal with
the Unicode coded character set.


Footnotes: 
[1]  More precisely, a universal coded character set.  TRON code or
MULE code would have done (but yuck!)  ISO 2022 won't do!



Re: [Python-Dev] Bytes path support

2014-08-25 Thread R. David Murray
On Tue, 26 Aug 2014 11:25:19 +0900, "Stephen J. Turnbull"  
wrote:
> R. David Murray writes:
> 
>  > Also, as has been discussed in this thread previously, any program that
>  > deals with filenames is dealing with human readable languages, even
>  > if posix itself treats the filenames as bytes.
> 
> That's a bit extreme.  I can name two interesting applications
> offhand: git's object database and the Coda filesystem's containers.

As soon as I hit send I realized there were a few counter examples :)
So, replace "any" with "most".

--David


Re: [Python-Dev] Bytes path support

2014-08-25 Thread Stephen J. Turnbull
R. David Murray writes:

 > Also, as has been discussed in this thread previously, any program that
 > deals with filenames is dealing with human readable languages, even
 > if posix itself treats the filenames as bytes.

That's a bit extreme.  I can name two interesting applications
offhand: git's object database and the Coda filesystem's containers.

It's true that for debugging purposes bytestrings representing largish
numbers are readably encoded (in hexadecimal and decimal,
respectively), but they're clearly not "human readable" in the sense
you mean.

Nevertheless, these are the applications that prove your rule.  You
don't need the power of pathlib to conveniently (for the programmer)
and efficiently handle the file structures these programs use.
os.path is plenty.
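
As a sketch of the kind of handling meant here, the os.path functions accept bytes throughout, so a content-addressed layout in the style of git's loose objects needs no text decoding at all. The function name object_path is hypothetical, and the layout is shown for illustration only.

```python
import hashlib
import os.path


def object_path(root: bytes, content: bytes) -> bytes:
    """Build a bytes path from the SHA-1 of a blob, git loose-object style."""
    header = b"blob %d\x00" % len(content)
    digest = hashlib.sha1(header + content).hexdigest().encode("ascii")
    # First two hex digits name the directory, the rest name the file.
    return os.path.join(root, digest[:2], digest[2:])


empty_blob_path = object_path(b".git/objects", b"")
directory = os.path.basename(os.path.dirname(empty_blob_path))
leaf = os.path.basename(empty_blob_path)
```

The path components are readable hexadecimal, but nothing here treats them as human language.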




Re: [Python-Dev] Bytes path support

2014-08-25 Thread Isaac Morland

On Sat, 23 Aug 2014, Marko Rauhamaa wrote:


Isaac Morland :


 HTTP/1.1 200 OK
 Content-Type: text/html; charset=ISO-8859-1

 
 
 
 


For HTML it's not quite so bad.  According to the HTML 4 standard:
[...]

The Content-Type header takes precedence over a <META> element. I
thought I read once that the reason was to allow proxy servers to
transcode documents but I don't have a cite for that. Also, the <META>
element "must only be used when the character encoding is organized
such that ASCII-valued bytes stand for ASCII characters" so the
initial UTF-16 example wouldn't be conformant in HTML.


That's not how I read it:

  The META declaration must only be used when the character encoding is
  organized such that ASCII characters stand for themselves (at least
  until the META element is parsed). META declarations should appear as
  early as possible in the HEAD element.

  <http://www.w3.org/TR/1998/REC-html40-19980424/charset.html#doc-char-set>

IOW, you must obey the HTTP character encoding until you have parsed a
conflicting META content-type declaration.



From the same document:


--
To sum up, conforming user agents must observe the following priorities 
when determining a document's character encoding (from highest priority to 
lowest):


An HTTP "charset" parameter in a "Content-Type" field.
A META declaration with "http-equiv" set to "Content-Type" and a value 
set for "charset".
The charset attribute set on an element that designates an external 
resource. 
--


(In the original they are numbered)

This is a priority list - if the Content-Type header gives a charset, it 
takes precedence, and all other sources for the encoding are ignored.  The 
"charset=" attribute on an element designating an external resource is 
only used if it is the only source for the encoding.
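
Read as a priority list, the rule amounts to "first source that is present wins". A minimal sketch of that reading; the function and parameter names are made up for the example and do not come from any HTML library.

```python
from typing import Optional


def effective_charset(
    http_charset: Optional[str] = None,
    meta_charset: Optional[str] = None,
    element_charset: Optional[str] = None,
) -> Optional[str]:
    """Pick the charset by the HTML 4 priority order, highest first."""
    for candidate in (http_charset, meta_charset, element_charset):
        if candidate:
            return candidate
    return None   # the user agent falls back to its own heuristics


header_wins = effective_charset("ISO-8859-1", "utf-8", "utf-16")
meta_wins = effective_charset(None, "utf-8", "utf-16")
attribute_only = effective_charset(element_charset="utf-16")
```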


The "at least until the META element is parsed" bit allows for the use of 
encodings which make use of shifting.  So maybe they start out 
ASCII-compatible, but after a particular shift byte is seen those bytes 
now stand for Japanese Kanji characters until another shift byte is seen. 
This is allowed by the specification, as long as none of the 
non-ASCII-compatible stuff is seen before the META element.
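
ISO-2022-JP is one encoding of this kind; choosing it here is an assumption for illustration, since the text above does not name a specific encoding. The sketch shows that ASCII text stands for itself until the escape sequence, after which ASCII-range bytes mean something else.

```python
ascii_markup = '<meta http-equiv="Content-Type">'
encoded_markup = ascii_markup.encode("iso2022_jp")

japanese = "\u65e5\u672c\u8a9e"          # three Kanji characters
encoded_japanese = japanese.encode("iso2022_jp")

# ESC $ B switches to the Japanese character set, ESC ( B switches back.
shift_in = b"\x1b$B"
shift_out = b"\x1b(B"
payload = encoded_japanese[len(shift_in):-len(shift_out)]
all_seven_bit = all(byte < 0x80 for byte in encoded_japanese)
```

The payload bytes between the two escape sequences are all in the ASCII range, yet they no longer stand for ASCII characters.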



The author of the standard keeps a straight face and continues:


I like your way of putting this - "straight face" indeed.  The third 
option really is a hack to allow working around nonsensical situations 
(and even the META tag is pretty questionable).  All this complexity 
because people can't be bothered to do things properly.



  For cases where neither the HTTP protocol nor the META element
  provides information about the character encoding of a document, HTML
  also provides the charset attribute on several elements. By combining
  these mechanisms, an author can greatly improve the chances that,
  when the user retrieves a resource, the user agent will recognize the
  character encoding.


Isaac Morland       CSCF Web Guru
DC 2554C, x36650    WWW Software Specialist


Re: [Python-Dev] Bytes path support

2014-08-25 Thread R. David Murray
On Sat, 23 Aug 2014 19:33:06 +0300, Marko Rauhamaa  wrote:
> "R. David Murray" :
> 
> > The same problem existed in python2 if your goal was to produce a stream
> > with a consistent encoding, but now python3 treats that as an error.
> 
> I have a different interpretation of the situation: as a rule, use byte
> strings in Python3. Text strings are a special corner case for
> applications that have to deal with human languages.

Clearly, then, you are writing unix (or perhaps posix)-only programs.

Also, as has been discussed in this thread previously, any program that
deals with filenames is dealing with human readable languages, even
if posix itself treats the filenames as bytes.

--David


Re: [Python-Dev] Bytes path support

2014-08-25 Thread Oleg Broytman
Hi! Thank you very much, Nick, for the long and detailed explanation!


Re: [Python-Dev] Bytes path support

2014-08-23 Thread Guido van Rossum
I declare this thread irreparably broken. Do not make any decisions in this
thread. Tell me (in another thread) when it's time to decide and I will.



Re: [Python-Dev] Bytes path support

2014-08-23 Thread Nick Coghlan
On 24 August 2014 04:37, Oleg Broytman  wrote:
> On Sat, Aug 23, 2014 at 06:40:37PM +0100, Paul Moore  
> wrote:
>> Generally, it seems to be mostly a reaction to the repeated claims
>> that Python, or Windows, or whatever, is "broken".
>
>Ah, if that's the only problem I certainly can live with that. My
> problem is that it *seems* this anti-Unix attitude infiltrates Python
> core development. I very much hope I'm wrong and it really isn't.

The POSIX locale based approach to handling encodings is genuinely
broken - it's almost as broken as code pages are on Windows. The
fundamental flaw is that locales encourage *bilingual* computing:
handling English plus one other language correctly. Given a global
internet, bilingual computing *is a fundamentally broken approach*. We
need multilingual computing (any human language, all the time), and
that means Unicode.

As some examples of where bilingual computing breaks down:

* My NFS client and server may have different locale settings
* My FTP client and server may have different locale settings
* My SSH client and server may have different locale settings
* I save a file locally and send it to someone with a different locale setting
* I attempt to access a Windows share from a Linux client (or vice-versa)
* I clone my POSIX hosted git or Mercurial repository on a Windows client
* I have to connect my Linux client to a Windows Active Directory
domain (or vice-versa)
* I have to interoperate between native code and JVM code
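A minimal sketch of how such a mismatch shows up in practice. The file name and the three encodings are illustrative assumptions, not taken from any of the systems listed above: one side encodes a name using its locale encoding, and peers with other locale settings try to decode the same bytes.

```python
# Illustrative only: the name and the encodings below are assumptions.
name = "r\u00e9sum\u00e9.txt"

# The "client" locale is Latin-1, so these are the bytes that get handed over.
wire_bytes = name.encode("latin-1")

def decode_as(encoding):
    """Decode the bytes the way a peer with another locale setting would."""
    try:
        return wire_bytes.decode(encoding)
    except UnicodeDecodeError:
        return None  # the peer cannot interpret the name at all

same_locale = decode_as("latin-1")   # round-trips correctly
cyrillic_peer = decode_as("koi8-r")  # decodes, but to different characters
utf8_peer = decode_as("utf-8")       # fails: the bytes are not valid UTF-8

print(same_locale == name, cyrillic_peer == name, utf8_peer)
```

Only the peer that happens to share the sender's locale sees the intended name; the others get either different text or an error.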

The entire computing industry is currently struggling with this
monolingual (ASCII/Extended ASCII/EBCDIC/etc) -> bilingual (locale
encoding/code pages) -> multilingual (Unicode) transition. It's been
going on for decades, and it's still going to be quite some time
before we're done.

The POSIX world is slowly clawing its way towards a multilingual model
that actually works: UTF-8

Windows (including the CLR) and the JVM adopted a different
multilingual model, but still one that actually works: UTF-16-LE

POSIX is hampered by legacy ASCII defaults in various subsystems (most
notably the default locale) and the assumption that system metadata is
"just bytes" (an assumption that breaks down as soon as you have to
hand that metadata over to another machine that may have different
locale settings)

Windows is hampered by the fact they kept the old 8-bit APIs around
for backwards compatibility purposes, so applications using those APIs
are still only bilingual (at best) rather than multilingual.

JVM and CLR applications will at least handle the Basic Multilingual
Plane (UCS-2) correctly, but may not correctly handle code points
beyond the 16-bit boundary (this is the "Python narrow builds don't
handle Unicode correctly" problem that was resolved for Python 3.3+ by
PEP 393)
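To make the "beyond the 16-bit boundary" point concrete, here is a small sketch, assuming Python 3.3 or later; the code point U+1F600 is an arbitrary example. It compares how many units one non-BMP character occupies in each representation.

```python
# U+1F600 lies outside the Basic Multilingual Plane (an arbitrary example).
ch = "\U0001F600"

# Python 3.3+ (PEP 393) counts code points, so this is one character.
code_points = len(ch)

# UTF-16 needs a surrogate pair: two 16-bit code units for one code point.
utf16_units = len(ch.encode("utf-16-le")) // 2

# UTF-8 uses four bytes for the same code point.
utf8_bytes = len(ch.encode("utf-8"))

print(code_points, utf16_units, utf8_bytes)
```

Code that assumes one 16-bit unit per character will miscount or split this character, which is the failure mode described above.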

Individual users (including some organisations) may have the luxury of
saying "well, all my clients and all my servers are POSIX, so I don't
care about interoperability with other platforms". As the providers of
a cross-platform runtime environment, we don't have that luxury - we
need to figure out how to get *all* the major platforms playing nice
with each other, regardless of whether they chose UTF-8 or UTF-16-LE
as the basis for their approach towards providing multilingual
computing environments.

Historically, that question of cross platform interoperability for
open source software has been handled in a few different ways:

* Don't really interoperate with anybody, reinvent all the wheels (the JVM way)
* Emulate POSIX on Windows (the Cygwin/MinGW way)
* Let the application developer figure it out (the Python 2 way)

The first approach is inordinately expensive - it took the resources
of Sun in its heyday to make it possible, and it effectively locks the
JVM out of certain kinds of computing (e.g. it's hard to do array
oriented programming in JVM languages, because the CPU and GPU
vectorisation features aren't readily accessible).

The second approach prevents the creation of truly native Windows
applications, which makes it uncompelling as a way of attracting
Windows users - it sends a clear signal that the project doesn't
*really* care about supporting Windows as a platform, but instead only
grudgingly accepts that there are Windows users out there that might
like to use their software.

The third approach is the one we tried for a long time with Python 2,
and essentially found to be an "experts only" solution. Yes, you can
*make* it work, but the runtime isn't set up so it works *by default*.

The Unicode changes in Python 3 are a result of the Python core
development team saying "it really shouldn't be this hard for
application developers to get cross-platform interoperability between
correctly configured systems when dealing solely with correctly
encoded data and metadata". The idea of Python 3 is that applications
should require additional complexity solely to deal with *incorrectly*
configured systems and improperly encoded data and metadata (and,
ideally, the dete

Re: [Python-Dev] Bytes path support

2014-08-23 Thread Greg Ewing

Isaac Morland wrote:
> In HTML 5 it allows non-ASCII-compatible encodings as long as U+FEFF
> (byte order mark) is used:
>
> http://www.w3.org/TR/html-markup/syntax.html#encoding-declaration
>
> Not sure about XML.


According to Appendix F here:

http://www.w3.org/TR/xml/#sec-guessing

an XML parser needs to be prepared to try all the encodings it
supports until it finds one that works well enough to decode
the XML declaration, then it can find out the exact encoding
used.
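A rough sketch of that guessing step, under several assumptions: only UTF-8 and the two UTF-16 byte orders are tried (a real parser would try every encoding it supports, including UTF-32 and EBCDIC), and the names `sniff_xml_encoding`, `CANDIDATES` and `DECL` are made up for this illustration.

```python
import codecs
import re

# Encodings tried in order when there is no byte order mark. This short
# list is an assumption of the sketch, not what Appendix F prescribes.
CANDIDATES = ("utf-8", "utf-16-le", "utf-16-be")

BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)

# Matches the encoding pseudo-attribute inside an XML declaration.
DECL = re.compile(
    r'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def sniff_xml_encoding(data, default="utf-8"):
    """Return the declared encoding, or the best guess if none is declared."""
    candidates = CANDIDATES
    for bom, name in BOMS:
        if data.startswith(bom):
            # A BOM settles the encoding family; drop it before decoding.
            data = data[len(bom):]
            candidates = (name,)
            break
    for guess in candidates:
        # Decode just enough to read the declaration, tolerating errors.
        head = data[:200].decode(guess, errors="replace")
        if head.startswith("<?xml"):
            match = DECL.match(head)
            return match.group(1) if match else guess
    # No readable declaration: trust the BOM if there was one.
    return candidates[0] if len(candidates) == 1 else default
```

The first guess that makes the declaration readable is only good enough to find the `encoding` name; the document is then decoded with the encoding it names.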

--
Greg


Re: [Python-Dev] Bytes path support

2014-08-23 Thread Paul Moore
On 23 August 2014 19:37, Oleg Broytman  wrote:
> Unix takes the idea that everything is text and a stream of bytes to
> its extreme.

I don't really understand the idea of "text and a stream of bytes".
The two are fundamentally different in my view. But I guess that's why
we have to agree to differ - our perspectives are just very different.

Paul


Re: [Python-Dev] Bytes path support

2014-08-23 Thread Oleg Broytman
Hi!

On Sat, Aug 23, 2014 at 06:40:37PM +0100, Paul Moore  
wrote:
> On 23 August 2014 16:15, Oleg Broytman  wrote:
> > On Sat, Aug 23, 2014 at 06:02:06PM +0900, "Stephen J. Turnbull" 
> >  wrote:
> >> And that's the big problem with Oleg's complaint, too.  It's not at
> >> all clear what he wants
> >
> >The first thing is I want to understand why people continue to refer
> > to Unix was as "broken". Better yet, to persuade them it's not.

   "Unix was" => "Unix way"

> Generally, it seems to be mostly a reaction to the repeated claims
> that Python, or Windows, or whatever, is "broken".

   Ah, if that's the only problem I certainly can live with that. My
problem is that it *seems* this anti-Unix attitude infiltrates Python
core development. I very much hope I'm wrong and it really isn't.

> Unix advocates (not
> yourself) are prone to declaring anything *other* than the Unix model
> as "broken", so it's tempting to give them a taste of their own
> medicine. Sorry for that (to the extent that I was one of the people
> doing so).

   You didn't see me in my younger years. I surely was one of those
Windows bashers. Please accept my apology.

> Rhetoric aside, none of Unix, Windows or Python are "broken". They
> just react in different ways to fundamentally difficult edge cases.
> 
> But expecting Python (a cross-platform language) to prefer the Unix
> model is putting all the pain on non-Unix users of Python, which I
> don't feel is reasonable. Let's all compromise a little.
> 
> Paul
> 
> PS The key thing *I* think is a problem with the Unix behaviour is
> that it treats filenames as bytes rather than Unicode. People name
> files using *characters*. So every filename is semantically text, in
> the mind of the person who created it. Unix enforces a transformation
> to bytes, but does not retain the encoding of those bytes. So
> information about the original author's intent is lost. But that's a
> historical fact, baked into Unix at a low level. Whether that's
> "broken" or just "something to deal with" is not important to me.

   The problem is hardly specific to Unix. Despite Joel Spolsky's "There
Ain't No Such Thing As Plain Text" people create text files all the
time. Without specifying an encoding. And put filenames into those text
files (audio playlists, like .m3u and .pls are just text files with
pathnames).
   Unix takes the idea that everything is text and a stream of bytes to
its extreme.
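As an illustration of the playlist case, here is a minimal sketch that reads playlist entries as bytes and never guesses an encoding. The function name and the sample playlist are hypothetical, and real .m3u handling has more cases than this.

```python
def playlist_entries(raw):
    """Return the path entries of a playlist as bytes, one per line."""
    entries = []
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and '#' directives such as #EXTM3U or #EXTINF.
        if not line or line.startswith(b"#"):
            continue
        entries.append(line)
    return entries

# Hypothetical playlist whose single entry is not valid UTF-8.
sample = b"#EXTM3U\r\nmusic/\xcd\xe0\xe8\xe2.mp3\r\n\r\n"
paths = playlist_entries(sample)
print(paths)
```

The entries can be passed back to the OS as bytes paths; an encoding is only needed once they have to be shown to a person.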

Oleg.
-- 
 Oleg Broytman            http://phdru.name/            p...@phdru.name
   Programmers don't die, they just GOSUB without RETURN.


Re: [Python-Dev] Bytes path support

2014-08-23 Thread Paul Moore
On 23 August 2014 16:15, Oleg Broytman  wrote:
> On Sat, Aug 23, 2014 at 06:02:06PM +0900, "Stephen J. Turnbull" 
>  wrote:
>> And that's the big problem with Oleg's complaint, too.  It's not at
>> all clear what he wants
>
>The first thing is I want to understand why people continue to refer
> to Unix was as "broken". Better yet, to persuade them it's not.

Generally, it seems to be mostly a reaction to the repeated claims
that Python, or Windows, or whatever, is "broken". Unix advocates (not
yourself) are prone to declaring anything *other* than the Unix model
as "broken", so it's tempting to give them a taste of their own
medicine. Sorry for that (to the extent that I was one of the people
doing so).

Rhetoric aside, none of Unix, Windows or Python are "broken". They
just react in different ways to fundamentally difficult edge cases.

But expecting Python (a cross-platform language) to prefer the Unix
model is putting all the pain on non-Unix users of Python, which I
don't feel is reasonable. Let's all compromise a little.

Paul

PS The key thing *I* think is a problem with the Unix behaviour is
that it treats filenames as bytes rather than Unicode. People name
files using *characters*. So every filename is semantically text, in
the mind of the person who created it. Unix enforces a transformation
to bytes, but does not retain the encoding of those bytes. So
information about the original author's intent is lost. But that's a
historical fact, baked into Unix at a low level. Whether that's
"broken" or just "something to deal with" is not important to me.


Re: [Python-Dev] Bytes path support

2014-08-23 Thread Marko Rauhamaa
"R. David Murray" :

> The same problem existed in python2 if your goal was to produce a stream
> with a consistent encoding, but now python3 treats that as an error.

I have a different interpretation of the situation: as a rule, use byte
strings in Python3. Text strings are a special corner case for
applications that have to deal with human languages.

If your application has to talk SMTP, use bytes.

If your application has to do IPC, use bytes.

If your application has to do file I/O, use bytes.

If your application is a word processor or an IM client, you have text
strings available. You might find, though, that barely any modern GUI
application is satisfied with crude text strings. You will need weights,
styles, sizes, emoticons, positions, directions, shadows, alignment etc
etc so it may be that Python's text strings are only good enough for
storing individual characters or short snippets.

In sum, Python's text strings might have one sweet spot: Usenet clients.


Marko


Re: [Python-Dev] Bytes path support

2014-08-23 Thread Isaac Morland

On Sat, 23 Aug 2014, Marko Rauhamaa wrote:

> "Stephen J. Turnbull" :
>
>> Just read as bytes and decode piecewise in one way or another. For
>> Oleg's HTML case, there's a well-understood structure that can be used
>> to determine retry points
>
> HTML and XML are interesting examples since their encoding is initially
> unknown:
>
>
>  ^
>  +--- Now I know it is UTF-8
>
>
>  ^
>  +--- Now I know it was UTF-16
>   all along!
>
> Then we have:
>
>  HTTP/1.1 200 OK
>  Content-Type: text/html; charset=ISO-8859-1
>
>
>
>
>
>
> See how deep you have to parse the TCP stream before you realize the
> content encoding is UTF-16.


For HTML it's not quite so bad.  According to the HTML 4 standard:

http://www.w3.org/TR/html4/charset.html

The Content-Type header takes precedence over a <meta> element.  I thought 
I read once that the reason was to allow proxy servers to transcode 
documents but I don't have a cite for that.  Also, the <meta> element 
"must only be used when the character encoding is organized such that 
ASCII-valued bytes stand for ASCII characters" so the initial UTF-16 
example wouldn't be conformant in HTML.
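A sketch of that precedence rule in Python, with some loud assumptions: the function name, the loose regular expression, the 1024-byte scan window and the ISO-8859-1 fallback are all choices made for this illustration, and a real user agent follows a far more detailed algorithm.

```python
import re
from email.message import Message

# Deliberately loose pattern for a charset inside a <meta> tag
# (an assumption of this sketch, not the parsing rule of any standard).
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._:-]+)', re.I
)

def pick_encoding(content_type, body, default="iso-8859-1"):
    """Header charset first, then <meta>, then a caller-supplied default."""
    if content_type:
        msg = Message()
        msg["Content-Type"] = content_type
        charset = msg.get_content_charset()
        if charset:
            return charset
    # Scanning raw bytes for ASCII text only works for ASCII-compatible
    # encodings, which is the restriction quoted above.
    match = META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    return default

page = b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
print(pick_encoding("text/html; charset=ISO-8859-1", page))
print(pick_encoding("text/html", page))
```

With both sources present, the header wins; the `<meta>` value is consulted only when the header names no charset.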


In HTML 5 it allows non-ASCII-compatible encodings as long as U+FEFF (byte 
order mark) is used:


http://www.w3.org/TR/html-markup/syntax.html#encoding-declaration

Not sure about XML.

Of course this whole area is a bit of an "arms race" between programmers 
competing to get away with being as sloppy as possible and other 
programmers who have to deal with their mess.


Isaac Morland                   CSCF Web Guru
DC 2554C, x36650                WWW Software Specialist


Re: [Python-Dev] Bytes path support

2014-08-23 Thread Oleg Broytman
On Sat, Aug 23, 2014 at 07:14:47PM +0900, "Stephen J. Turnbull" 
 wrote:
> I cannot believe you are going to find a better environment for
> dealing with these issues than Python 3.

   Well, that may be.

Oleg.
-- 
 Oleg Broytman            http://phdru.name/            p...@phdru.name
   Programmers don't die, they just GOSUB without RETURN.


Re: [Python-Dev] Bytes path support

2014-08-23 Thread Oleg Broytman
On Sat, Aug 23, 2014 at 06:02:06PM +0900, "Stephen J. Turnbull" 
 wrote:
> And that's the big problem with Oleg's complaint, too.  It's not at
> all clear what he wants

   The first thing is I want to understand why people continue to refer
to Unix was as "broken". Better yet, to persuade them it's not.

Oleg.
-- 
 Oleg Broytman            http://phdru.name/            p...@phdru.name
   Programmers don't die, they just GOSUB without RETURN.


Re: [Python-Dev] Bytes path support

2014-08-23 Thread R. David Murray
On Sat, 23 Aug 2014 21:08:29 +1000, Steven D'Aprano  wrote:
> When I started this email, I originally began to say that the actual 
> problem was with byte file names that cannot be decoded into Unicode 
> using the system encoding (typically UTF-8 on Linux systems). But I've 
> actually had difficulty demonstrating that it actually is a problem. I 
> started with a byte sequence which is invalid UTF-8, namely:
> 
> b'ZZ\xdb\xdf\xfa\xff'
> 
> created a file with that name, and then tried listing it with 
> os.listdir. Even in Python 3.1 it worked fine. I was able to list the 
> directory and open the file, so I'm not entirely sure where the problem 
> lies exactly. Can somebody demonstrate the failure mode?

The "failure" happens only when you try to cross from the domain of
posix binary filenames into the domain of text streams (that is, a
stream with a consistent encoding).  If you stick with os interfaces
that handle filenames, Python3 handles posix bytes filenames just fine
(though there may be a few corner-case rough edges yet to be fixed, and
the standard streams were one of them).

The difficulty comes if you try to use a filename that contains
undecodable bytes in a non-os-interface text-context (such as writing it
to a text file that you have declared to be a utf-8 encoding): there you
will get an error...not completely unlike the old "your code works until
your user uses unicode" problem we had in python2, but in this case only
happening in a very narrow set of circumstances involving trying to
translate between one domain (posix binary filenames) and another domain
(io streams with a consistent declared encoding).  This is not a common
operation, but appears to be the one Oleg is concerned about. The old
unicode-blowup errors would happen almost any time someone with a
non-ascii language tried to use a program written by an ascii-only
programmer (which was most of us).

The same problem existed in python2 if your goal was to produce a stream
with a consistent encoding, but now python3 treats that as an error.  If
you really want a stream with an inconsistent encoding, open it as
binary and use the surrogate escape error handler to recover the bytes
in the filenames.  That is, *be explicit* about your intentions.
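A minimal sketch of both sides of that line, using in-memory streams instead of real files. Decoding with UTF-8 explicitly stands in for what `os.fsdecode()` does on a UTF-8 POSIX system; that substitution and the helper name `write_strict` are assumptions of the example.

```python
import io

# Bytes that are not valid UTF-8 (the file name quoted above).
raw_name = b"ZZ\xdb\xdf\xfa\xff"

# Undecodable bytes become lone surrogates instead of raising an error.
text_name = raw_name.decode("utf-8", errors="surrogateescape")

def write_strict(name):
    """Try to write the name to a stream declared to be UTF-8."""
    stream = io.TextIOWrapper(io.BytesIO(), encoding="utf-8")
    try:
        stream.write(name + "\n")
        stream.flush()
    except UnicodeEncodeError:
        return False  # crossing into a consistently encoded stream fails
    return True

strict_ok = write_strict(text_name)

# Being explicit: write to a binary stream and recover the original bytes.
binary = io.BytesIO()
binary.write(text_name.encode("utf-8", errors="surrogateescape") + b"\n")
recovered = binary.getvalue()

print(strict_ok, recovered)
```

The strict text stream refuses the name, while the binary stream with the surrogateescape handler reproduces the bytes exactly as the file system stored them.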

So yes, we've shifted a burden from those who want non-ascii text to
work consistently to those who wanted inconsistently encoded text to "just
work" (or rather *appear* to "just work").  The number of people who
benefit from the improved text model *greatly* outweighs the number of
people inconvenienced by the new strictness when the domain line (posix
binary filenames to consistently encoded text stream) is crossed.  And
the result is more *valid* programs, and fewer unexpected errors
overall, with no inconvenience unless that domain line is crossed,
and even then the inconvenience is limited to the open call that creates
the binary stream.

--David


Re: [Python-Dev] Bytes path support

2014-08-23 Thread Steven D'Aprano
On Fri, Aug 22, 2014 at 11:53:01AM -0700, Chris Barker wrote:

> The point is that if you are reading a file name from the system, and then
> passing it back to the system, then you can treat it as just bytes -- who
> cares? And if you add the byte value of 47 thing, then you can even do
> basic path manipulations. But once you want to do other things with your
> file name, then you need to know the encoding. And it is very, very common
> for users to need to do other things with filenames, and they almost always
> want them as text that they can read and understand.
> 
> Python3 supports this case very well. But it does indeed make it hard to
> work with filenames when you don't know the encoding they are in.

Just "not knowing" is not sufficient. In that case, you'll likely get a 
Unicode string containing mojibake:

# I write a file name using UTF-8 on my system:
filename = 'music by Наӥв.txt'.encode('utf-8')
# You try to use it assuming ISO-8859-7 (Greek)
filename.decode('iso-8859-7')
=> 'music by Π\x9dΠ°Σ₯Π².txt'

which, even though it looks wrong, still lets you refer to the file 
(provided you then encode back to bytes with ISO-8859-7 again). This 
won't always be the case, sometimes the encoding you guess will be 
wrong.
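A short sketch of both outcomes, assuming Python's bundled iso-8859-7 codec; the second file name is a made-up example chosen because its UTF-8 form contains byte 0xD2, which that codec leaves undefined.

```python
original = "music by Наӥв.txt".encode("utf-8")

# Wrong guess, but every byte happens to be defined in ISO-8859-7, so
# decoding and re-encoding gives back exactly the same bytes.
guessed = original.decode("iso-8859-7")
round_trip_ok = guessed.encode("iso-8859-7") == original

# A different (hypothetical) name whose UTF-8 bytes include 0xD2.
other = "қ.txt".encode("utf-8")
try:
    other.decode("iso-8859-7")
    decodable = True
except UnicodeDecodeError:
    decodable = False

print(round_trip_ok, decodable)
```

So a wrong guess is harmless only while every byte of the name is defined in the guessed encoding; otherwise decoding fails outright.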

When I started this email, I originally began to say that the actual 
problem was with byte file names that cannot be decoded into Unicode 
using the system encoding (typically UTF-8 on Linux systems). But I've 
actually had difficulty demonstrating that it actually is a problem. I 
started with a byte sequence which is invalid UTF-8, namely:

b'ZZ\xdb\xdf\xfa\xff'

created a file with that name, and then tried listing it with 
os.listdir. Even in Python 3.1 it worked fine. I was able to list the 
directory and open the file, so I'm not entirely sure where the problem 
lies exactly. Can somebody demonstrate the failure mode?
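The experiment described above can be sketched roughly as follows. It assumes a POSIX file system that accepts arbitrary bytes in names (typical on Linux; some systems reject invalid UTF-8 and would raise OSError here), and the function name and file content are made up for the illustration.

```python
import os
import tempfile

def create_and_list(raw_name):
    """Create a file with a bytes name, then find and read it via str APIs."""
    with tempfile.TemporaryDirectory() as tmp:
        # Build the path in bytes so nothing has to be decoded to create it.
        path = os.path.join(os.fsencode(tmp), raw_name)
        with open(path, "wb") as f:
            f.write(b"payload")
        # Listing with a str path yields str names; undecodable bytes come
        # back as lone surrogates (the surrogateescape error handler).
        (listed,) = os.listdir(tmp)
        with open(os.path.join(tmp, listed), "rb") as f:
            return listed, f.read()
```

Calling `create_and_list(b'ZZ\xdb\xdf\xfa\xff')` on such a system returns a str name that `os.fsencode()` turns back into the original bytes, which is why listing and opening the file both work.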


-- 
Steven


Re: [Python-Dev] Bytes path support

2014-08-23 Thread Stephen J. Turnbull
Oleg Broytman writes:

 >This is the core of the problem. Python2 favors Unix model but
 > Windows people pay the price. Python3 reverses that

This is certainly not true.  What is true is that Python 3 makes no
attempt to make it easy to write crappy software in the old Unix
style, that breaks when unexpected character encodings are encountered.
Python 3 is designed to make it easier to write reliable software,
even if it will only ever be used on one platform.  Nevertheless, it's
still a reasonable language for writing byte-shoveling software, with
the last piece in place as of the acceptance of PEP 461.

As of that PEP, you can use regexps for tokenizing byte streams and
%-formatting to conveniently produce them.  If you want to treat them
piecewise as character streams with different encodings, you have a
large library of codecs, which provide an incremental decoder
interface.  While AFAIK no codec implements a decode-until-error mode,
that's not all that much of a loss, as many encodings overlap.  Eg, if
you start decoding using a latin-1 codec, decoding the whole document
will succeed, even if it switches to windows-1251 in the meantime.
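
A short sketch of the three pieces mentioned above: a regexp over bytes, %-formatting of bytes (Python 3.5 and later, per PEP 461), and a codec's incremental decoder. The sample data and the TOKEN pattern are made up for illustration.

```python
import codecs
import re

# Tokenize a byte stream with a bytes pattern.
TOKEN = re.compile(rb'(\w+)=(\w+)')
line = b'user=oleg lang=ru'
fields = dict(TOKEN.findall(line))

# Produce bytes with %-formatting.
reply = b'%b speaks %b' % (fields[b'user'], fields[b'lang'])

# Decode piecewise: the incremental decoder buffers a character
# that is split across two chunks.
decoder = codecs.getincrementaldecoder('utf-8')()
chunks = [b'\xd0', b'\x9d\xd0\xb0']
text = ''.join(decoder.decode(chunk) for chunk in chunks)
text += decoder.decode(b'', final=True)
```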

Oleg, I gather Russian is your native language.  That's moderately
complicated, I admit.  But the Russians are a distant second to the
Japanese in self-destructive proliferation of incompatible character
coding standards and non-standard variants.  After 24 years of dealing
with the mess that is East Asian encodings (which is even bound up
with the "religion" of Japanese exceptionalism -- some Japanese have
argued that there is a spiritual superiority to Japanese JIS codes!),
I cannot believe you are going to find a better environment for
dealing with these issues than Python 3.




Re: [Python-Dev] Bytes path support

2014-08-23 Thread Marko Rauhamaa
Isaac Morland :

>>  HTTP/1.1 200 OK
>>  Content-Type: text/html; charset=ISO-8859-1
>>
>>  
>>  
>>  
>>  
>
> For HTML it's not quite so bad.  According to the HTML 4 standard:
> [...]
>
> The Content-Type header takes precedence over a META element. I
> thought I read once that the reason was to allow proxy servers to
> transcode documents but I don't have a cite for that. Also, the META
> element "must only be used when the character encoding is organized
> such that ASCII-valued bytes stand for ASCII characters" so the
> initial UTF-16 example wouldn't be conformant in HTML.

That's not how I read it:

   The META declaration must only be used when the character encoding is
   organized such that ASCII characters stand for themselves (at least
   until the META element is parsed). META declarations should appear as
   early as possible in the HEAD element.

   <http://www.w3.org/TR/1998/REC-html40-19980424/charset.html#doc-char-set>

IOW, you must obey the HTTP character encoding until you have parsed a
conflicting META content-type declaration.

The author of the standard keeps a straight face and continues:

   For cases where neither the HTTP protocol nor the META element
   provides information about the character encoding of a document, HTML
   also provides the charset attribute on several elements. By combining
   these mechanisms, an author can greatly improve the chances that,
   when the user retrieves a resource, the user agent will recognize the
   character encoding.


Marko


Re: [Python-Dev] Bytes path support

2014-08-23 Thread Chris Angelico
On Sat, Aug 23, 2014 at 7:02 PM, Stephen J. Turnbull  wrote:
> Chris Barker writes:
>
>  > So I write bytes that are encoded one way into a text file that's encoded
>  > another way, and expect to be able to read that later?
>
> No, not you.  Crap software does that.  Your MUD server.  Oleg's
> favorite web pages with ads, or more likely the ad servers.

Just to clarify: Presumably you're referring to my previous post
regarding my MUD client's heuristic handling of broken encodings. It's
"my server" in the sense of the one that I'm connecting to, and not in
the sense that I control it. I do also run a MUD server, and it
guarantees that everything it sends is UTF-8. (Incidentally, that
server has the exact same set of heuristics for coping with broken
encodings from other clients. There's no escaping it.) Your point is
absolutely right: mess like that is to cope with the fact that there's
broken stuff out there.

ChrisA


Re: [Python-Dev] Bytes path support

2014-08-23 Thread Marko Rauhamaa
"Stephen J. Turnbull" :

> Just read as bytes and decode piecewise in one way or another. For
> Oleg's HTML case, there's a well-understood structure that can be used
> to determine retry points

HTML and XML are interesting examples since their encoding is initially
unknown:

  
  ^
  +--- Now I know it is UTF-8

  
  ^
  +--- Now I know it was UTF-16
   all along!

Then we have:


  HTTP/1.1 200 OK
  Content-Type: text/html; charset=ISO-8859-1

  
  
  
  

See how deep you have to parse the TCP stream before you realize the
content encoding is UTF-16.
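
To make the "initially unknown" part concrete, here is a rough sketch of sniffing an encoding from the first bytes of a document. The helper name sniff_encoding, the 1024-byte window and the default are all assumptions for illustration; UTF-32 and the precedence of an HTTP Content-Type header are deliberately ignored.

```python
import codecs
import re

_DECL = re.compile(rb'''(?:encoding|charset)\s*=\s*["']?([A-Za-z0-9._-]+)''',
                   re.IGNORECASE)

def sniff_encoding(head, default='utf-8'):
    """Guess an encoding from the leading bytes of an XML/HTML document."""
    # A byte order mark settles the question before any markup is read.
    for bom, name in ((codecs.BOM_UTF8, 'utf-8-sig'),
                      (codecs.BOM_UTF16_LE, 'utf-16'),
                      (codecs.BOM_UTF16_BE, 'utf-16')):
        if head.startswith(bom):
            return name
    # No BOM: UTF-16 markup shows up as '<' next to a NUL byte.
    if head.startswith(b'<\x00'):
        return 'utf-16-le'
    if head.startswith(b'\x00<'):
        return 'utf-16-be'
    # ASCII-compatible so far, so a declared encoding can be read as bytes.
    match = _DECL.search(head[:1024])
    return match.group(1).decode('ascii') if match else default
```

Only after this step can the rest of the stream be decoded, which is the point being made above: some bytes have to be parsed before the decoder for them is known.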


Marko


Re: [Python-Dev] Bytes path support

2014-08-23 Thread Stephen J. Turnbull
Chris Barker writes:

 > So I write bytes that are encoded one way into a text file that's encoded
 > another way, and expect to be able to read that later?

No, not you.  Crap software does that.  Your MUD server.  Oleg's
favorite web pages with ads, or more likely the ad servers.

 > Not for me (or many other users) -- terminals are sometimes set
 > with ascii-only encoding,

So?  That means you can't handle text files in general, only those
restricted to ASCII.  That's a completely different issue.

 > Python3 supports this case very well. But it does indeed make it
 > hard to work with filenames when you don't know the encoding they
 > are in.

No, it doesn't.  Reasonably handling "text streams" in unknown,
possibly multiple, encodings is just hard.  Python 3 has nothing to do
with it, and Oleg should know that very well.

It's true that code written in Python 2 to handle these issues needs
to be ported to Python 3.  Thing is, Oleg says "another tool" -- any
non-Python-2 tool will need porting of his code too.

 > And apparently that's pretty common -- or common enough that it
 > would be nice for Python to support it well. This trick is how --
 > we'd like the "just pass it around and do path manipulations" case
 > to work with (almost) arbitrary bytes,

It does.  That's what os.path is for.

 > but everything else to work naturally with text (unicode text).

No gloss, please.  It's text, period.  The internal Unicode encoding
is *not exposed*, with a few (important) exceptions such as Han
unification.

 > I think the way to do this is to abstract the path concept, like pathlib
 > does.

You forgot to append the word "well".

 > From my personal experience, non-ascii filenames are much easier to
 > deal with if I use unicode for filenames everywhere (py2). Somehow,
 > I have yet to be bitten by mixed encoding in filenames.

.gov domain?  ASCII-only terminal settings?  It's not "somehow", it's
that you live a sheltered life.

 > So will using a surrogate-escape error handling with pathlib make
 > all this just work?

Not answerable until you define "all this" more precisely.

And that's the big problem with Oleg's complaint, too.  It's not at
all clear what he wants, except that all of his current code should
continue to work in Python 3.  Just like all of us.  The question then
is persuading him that it's worth moving to Python 3 despite the
effort of porting Python-2-specific code.  Maybe he can be persuaded,
maybe not.  Python 2 is a better than average language.



Re: [Python-Dev] Bytes path support

2014-08-23 Thread Stephen J. Turnbull
Chris Angelico writes:

 > Not sure why 1251,

All of those codes have repertoires that are Cyrillic supersets,
presumably Russian-language content, based on Oleg's top domain.

 > But it's important to note that this is a method of handling junk.
 > It's not a design intention; this is for a situation where I really
 > want to cope with any byte stream and attempt to display it as text.
 > And if I get something that's neither UTF-8 nor CP-1252, I will
 > display it wrongly, and there's nothing can be done about that.

Of course there is.  It just gets more heuristic the more numerous the
potential encodings are.
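
As an illustration of "more heuristic", here is a deliberately crude sketch for the Russian case discussed in this thread. It accepts valid UTF-8 outright and otherwise picks the 8-bit candidate whose decoding contains the most lowercase Cyrillic letters. The scoring rule and the function names are assumptions made for this example; real detectors use far better statistics, and short inputs can easily fool this one.

```python
EIGHT_BIT = ('cp1251', 'koi8-r', 'cp866')

def lowercase_cyrillic(text):
    # Count letters in the basic Russian lowercase range.
    return sum('а' <= ch <= 'я' for ch in text)

def guess_encoding(data, candidates=EIGHT_BIT):
    try:
        data.decode('utf-8')
        return 'utf-8'          # valid UTF-8 is rarely an accident
    except UnicodeDecodeError:
        pass
    best, best_score = None, -1
    for enc in candidates:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue            # candidate has no mapping for some byte
        score = lowercase_cyrillic(text)
        if score > best_score:
            best, best_score = enc, score
    return best
```

The trick works here only because these three code pages put lowercase letters at different byte values, so a wrong guess tends to yield uppercase letters or box-drawing characters.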



Re: [Python-Dev] Bytes path support

2014-08-23 Thread Stephen J. Turnbull
Chris Barker writes:

 > > The third is to specify the UTF-8 with the surrogate escape error
 > > handler.  This allows non-UTF-8 codes to be loaded into
 > > memory.

Read as bytes and incrementally decode.  If you hit an Exception,
retry from that point.
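
A minimal sketch of that retry loop, assuming just two encodings. The function name and the choice of cp1251 as the fallback are hypothetical; deciding which encoding to retry with is the application-specific part.

```python
def decode_piecewise(data, primary='utf-8', fallback='cp1251'):
    """Decode with `primary`; on error, decode the bad bytes with
    `fallback` and retry `primary` from the next position."""
    pieces = []
    pos = 0
    while pos < len(data):
        try:
            pieces.append(data[pos:].decode(primary))
            break
        except UnicodeDecodeError as err:
            # err.start and err.end are offsets into the slice we passed.
            good_end = pos + err.start
            bad_end = pos + err.end
            pieces.append(data[pos:good_end].decode(primary))
            pieces.append(data[good_end:bad_end].decode(fallback, 'replace'))
            pos = bad_end
    return ''.join(pieces)
```

Re-slicing on every error makes this quadratic in the worst case, which is fine for a sketch but not for large inputs.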

 > Just so I'm clear here -- if you write that back out, encoded as
 > utf-8 -- you'll get the exact same binary blob out as came in?

If and only if there are no changes to the content.

 > I wonder if this would make it hard to preserve byte boundaries,
 > though.

I'm not sure what you mean by "byte boundaries".  If you mean
after concatenation of such objects, yes, the uninterpretable bytes
will be encoded in such a way as to be identifiable as lone bytes;
they won't be interpreted as Unicode characters.

 > By the way, IIUC correctly, you can also use the python latin-1
 > decoder -- anything latin-1 will come through correctly, anything
 > not valid latin-1 will come in as garbage, but if you re-encode
 > with latin-1 the original bytes will be preserved. I think this
 > will also preserve a 1:1 relationship between character count and
 > byte count, which could be handy.

Bad idea, especially for Oleg's use case -- you can't decode those by
codec without reencoding to bytes first.  No point in abandoning
codecs just because there isn't one designed for his use case exactly.
Just read as bytes and decode piecewise in one way or another.  For
Oleg's HTML case, there's a well-understood structure that can be used
to determine retry points and a very few plausible coding systems,
which can be fairly well distinguished by the range of bytes used and
probably nearly perfectly with additional information from the
structure and distribution of apparently decoded characters.



Re: [Python-Dev] Bytes path support

2014-08-22 Thread R. David Murray
On Sat, 23 Aug 2014 00:21:18 +0200, Oleg Broytman  wrote:
>I'm involved in developing and maintaining a few big commercial
> projects that will hardly be ported to Python3. So I'm stuck with
> Python2 for many years and I haven't tried Python3. May be I should try
> a small personal project, but certainly not this year. May be the next
> one...

Yes, you should try it.  Really, it's not the monster you are
constructing in your mind.  The functions that read filenames and return
them as text use surrogate escape to preserve the bytes, and the
functions that accept filenames use surrogate escape to recover those
bytes before passing them back to the OS.  So posix binary filenames
just work, as long as the only thing you depend on is being able to
split and join them on the / character (and possibly the . character)
and otherwise treat the names as black boxes...which is exactly the same
situation you are in in python2.

If you need to read filenames out of a file, you'll need to specify the
surrogate escape error handler so that the bytes will be there to be
recovered when you pass them to the file system functions, but it will
work.
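
A small sketch of reading filenames out of a file this way. An in-memory buffer stands in for the file so the example is self-contained; with a real file the equivalent would be open(path, encoding='utf-8', errors='surrogateescape'). The listing contents are made up.

```python
import io

# Second name is cp1251 bytes, which are not valid UTF-8.
listing = b'ok.txt\n\xf0\xe5\xea.mp3\n'

with io.TextIOWrapper(io.BytesIO(listing), encoding='utf-8',
                      errors='surrogateescape', newline='\n') as f:
    names = [line.rstrip('\n') for line in f]

# The original bytes are still recoverable from the str objects;
# on POSIX, os.fsencode(name) does this when talking to the OS.
raw_names = [name.encode('utf-8', 'surrogateescape') for name in names]
```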

Or, as discussed, you can treat them as binary and use the os level
functions that accept binary input (which are exactly the ones you are
used to using in python2).  This includes os.path.split and
os.path.join, which as noted are the only things you can depend on
working correctly when you don't know the encoding of the filenames.
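
For instance, a sketch using posixpath, which is what os.path is on POSIX systems (imported by name here so the example behaves the same on any platform). The path is invented.

```python
import posixpath

raw = b'/music/\xf0\xe5\xea.mp3'

# Bytes in, bytes out: no decoding is attempted.
directory, base = posixpath.split(raw)
stem, ext = posixpath.splitext(base)
moved = posixpath.join(b'/backup', base)
```

Mixing str and bytes components in one call raises TypeError, so the names have to stay bytes all the way through.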

So, the way to look at this is that python3 is no worse[1] than python2 for
handling posix binary filenames, and also provides additional features
if you *do* know the correct encoding of the filenames.

--David

[1] modulo any remaining API bugs (which is exactly where this thread
started: trying to figure out which APIs need to be able to handle
binary paths and/or surrogate escaped paths so that posix filenames
consistently work as well in python3 as they did in python2).


Re: [Python-Dev] Bytes path support

2014-08-22 Thread Chris Angelico
On Sat, Aug 23, 2014 at 8:26 AM, Oleg Broytman  wrote:
> On Sat, Aug 23, 2014 at 07:04:20AM +1000, Chris Angelico  
> wrote:
>> On Sat, Aug 23, 2014 at 6:17 AM, Glenn Linderman  
>> wrote:
>> > "cp1251 of utf-8 encoding" is non-sensical. Either it is cp1251 or it is
>> > utf-8, but it is not both. Maybe you meant "or" instead of "of".
>>
>> I'd assume "or" meant there, rather than "of", it's a common typo.
>>
>> Not sure why 1251, specifically
>
>This is the encoding of Russian Windows. Files and emails in Russia
> are mostly in cp1251 encoding; something like 60-70%, I think. The
> second popular encoding is cp866 (Russian DOS); it's used by Windows as
> OEM encoding.

Yeah, that makes sense. In any case, you pick one "most likely" 8-bit
encoding and go with it.

ChrisA


Re: [Python-Dev] Bytes path support

2014-08-22 Thread Oleg Broytman
On Sat, Aug 23, 2014 at 07:04:20AM +1000, Chris Angelico  
wrote:
> On Sat, Aug 23, 2014 at 6:17 AM, Glenn Linderman  
> wrote:
> > "cp1251 of utf-8 encoding" is non-sensical. Either it is cp1251 or it is
> > utf-8, but it is not both. Maybe you meant "or" instead of "of".
> 
> I'd assume "or" meant there, rather than "of", it's a common typo.
> 
> Not sure why 1251, specifically

   This is the encoding of Russian Windows. Files and emails in Russia
are mostly in cp1251 encoding; something like 60-70%, I think. The
second most popular encoding is cp866 (Russian DOS); it's used by Windows as
OEM encoding.

Oleg.
-- 
 Oleg Broytman            http://phdru.name/            p...@phdru.name
   Programmers don't die, they just GOSUB without RETURN.


Re: [Python-Dev] Bytes path support

2014-08-22 Thread Oleg Broytman
On Fri, Aug 22, 2014 at 11:53:01AM -0700, Chris Barker  
wrote:
> Back in the day, paths were "just strings", and that worked OK with
> py2 strings, because you could put arbitrary bytes in them. But the "py2
> strings were perfect" folks seem to not acknowledge that while they are
> nice for matching the posix filename model, they were a pain in the neck
> when you needed to do something else like write them into a JSON file or
> something.

   This is the core of the problem. Python2 favors the Unix model but
Windows people pay the price. Python3 reverses that and I'm still
thinking about whether I want to pay the new price.

> So will using a surrogate-escape error handling with pathlib make all this
> just work?

   I'm involved in developing and maintaining a few big commercial
projects that are unlikely to be ported to Python3. So I'm stuck with
Python2 for many years and I haven't tried Python3. Maybe I should try
a small personal project, but certainly not this year. Maybe the next
one...

Oleg.
-- 
 Oleg Broytman            http://phdru.name/            p...@phdru.name
   Programmers don't die, they just GOSUB without RETURN.


Re: [Python-Dev] Bytes path support

2014-08-22 Thread Oleg Broytman
On Fri, Aug 22, 2014 at 01:17:44PM -0700, Glenn Linderman 
 wrote:
> >in cp1251 of utf-8 encoding
> 
> "cp1251 of utf-8 encoding" is non-sensical. Either it is cp1251 or
> it is utf-8, but it is not both. Maybe you meant "or" instead of
> "of".

   But of course!

Oleg.
-- 
 Oleg Broytman            http://phdru.name/            p...@phdru.name
   Programmers don't die, they just GOSUB without RETURN.


Re: [Python-Dev] Bytes path support

2014-08-22 Thread Chris Angelico
On Sat, Aug 23, 2014 at 6:17 AM, Glenn Linderman  wrote:
> "cp1251 of utf-8 encoding" is non-sensical. Either it is cp1251 or it is
> utf-8, but it is not both. Maybe you meant "or" instead of "of".

I'd assume "or" was meant there, rather than "of"; it's a common typo.

Not sure why 1251, specifically, but it's not uncommon for boundary
code to attempt a decode that consists of something like "attempt
UTF-8 decode, and if that fails, attempt an eight-bit decode". For my
MUD clients, that's pretty much required; one of the servers I
frequent is completely bytes-oriented, so whatever encoding one client
uses will be dutifully echoed to every other client. There are some
that correctly use UTF-8, but others use whatever they feel like; and
since those naughty clients are mainly on Windows, I can reasonably
guess that they'll be using CP-1252. So that's what I do: UTF-8,
fall-back on 1252. (It's also possible some clients will be using
Latin-1, but 1252 is a superset of that.)
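
Roughly, in Python terms (a sketch of the idea, not the actual client code; the function name is made up):

```python
def decode_junk(data):
    """Try UTF-8 first; if that fails, fall back on Windows-1252."""
    try:
        return data.decode('utf-8')
    except UnicodeDecodeError:
        # Python's cp1252 codec leaves a handful of bytes undefined,
        # so 'replace' keeps the fallback from raising as well.
        return data.decode('cp1252', errors='replace')
```

Trying UTF-8 first is the safe order: text in an 8-bit encoding very rarely happens to be valid UTF-8, while the reverse guess would "succeed" on almost anything.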

But it's important to note that this is a method of handling junk.
It's not a design intention; this is for a situation where I really
want to cope with any byte stream and attempt to display it as text.
And if I get something that's neither UTF-8 nor CP-1252, I will
display it wrongly, and there's nothing can be done about that.

ChrisA


Re: [Python-Dev] Bytes path support

2014-08-22 Thread Chris Barker
On Thu, Aug 21, 2014 at 7:42 PM, Oleg Broytman  wrote:

> On Thu, Aug 21, 2014 at 05:30:14PM -0700, Chris Barker - NOAA Federal <
> chris.bar...@noaa.gov> wrote:
> > This brings up the other key problem. If file names are (almost)
> > arbitrary bytes, how do you write one to/read one from a text file
> > with a particular encoding? ( or for that matter display it on a
> > terminal)
>
>There is no such thing as an encoding of text files. So we just
> write those bytes to the file


So I write bytes that are encoded one way into a text file that's encoded
another way, and expect to be able to read that later? You're kidding,
right? Only if that's the only thing in the file -- usually not the case
with my text files.

> or output them to the terminal. I often do
> that. My filesystems are full of files with names and content in
> at least 3 different encodings - koi8-r, utf-8 and cp1251. So I open a
> terminal with koi8 or utf-8 locale and fonts and some file always look
> weird. But however weird they are it's possible to work with them.
>

Not for me (or many other users) -- terminals are sometimes set with
ascii-only encoding, so non-ascii barfs -- or you get some weird control
characters that mess up your terminal -- dumping arbitrary bytes to a
terminal does not always "just work".


> > And people still want to say posix isn't broken in this regard?
>
>Not at all! And broken or not broken it's what I (for many different
> reasons) prefer to use for my desktops, servers, notebooks, routers and
> smartphones,


Sorry -- that's a Red Herring -- I agree, "broken" or "simple and
consistent" is irrelevant, we all want Python to work as well as it can on
such systems.

The point is that if you are reading a file name from the system, and then
passing it back to the system, then you can treat it as just bytes -- who
cares? And if you add the rule that byte value 47 is the path separator,
then you can even do basic path manipulations. But once you want to do
other things with your file name, then you need to know the encoding. And
it is very, very common for users to need to do other things with
filenames, and they almost always want them as text that they can read and
understand.

Python3 supports this case very well. But it does indeed make it hard to
work with filenames when you don't know the encoding they are in. And
apparently that's pretty common -- or common enough that it would be nice
for Python to support it well. The trick is how -- we'd like the "just
pass it around and do path manipulations" case to work with (almost)
arbitrary bytes, but everything else to work naturally with text (unicode
text).

Which brings us to the "what APIs should accept bytes" question. I think
that's been pretty much answered: All the low-level ones, so that protocol
and library programmers can write code that works on systems with undefined
filename encodings.

But: casual users still need to do the normal things with file names and
paths, and ideally those should work the same way on all systems.

I think the way to do this is to abstract the path concept, like pathlib
does. Back in the day, paths were "just strings", and that worked OK with
py2 strings, because you could put arbitrary bytes in them. But the "py2
strings were perfect" folks seem to not acknowledge that while they are
nice for matching the posix filename model, they were a pain in the neck
when you needed to do something else like write them into a JSON file or
something. From my personal experience, non-ascii filenames are much easier
to deal with if I use unicode for filenames everywhere (py2). Somehow, I
have yet to be bitten by mixed encoding in filenames.

So will using a surrogate-escape error handling with pathlib make all this
just work?

-Chris

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Python-Dev] Bytes path support

2014-08-22 Thread Chris Barker
On Fri, Aug 22, 2014 at 10:09 AM, Glenn Linderman 
wrote:

> What encoding does have a text file (an HTML, to be precise) with
> text in utf-8, ads in cp1251 (ad blocks were included from different
> files) and comments in koi8-r?
>Well, I must admit the HTML was rather an exception, but having a
> text file with some strange characters (binary strings, or paragraphs
> in different encodings) is not that exceptional.
>
>  That's not a text file. That's a binary file containing (hopefully
> delimited, and documented) sections of encoded text in different
> encodings.
>
> Allow me to disagree. For me, this is a text file which I can (and
> do) view with a pager, edit with a text editor, list on a console,
> search with grep and so on. If it is not a text file by strict Python3
> standards then these standards are too strict for me. Either I find a
> simple workaround in Python3 to work with such texts or find a different
> tool. I cannot avoid such files because my reality is much more complex
> than strict text/binary dichotomy in Python3.
>
First -- we're getting OT here -- this thread was about file and path
names, not the contents of files. But I suppose I brought that in when I
talked about writing file names to files...

> The first I'll mention is the one that follows from my description of what
> your file really is: Python3 allows opening files in binary mode, and then
> decoding various sections of it using whatever encoding you like, using the
> bytes.decode() operation on various sections of the file. Determination of
> which sections are in which encodings is beyond the scope of this
> description of the technique, and is application dependent.
>

Right -- and you would have wanted to open such a file in binary mode with
py2 as well, but in that case, you'd have the contents in a py2 string
object, which has a few more convenient ways to work with text (at least
ascii-compatible) than the py3 bytes object does.

> The third is to specify the UTF-8 with the surrogate escape error handler.
> This allows non-UTF-8 codes to be loaded into memory. You, or algorithms as
> smart as you, could perhaps be developed to detect and manipulate the
> resulting "lone surrogate" codes in meaningful ways, or could simply allow
> them to ride along without interpretation, and be emitted as the original,
> into other files.
>

Just so I'm clear here -- if you write that back out, encoded as utf-8 --
you'll get the exact same binary blob out as came in?

I wonder if this would make it hard to preserve byte boundaries, though.

By the way, IIUC, you can also use the python latin-1 decoder --
anything latin-1 will come through correctly, anything not valid latin-1
will come in as garbage, but if you re-encode with latin-1 the original
bytes will be preserved. I think this will also preserve a 1:1 relationship
between character count and byte count, which could be handy.
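
A quick sketch of that property, with made-up sample data. It also shows the catch: to get real text out of a section later, the str has to be turned back into bytes before a proper codec can be applied.

```python
data = bytes(range(256))

# latin-1 maps every byte to the code point with the same number,
# so decoding never fails and lengths match.
text = data.decode('latin-1')
same_length = len(text) == len(data)
restored = text.encode('latin-1')

# A cp1251 section read this way is unreadable until it is
# re-encoded to bytes and decoded with the right codec.
as_latin1 = 'реклама'.encode('cp1251').decode('latin-1')
fixed = as_latin1.encode('latin-1').decode('cp1251')
```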

-Chris



-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Python-Dev] Bytes path support

2014-08-22 Thread Glenn Linderman

On 8/22/2014 11:50 AM, Oleg Broytman wrote:
> On Fri, Aug 22, 2014 at 10:09:21AM -0700, Glenn Linderman wrote:
>> On 8/22/2014 9:52 AM, Oleg Broytman wrote:
>>> On Fri, Aug 22, 2014 at 09:37:13AM -0700, Glenn Linderman wrote:
>>>> On 8/22/2014 8:51 AM, Oleg Broytman wrote:
>>>>> What encoding does have a text file (an HTML, to be precise) with
>>>>> text in utf-8, ads in cp1251 (ad blocks were included from different
>>>>> files) and comments in koi8-r?
>>>>> Well, I must admit the HTML was rather an exception, but having a
>>>>> text file with some strange characters (binary strings, or paragraphs
>>>>> in different encodings) is not that exceptional.
>>>>
>>>> That's not a text file. That's a binary file containing (hopefully
>>>> delimited, and documented) sections of encoded text in different
>>>> encodings.
>>>
>>> Allow me to disagree. For me, this is a text file which I can (and
>>> do) view with a pager, edit with a text editor, list on a console,
>>> search with grep and so on. If it is not a text file by strict Python3
>>> standards then these standards are too strict for me. Either I find a
>>> simple workaround in Python3 to work with such texts or find a different
>>> tool. I cannot avoid such files because my reality is much more complex
>>> than strict text/binary dichotomy in Python3.
>>
>> I was not declaring your file not to be a "text file" from any
>> definition obtained from Python3 documentation, just from a common
>> sense definition of "text file".
>
> And in my opinion those files are perfect text. The files consist of
> lines separated by EOL characters (not necessary EOL characters of my OS
> because it could be a text file produced in a different OS), lines
> consist of words and words of characters.


Until you know or can deduce the encoding of a file, it is binary. If it 
has multiple, different, embedded encodings of text, it is still binary. 
In my opinion. So these are just opinions, and naming conventions. If 
you call it text, you have a different definition of text file than I do.





>> Looking at it from Python3, though, it is clear that when opening a
>> file in "text" mode, an encoding may be specified or will be
>> assumed.  That is one encoding, applying to the whole file, not 3
>> encodings, with declarations on when to switch between them. So I
>> think, in general, Python3 assumes or defines a definition of text
>> file that matches my "common sense" definition.
>
> I don't have problems with Python3 text. I have problems with Python3
> trying to get rid of byte strings and treating bytes as strict non-text.


Python3 is not trying to get rid of byte strings. But to some extent, it 
is wanting to treat bytes as non-text... bytes can be encoded text, but 
is not text until it is decoded. There is some processing that can be 
done on encoded text, but it has to be done differently (in many cases) 
than processing done on (non-encoded) text.


One difference is that the interpretation of what character is what varies 
from encoding to encoding, so if the processing requires understanding 
the characters, then the character encoding must be known.


On the other hand, if it suffices to detect blocks of opaque text 
delimited by a known set of delimiter codes (EOL: CR, LF, combinations 
thereof) then that can be done relatively easily on binary, as long as 
the encoding doesn't have data puns where a multibyte encoded character 
might contain the code for the delimiter as one of the bytes of the code 
for the character.
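
A small illustration of such a data pun, using UTF-16 as the offending encoding. U+010A is chosen only because its low byte equals the LF byte; the sample text is otherwise arbitrary.

```python
sample = '\u010aa\nb'          # 'Ċa', a line break, then 'b'

# UTF-8 never reuses ASCII byte values inside a multibyte character,
# so splitting the raw bytes on LF finds exactly the one real line break.
utf8_lines = sample.encode('utf-8').split(b'\n')

# In UTF-16-LE, U+010A is encoded as 0A 01, so a byte-level split
# cuts that character in half and reports a line break that isn't there.
utf16_lines = sample.encode('utf-16-le').split(b'\n')
```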



>> On the other hand, Python3 provides various facilities for working
>> with such files.
>>
>> The first I'll mention is the one that follows from my description
>> of what your file really is: Python3 allows opening files in binary
>> mode, and then decoding various sections of it using whatever
>> encoding you like, using the bytes.decode() operation on various
>> sections of the file. Determination of which sections are in which
>> encodings is beyond the scope of this description of the technique,
>> and is application dependent.
>
> This is perhaps the most promising approach. If I can open a text
> file in binary mode, iterate it line by line, split every line of
> non-ascii bytes with .split() and process them that'd satisfy my needs.
> But still there are dragons. If I read a filename from such file I
> read it as bytes, not str, so I can only use low-level APIs to
> manipulate with those filenames. Pity.


If the file names are in an unknown encoding, both in the directory and 
in the encoded text in the file listing, then unless you can deduce the 
encoding, you would be limited to doing manipulations with file APIs 
that support bytes, the low-level ones, yes.  If you can deduce the 
encoding, then you are freed from that limitation.
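
A sketch of the binary-mode technique for the utf-8 / cp1251 / koi8-r file described earlier in the thread. How the encoding of each section is deduced is the application-dependent part; here it is simply assumed to be known per line, and an in-memory buffer stands in for a file opened with mode 'rb'.

```python
import io

blob = ('заголовок\n'.encode('utf-8')
        + 'реклама\n'.encode('cp1251')
        + 'комментарий\n'.encode('koi8-r'))

# Assumption for this example: the application already knows
# which encoding each line uses.
encodings = ['utf-8', 'cp1251', 'koi8-r']

with io.BytesIO(blob) as f:
    # Iterating a binary file yields bytes lines split on LF.
    lines = [raw.decode(enc) for raw, enc in zip(f, encodings)]
```

Lines whose encoding cannot be deduced can simply stay bytes and be passed to the bytes-accepting APIs.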



> Let see a perfectly normal situation I am quite often in. A person
> sent me a directory full of MP3 files. The transport doesn't matter; it
> could be FTP, or rsync, or a zip file sent by email, or bittorrent. What
> matters is that filenames and content are in alien encodings. Most often
> it's cp1251 (the encoding used in Russian Windows) but can be koi8 

Re: [Python-Dev] Bytes path support

2014-08-22 Thread Oleg Broytman
On Fri, Aug 22, 2014 at 10:09:21AM -0700, Glenn Linderman 
 wrote:
> On 8/22/2014 9:52 AM, Oleg Broytman wrote:
> >On Fri, Aug 22, 2014 at 09:37:13AM -0700, Glenn Linderman 
> > wrote:
> >>On 8/22/2014 8:51 AM, Oleg Broytman wrote:
> >>>What encoding does have a text file (an HTML, to be precise) with
> >>>text in utf-8, ads in cp1251 (ad blocks were included from different
> >>>files) and comments in koi8-r?
> >>>Well, I must admit the HTML was rather an exception, but having a
> >>>text file with some strange characters (binary strings, or paragraphs
> >>>in different encodings) is not that exceptional.
> >>That's not a text file. That's a binary file containing (hopefully
> >>delimited, and documented) sections of encoded text in different
> >>encodings.
> >Allow me to disagree. For me, this is a text file which I can (and
> >do) view with a pager, edit with a text editor, list on a console,
> >search with grep and so on. If it is not a text file by strict Python3
> >standards then these standards are too strict for me. Either I find a
> >simple workaround in Python3 to work with such texts or find a different
> >tool. I cannot avoid such files because my reality is much more complex
> >than strict text/binary dichotomy in Python3.
> 
> I was not declaring your file not to be a "text file" from any
> definition obtained from Python3 documentation, just from a common
> sense definition of "text file".

   And in my opinion those files are perfect text. The files consist of
lines separated by EOL characters (not necessarily the EOL characters of
my OS, because the file could have been produced on a different OS);
lines consist of words, and words of characters.

> Looking at it from Python3, though, it is clear that when opening a
> file in "text" mode, an encoding may be specified or will be
> assumed.  That is one encoding, applying to the whole file, not 3
> encodings, with declarations on when to switch between them. So I
> think, in general, Python3 assumes or defines a definition of text
> file that matches my "common sense" definition.

   I don't have problems with Python3 text. I have problems with Python3
trying to get rid of byte strings and treating bytes as strict non-text.

> On the other hand, Python3 provides various facilities for working
> with such files.
> 
> The first I'll mention is the one that follows from my description
> of what your file really is: Python3 allows opening files in binary
> mode, and then decoding various sections of it using whatever
> encoding you like, using the bytes.decode() operation on various
> sections of the file. Determination of which sections are in which
> encodings is beyond the scope of this description of the technique,
> and is application dependent.

   This is perhaps the most promising approach. If I can open a text
file in binary mode, iterate it line by line, split every line of
non-ascii bytes with .split() and process them that'd satisfy my needs.
   But still there are dragons. If I read a filename from such a file I
read it as bytes, not str, so I can only use low-level APIs to
manipulate those filenames. Pity.
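
   A minimal sketch of that approach, assuming (hypothetically) that
fields are separated by ASCII whitespace and that a short list of
candidate encodings is tried in order. The helper names and the
candidate list are illustrative only; nothing in this thread fixes them.

```python
import io

# Hypothetical candidate list; order matters because single-byte codecs
# such as cp1251 accept almost any byte sequence, so UTF-8 is tried first.
CANDIDATES = ("utf-8", "cp1251")


def decode_guess(chunk: bytes) -> str:
    """Decode with the first candidate encoding that accepts the bytes."""
    for encoding in CANDIDATES:
        try:
            return chunk.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Nothing fit: keep the undecodable bytes as lone surrogates.
    return chunk.decode("utf-8", "surrogateescape")


def fields_per_line(stream):
    """Yield the decoded fields of each line of a binary stream."""
    for line in stream:          # binary files iterate line by line
        yield [decode_guess(field) for field in line.split()]


# In-memory stand-in for open(path, "rb").
sample = io.BytesIO(
    "привет мир\n".encode("utf-8") + "привет\n".encode("cp1251")
)
rows = list(fields_per_line(sample))
```

Guessing per field is only a heuristic: cp1251 and koi8-r cannot be told
apart this way, so a real tool would need extra knowledge about the data.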

   Let's see a perfectly normal situation I am quite often in. A person
sent me a directory full of MP3 files. The transport doesn't matter; it
could be FTP, or rsync, or a zip file sent by email, or bittorrent. What
matters is that filenames and content are in alien encodings. Most often
it's cp1251 (the encoding used in Russian Windows) but can be koi8 or
utf8. There is a playlist among the files -- a text file that lists MP3
files, every file on a single line; usually with full paths
("C:\Audio\some.mp3").
   Now I want to read filenames from the file and process the filenames
(strip paths) and files (verify the existence of files, or renumber the
files, or extract ID3 tags [Russian ID3 tags, whatever the ID3 standard
says, are also in cp1251 or utf-8 encoding]...whatever). I don't know the
encoding of the playlist but I know it corresponds to the encoding of
filenames so I can expect those files exist on my filesystem; they have
strange-looking unreadable names but they exist.
   Just a small example of why I do want to process filenames from a
text file in an alien encoding. Without knowing the encoding in advance.
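
   A rough sketch of the playlist case, staying in bytes from start to
finish so the encoding never has to be known. It assumes a POSIX
filesystem that accepts arbitrary bytes in names; the function name and
the sample data are made up for illustration.

```python
import ntpath
import os
import tempfile


def playlist_entries(playlist_bytes: bytes):
    """Yield the bare file name (as bytes) of every non-empty line."""
    for line in playlist_bytes.splitlines():
        line = line.strip()
        if line:
            # ntpath parses "C:\..." paths on any OS and accepts bytes.
            yield ntpath.basename(line)


# Demo data: a file whose name is cp1251-encoded bytes, plus a playlist
# that refers to it by a full Windows path.
name = "песня.mp3".encode("cp1251")
playlist = b"C:\\Audio\\" + name + b"\r\n"

with tempfile.TemporaryDirectory() as tmp:
    tmp_b = os.fsencode(tmp)
    open(os.path.join(tmp_b, name), "wb").close()
    # os.path functions return and accept bytes when given bytes.
    found = [entry for entry in playlist_entries(playlist)
             if os.path.exists(os.path.join(tmp_b, entry))]
```

The names stay unreadable, but stripping paths and checking existence
work without ever decoding them.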

> The second is to specify an error handler, that, like you, is
> trained to recognize the other encodings and convert them
> appropriately. I'm not aware that such an error handler has been or
> could be written, myself not having your training.
> 
> The third is to specify the UTF-8 with the surrogate escape error
> handler. This allows non-UTF-8 codes to be loaded into memory. You,
> or algorithms as smart as you, could perhaps be developed to detect
> and manipulate the resulting "lone surrogate" codes in meaningful
> ways, or could simply allow them to ride along without
> interpretation, and be emitted as the original, into other files.

   Yes, these are different workarounds.

Oleg.
-- 
 Oleg Broytman

Re: [Python-Dev] Bytes path support

2014-08-22 Thread Glenn Linderman

On 8/22/2014 9:52 AM, Oleg Broytman wrote:
> On Fri, Aug 22, 2014 at 09:37:13AM -0700, Glenn Linderman wrote:
> > On 8/22/2014 8:51 AM, Oleg Broytman wrote:
> > > What encoding does have a text file (an HTML, to be precise) with
> > > text in utf-8, ads in cp1251 (ad blocks were included from different
> > > files) and comments in koi8-r?
> > > Well, I must admit the HTML was rather an exception, but having a
> > > text file with some strange characters (binary strings, or paragraphs
> > > in different encodings) is not that exceptional.
> > That's not a text file. That's a binary file containing (hopefully
> > delimited, and documented) sections of encoded text in different
> > encodings.
>
> Allow me to disagree. For me, this is a text file which I can (and
> do) view with a pager, edit with a text editor, list on a console,
> search with grep and so on. If it is not a text file by strict Python3
> standards then these standards are too strict for me. Either I find a
> simple workaround in Python3 to work with such texts or find a different
> tool. I cannot avoid such files because my reality is much more complex
> than strict text/binary dichotomy in Python3.
>
> Oleg.


I was not declaring your file not to be a "text file" from any 
definition obtained from Python3 documentation, just from a common sense 
definition of "text file".


Looking at it from Python3, though, it is clear that when opening a file 
in "text" mode, an encoding may be specified or will be assumed.  That 
is one encoding, applying to the whole file, not 3 encodings, with 
declarations on when to switch between them. So I think, in general, 
Python3 assumes or defines a definition of text file that matches my 
"common sense" definition. Also, if it is an HTML file, I doubt the 
browser will use multiple different encodings when interpreting it, so 
it is not clear that the file is of practical use for its intended 
purpose if it contains text in multiple different encodings, but is 
served using only a single encoding, unless there is javascript or some 
programming in the browser that reencodes the data.


On the other hand, Python3 provides various facilities for working with 
such files.


The first I'll mention is the one that follows from my description of 
what your file really is: Python3 allows opening files in binary mode, 
and then decoding various sections of it using whatever encoding you 
like, using the bytes.decode() operation on various sections of the 
file. Determination of which sections are in which encodings is beyond 
the scope of this description of the technique, and is application 
dependent.


The second is to specify an error handler, that, like you, is trained to 
recognize the other encodings and convert them appropriately. I'm not 
aware that such an error handler has been or could be written, myself 
not having your training.
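
As a hedged illustration only, such a handler can at least be sketched
with codecs.register_error(). The handler name "cp1251-fallback" and the
choice of cp1251 as the single fallback are assumptions of this sketch;
it recognizes nothing, it just reinterprets whatever UTF-8 rejects.

```python
import codecs


def cp1251_fallback(exc):
    """Decode the bytes UTF-8 rejected as cp1251 instead of failing."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # errors="replace" because a few byte values are undefined in cp1251.
    return bad.decode("cp1251", "replace"), exc.end


# "cp1251-fallback" is a made-up handler name registered for this sketch.
codecs.register_error("cp1251-fallback", cp1251_fallback)

mixed = "текст ".encode("utf-8") + "реклама".encode("cp1251")
text = mixed.decode("utf-8", errors="cp1251-fallback")
```

This is a heuristic: a cp1251 byte pair that happens to form valid UTF-8
would be decoded as UTF-8, and koi8-r sections would come out wrong.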


The third is to specify the UTF-8 with the surrogate escape error 
handler. This allows non-UTF-8 codes to be loaded into memory. You, or 
algorithms as smart as you, could perhaps be developed to detect and 
manipulate the resulting "lone surrogate" codes in meaningful ways, or 
could simply allow them to ride along without interpretation, and be 
emitted as the original, into other files.
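
A small sketch of that third technique, using made-up sample data: the
bytes that are not valid UTF-8 ride along as lone surrogates and are
emitted unchanged when the text is encoded again with the same handler.

```python
# UTF-8 text followed by koi8-r bytes that are not valid UTF-8.
raw = "имя".encode("utf-8") + b" " + "файл".encode("koi8-r")

# Undecodable bytes become lone surrogates U+DC80..U+DCFF, not errors.
text = raw.decode("utf-8", errors="surrogateescape")
smuggled = [ch for ch in text if "\udc80" <= ch <= "\udcff"]

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", errors="surrogateescape")
```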


There may be other techniques that I am not aware of.

Glenn
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Bytes path support

2014-08-22 Thread Oleg Broytman
On Fri, Aug 22, 2014 at 09:37:13AM -0700, Glenn Linderman 
 wrote:
> On 8/22/2014 8:51 AM, Oleg Broytman wrote:
> >What encoding does have a text file (an HTML, to be precise) with
> >text in utf-8, ads in cp1251 (ad blocks were included from different
> >files) and comments in koi8-r?
> >Well, I must admit the HTML was rather an exception, but having a
> >text file with some strange characters (binary strings, or paragraphs
> >in different encodings) is not that exceptional.
> That's not a text file. That's a binary file containing (hopefully
> delimited, and documented) sections of encoded text in different
> encodings.

   Allow me to disagree. For me, this is a text file which I can (and
do) view with a pager, edit with a text editor, list on a console,
search with grep and so on. If it is not a text file by strict Python3
standards then these standards are too strict for me. Either I find a
simple workaround in Python3 to work with such texts or find a different
tool. I cannot avoid such files because my reality is much more complex
than strict text/binary dichotomy in Python3.

Oleg.
-- 
     Oleg Broytman            http://phdru.name/            p...@phdru.name
   Programmers don't die, they just GOSUB without RETURN.


Re: [Python-Dev] Bytes path support

2014-08-22 Thread Glenn Linderman

On 8/22/2014 8:51 AM, Oleg Broytman wrote:
> What encoding does have a text file (an HTML, to be precise) with
> text in utf-8, ads in cp1251 (ad blocks were included from different
> files) and comments in koi8-r?
> Well, I must admit the HTML was rather an exception, but having a
> text file with some strange characters (binary strings, or paragraphs
> in different encodings) is not that exceptional.

That's not a text file. That's a binary file containing (hopefully 
delimited, and documented) sections of encoded text in different encodings.


If it is named .html and served by the server as UTF-8, then the server 
is misconfigured, or the file is incorrectly populated.


Re: [Python-Dev] Bytes path support

2014-08-22 Thread Oleg Broytman
Hi!

On Sat, Aug 23, 2014 at 01:19:14AM +1000, Steven D'Aprano  
wrote:
> On Fri, Aug 22, 2014 at 04:42:29AM +0200, Oleg Broytman wrote:
> > On Thu, Aug 21, 2014 at 05:30:14PM -0700, Chris Barker - NOAA Federal 
> >  wrote:
> > > This brings up the other key problem. If file names are (almost)
> > > arbitrary bytes, how do you write one to/read one from a text file
> > > with a particular encoding? ( or for that matter display it on a
> > > terminal)
> > 
> >There is no such thing as an encoding of text files.
> 
> I don't understand this comment. It seems to me that *text* files have 
> to have an encoding, otherwise you can't interpret the contents as text. 

   What encoding does a text file (an HTML file, to be precise) have when
it contains text in utf-8, ads in cp1251 (ad blocks were included from
different files) and comments in koi8-r?
   Well, I must admit the HTML was rather an exception, but having a
text file with some strange characters (binary strings, or paragraphs
in different encodings) is not that exceptional.

> Files, of course, only contain bytes, but to be treated as text you 
> need some way of transforming byte N to char C (or multiple bytes to C), 
> which is an encoding.

   But you don't need to treat the entire file in one encoding. Strange
characters are clearly visible so you can interpret them differently. I
am very much trained to distinguish koi8, cp1251 and utf-8 texts; I
cannot translate them mentally but I can recognize them.

> Perhaps you just mean that encodings are not recorded in the text file 
> itself?

   Yes, that too.

> To answer Chris' question, you typically cannot include arbitrary 
> bytes in text files, and displaying them to the user is likewise 
> problematic

   As a person who views utf-8 files in koi8 fonts (and vice versa) every
day I'd argue. (-:

Oleg.
-- 
     Oleg Broytman            http://phdru.name/            p...@phdru.name
   Programmers don't die, they just GOSUB without RETURN.


Re: [Python-Dev] Bytes path support

2014-08-22 Thread Martin v. Löwis
On 22.08.14 01:56, Glenn Linderman wrote:
> 0 and 47 are certainly originally derived from ASCII.  However, there
> could be lots of encodings that are not ASCII compatible (but in
> practice, probably very few, since most encodings _are_ ASCII
> compatible) that could be fit those constraints.
> 
> So while as a technical matter, Cameron is correct that Unix only treats
> 0 & 47 as special, and that is insufficient to declare that encodings
> must be ASCII compatible, as a practical matter, since most encodings
> are ASCII compatible anyway, it would be hard to find very many that
> could be used successfully with Unix file names that are not ASCII
> compatible, that could comply with the 0 & 47 requirements.

More importantly, existing encodings that are distinctively *not*
ASCII compatible (e.g. the EBCDIC ones) do not put the slash into 47
(instead, it is at 97 in EBCDIC; 47 is the BEL control character).
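
This can be checked from Python, taking the cp037 codec as one
representative EBCDIC code page (an assumption of this sketch; other
EBCDIC variants may place some characters differently).

```python
# cp037 is one EBCDIC code page shipped with Python; others may differ.
slash_in_ebcdic = "/".encode("cp037")        # where EBCDIC puts the slash
byte_47_in_ebcdic = b"\x2f".decode("cp037")  # what byte 47 means there

print(slash_in_ebcdic[0], repr(byte_47_in_ebcdic))
```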

There are boundary cases, of course. VISCII is "mostly ASCII
compatible", putting graphic characters into some of the control
characters, but using those that aren't used in ASCII, anyway.

And then there is the YUSCII family of encodings, which definitely
is not ASCII compatible, as it does not contain Latin characters,
but still puts the / into 47 (and also keeps the ASCII digits and
special characters in their positions). There is also SI 960, which
has the slash, the ASCII uppercase letters, digits and special
characters, but replaces the lower-case characters with Hebrew.

So yes, Unix doesn't mandate ASCII-compatible encodings; but it
still mandates ASCII-inspired encodings. I wonder how you would
run "gcc", though, on an SI 960 system; you'd have to type
חדד.

Regards,
Martin



Re: [Python-Dev] Bytes path support

2014-08-22 Thread Steven D'Aprano
On Fri, Aug 22, 2014 at 04:42:29AM +0200, Oleg Broytman wrote:
> On Thu, Aug 21, 2014 at 05:30:14PM -0700, Chris Barker - NOAA Federal 
>  wrote:
> > This brings up the other key problem. If file names are (almost)
> > arbitrary bytes, how do you write one to/read one from a text file
> > with a particular encoding? ( or for that matter display it on a
> > terminal)
> 
>There is no such thing as an encoding of text files.

I don't understand this comment. It seems to me that *text* files have 
to have an encoding, otherwise you can't interpret the contents as text. 
Files, of course, only contain bytes, but to be treated as text you 
need some way of transforming byte N to char C (or multiple bytes to C), 
which is an encoding.

Perhaps you just mean that encodings are not recorded in the text file 
itself?

To answer Chris' question, you typically cannot include arbitrary 
bytes in text files, and displaying them to the user is likewise 
problematic. The usual solution is to support some form of 
escaping, like \t, &#x0A; or %0D, to give a few examples.
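
For instance, the %0D style (percent-encoding) is available in the
standard library via urllib.parse. The sample file name below is
invented for the sketch.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

raw_name = b"caf\xe9\nmenu.txt"        # a Latin-1 byte plus a newline

escaped = quote_from_bytes(raw_name)    # pure-ASCII str, safe on one line
recovered = unquote_to_bytes(escaped)   # the original bytes again
```

The escaped form can be written to a text file in any ASCII-compatible
encoding and read back without loss.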


-- 
Steven


Re: [Python-Dev] Bytes path support

2014-08-21 Thread Marko Rauhamaa
Nick Coghlan :

> Python 3 says it's *our* problem to deal with on behalf of our
> developers.

<http://www.imdb.com/title/tt0120623/quotes?item=qt0353406>

Flik: I was just trying to help.

Mr. Soil: Then help us; *don't* help us.


Marko


Re: [Python-Dev] Bytes path support

2014-08-21 Thread Stephen J. Turnbull
Chris Barker - NOAA Federal writes:

 > This brings up the other key problem. If file names are (almost)
 > arbitrary bytes, how do you write one to/read one from a text file
 > with a particular encoding? ( or for that matter display it on a
 > terminal)

"Very carefully."

But this is strictly from need.  *Nobody* (with the exception of the
crackers who like to name their programs things like "\u0007") *wants*
to do this.  Real people want to name their files in some human
language they understand, and spell it in the usual way, and encode
those characters as bytes in the usual way.

Decoding those characters in the usual way and getting nonsense is the
exceptional case, and it must be the application's or user's problem
to decide what to do.  They know where they got the file from and
usually have some idea of what its name should look like.  Python
doesn't, so Python cannot solve it for them.

For that reason, I believe that Python's "normal"/high-level approach
to file handling should treat file names as (human-oriented) text.  Of
course Python should be able to handle bytes straight from the disk,
but most programmers shouldn't have to.
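
Both levels already exist side by side in the os module, as this small
sketch shows (the file name is arbitrary): a str argument gives text
names back, a bytes argument gives the raw bytes.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "report.txt"), "w").close()

    as_text = os.listdir(tmp)                 # str in, str out
    as_bytes = os.listdir(os.fsencode(tmp))   # bytes in, bytes out
```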

 > And people still want to say posix isn't broken in this regard?

Deal with it, bro'.







Re: [Python-Dev] Bytes path support

2014-08-21 Thread Oleg Broytman
On Thu, Aug 21, 2014 at 05:30:14PM -0700, Chris Barker - NOAA Federal 
 wrote:
> This brings up the other key problem. If file names are (almost)
> arbitrary bytes, how do you write one to/read one from a text file
> with a particular encoding? ( or for that matter display it on a
> terminal)

   There is no such thing as an encoding of text files. So we just
write those bytes to the file or output them to the terminal. I often do
that. My filesystems are full of files with names and content in
at least 3 different encodings - koi8-r, utf-8 and cp1251. So I open a
terminal with koi8 or utf-8 locale and fonts and some files always look
weird. But however weird they are it's possible to work with them.

   The bigger problem is line feeds. A filename with linefeeds can be
put into a text file, but cannot be read back. So one has to transform
such names. Usually s/\\//g and s/\n/\\n/g is enough. (-:
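
   A sketch of such a transformation on bytes names. It assumes the
intent is to double backslashes and turn line feeds into a backslash
followed by "n"; the function names are invented for illustration.

```python
def escape_name(name: bytes) -> bytes:
    """Make a file name safe to store one per line."""
    # Backslashes first, so the ones added for newlines are not doubled.
    return name.replace(b"\\", b"\\\\").replace(b"\n", b"\\n")


def unescape_name(escaped: bytes) -> bytes:
    """Invert escape_name by scanning for backslash pairs."""
    out = bytearray()
    i = 0
    while i < len(escaped):
        if escaped[i:i + 1] == b"\\" and i + 1 < len(escaped):
            nxt = escaped[i + 1:i + 2]
            out += b"\n" if nxt == b"n" else nxt
            i += 2
        else:
            out += escaped[i:i + 1]
            i += 1
    return bytes(out)


# A name with a real line feed and a literal backslash followed by "n".
tricky = b"line1\nline2\\n.mp3"
stored = escape_name(tricky)
```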

> And people still want to say posix isn't broken in this regard?

   Not at all! And broken or not broken it's what I (for many different
reasons) prefer to use for my desktops, servers, notebooks, routers and
smartphones, so if Python stood in my way I'd rather switch to a
different tool.

Oleg.
-- 
     Oleg Broytman            http://phdru.name/            p...@phdru.name
   Programmers don't die, they just GOSUB without RETURN.


Re: [Python-Dev] Bytes path support

2014-08-21 Thread Chris Barker - NOAA Federal
> Does Unix even support UTF-16 as an encoding? I suppose, these days, it 
> probably does, for reading contents of files created on Windows, etc.

I don't think Unix supports any encodings at all for the _contents_ of
files -- that's up to applications. Of course the command line text
processing tools need to know -- I'm guessing those are never going to
work w/UTF-16!

"System encoding" is a nice idea, but pretty much worthless. Only
helpful for files created and processed on the same system -- not rare
for that not to be the case.

This brings up the other key problem. If file names are (almost)
arbitrary bytes, how do you write one to/read one from a text file
with a particular encoding? ( or for that matter display it on a
terminal)

And people still want to say posix isn't broken in this regard?

Sigh.

-Chris


Re: [Python-Dev] Bytes path support

2014-08-21 Thread Oleg Broytman
On Thu, Aug 21, 2014 at 05:00:02PM -0700, Glenn Linderman 
 wrote:
> On 8/21/2014 3:42 PM, Paul Moore wrote:
> >I wonder how badly a Unix system would break if you specified UTF16 as
> >the system encoding...?
> 
> Does Unix even support UTF-16 as an encoding?

   As an encoding of file's content? Certainly yes. As a locale
encoding? Definitely no.

Oleg.
-- 
     Oleg Broytman            http://phdru.name/            p...@phdru.name
   Programmers don't die, they just GOSUB without RETURN.


Re: [Python-Dev] Bytes path support

2014-08-21 Thread Glenn Linderman

On 8/21/2014 3:54 PM, Antoine Pitrou wrote:
> On 21/08/2014 18:27, Cameron Simpson wrote:
> > As remarked, codes 0 (NUL) and 47 (ASCII slash code) _are_ special
> > to UNIX filename bytes strings.
>
> So you admit that POSIX mandates that file paths are expressed in an
> ASCII-compatible encoding after all? Good. I've nothing to add to your
> rant.
>
> Antoine.


0 and 47 are certainly originally derived from ASCII.  However, there 
could be lots of encodings that are not ASCII compatible (but in 
practice, probably very few, since most encodings _are_ ASCII 
compatible) that could fit those constraints.


So while as a technical matter, Cameron is correct that Unix only treats 
0 & 47 as special, and that is insufficient to declare that encodings 
must be ASCII compatible, as a practical matter, since most encodings 
are ASCII compatible anyway, it would be hard to find very many that 
could be used successfully with Unix file names that are not ASCII 
compatible, that could comply with the 0 & 47 requirements.


Re: [Python-Dev] Bytes path support

2014-08-21 Thread Glenn Linderman

On 8/21/2014 3:42 PM, Paul Moore wrote:
> I wonder how badly a Unix system would break if you specified UTF16 as
> the system encoding...?
> Paul


Does Unix even support UTF-16 as an encoding? I suppose, these days, it 
probably does, for reading contents of files created on Windows, etc. 
(Unicode was just gaining traction when I last used Unix in a 
significant manner; yes, my web host runs Linux, and I know enough to do 
what can be done there... but haven't experimented with encodings other 
than ASCII & UTF-8 on the web host, and don't intend to).


If it allows configuration of UTF-16 or UTF-32 as system encodings, I 
would consider that a bug, though, as too much of Unix predates Unicode, 
and would be likely to fail.


Re: [Python-Dev] Bytes path support

2014-08-21 Thread Nick Coghlan
On 22 Aug 2014 09:24, "Isaac Morland"  wrote:
> I think the real tension here is between the POSIX level where filenames
> are byte strings (except for \x00, which is reserved for string
> termination) where \x2F has special interpretation, and absolutely every
> application ever written, in every language, which wants filenames to be
> character strings.

That's one of the best summaries of the situation I've ever seen :)

Most languages (including Python 2) throw up their hands and say this is
the developer's problem to deal with. Python 3 says it's *our* problem to
deal with on behalf of our developers. The "surrogateescape" error handler
allows recalcitrant bytes to be dealt with relatively gracefully in most
situations. We don't quite cover *everything* yet (hence the complaints
from some of the folks that are experts at dealing with Python 2 Unicode
handling on POSIX systems), but the remaining problems are a lot more
tractable than the "teach every native English speaker everywhere how to
handle Unicode properly" problem.

Regards,
Nick.


Re: [Python-Dev] Bytes path support

2014-08-21 Thread Isaac Morland

On Thu, 21 Aug 2014, Chris Barker wrote:

> so they are "just byte strings", oh, except that you can't have a null, and
> the "slash" had better be code 47 (and vice versa). How is that different
> than "bytes-in-some-arbitrary-encoding-where-at-least
> the-slash-character-is-ascii-compatible"?


Actually, slash doesn't need to be code 47.  But no matter what code 47 
means outside of the context of a filename, it is the path arc separator 
byte (not character).


In fact, this isn't even entirely academic.  On a Mac OS X machine, go 
into Finder and try to create a directory called ":".  You'll get an error 
saying 'The name “:” can’t be used.'.  Now create a directory called "/". 
No problem, raising the question of what is going on at the filesystem 
level?


Answer:

$ ls -al
total 0
drwxr-xr-x   3 ijmorlan  staff   102 21 Aug 18:57 ./
drwxr-xr-x+ 80 ijmorlan  staff  2720 21 Aug 18:57 ../
drwxr-xr-x   2 ijmorlan  staff    68 21 Aug 18:57 :/

And of course in shell one would remove the directory with this:

rm -rf :

not:

rm -rf /

So in effect the file system path arc encoding on Mac OS X is UTF-8 
*except* that : is outlawed and / is encoded as \x3A rather than the usual 
\x2F.  Of course, the path arc separator byte (not character) remains \x2F 
as always.


Just for fun, there are contexts in which one can give a full path at the 
GUI level, where : is used as the path separator.  This is for historical 
reasons and presumably is the reason for the above-noted behaviour.


I think the real tension here is between the POSIX level where filenames 
are byte strings (except for \x00, which is reserved for string 
termination) where \x2F has special interpretation, and absolutely every 
application ever written, in every language, which wants filenames to be 
character strings.


Isaac Morland   CSCF Web Guru
DC 2554C, x36650                WWW Software Specialist


Re: [Python-Dev] Bytes path support

2014-08-21 Thread Antoine Pitrou


On 21/08/2014 18:27, Cameron Simpson wrote:
> As remarked, codes 0 (NUL) and 47 (ASCII slash code) _are_ special to UNIX
> filename bytes strings.


So you admit that POSIX mandates that file paths are expressed in an 
ASCII-compatible encoding after all? Good. I've nothing to add to your rant.


Antoine.




Re: [Python-Dev] Bytes path support

2014-08-21 Thread Paul Moore
On 21 August 2014 23:27, Cameron Simpson  wrote:
> That's not "ASCII compatible". That's "not all byte codes can be freely used
> without thought", and any multibyte coding will have to consider such things
> when embedding itself in another coding scheme.

I wonder how badly a Unix system would break if you specified UTF16 as
the system encoding...?
Paul


Re: [Python-Dev] Bytes path support

2014-08-21 Thread Chris Barker
On Wed, Aug 20, 2014 at 9:52 PM, Cameron Simpson  wrote:

> On 20Aug2014 16:04, Chris Barker - NOAA Federal
> wrote:
>
>> So really, people treat them as
>> "bytes-in-some-arbitrary-encoding-where-at-least the-slash-character-(and
>> maybe a couple others)-is-ascii-compatible"
>
> As someone who fought long and hard in the surrogate-escape listdir()
> wars, and was won over once the scheme was thoroughly explained to me, I
> take issue with these assertions: they are bogus or misleading.
>
> Firstly, POSIX filenames _are_ just byte strings. The only forbidden
> character is the NUL byte, which terminates a C string, and the only
> special character is the slash, which separates pathname components.
>

so they are "just byte strings", oh, except that you can't have a  null,
and the "slash" had better be code 47 (and vice versa). How is that
different than "bytes-in-some-arbitrary-encoding-where-at-least
the-slash-character-is-ascii-compatible"?

(sorry about the "maybe a couple others", I was too lazy to do my research
and be sure).

But my point is that python users want to be able to work with paths, and
paths on posix are not strictly strings with a clearly defined encoding,
but they are also not quite "just arbitrary bytes". So it would be nice if
we could have a pathlib that would work with these odd beasts. I've lost
track a bit as to whether the surrogate-escape solution allows this to all
work now. If it does, then great, sorry for the noise.
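
For what it's worth, here is a small sketch of how such a name can pass
through pathlib. It assumes a UTF-8 filesystem encoding and decodes
explicitly, which is roughly what os.fsdecode() does on POSIX; the
sample path is invented.

```python
from pathlib import PurePosixPath

raw = b"/music/\xef\xe5\xf1\xed\xff.mp3"   # cp1251 bytes, not valid UTF-8

# Assumption: the filesystem encoding is UTF-8.
as_text = raw.decode("utf-8", "surrogateescape")
path = PurePosixPath(as_text)

parent = str(path.parent)
suffix = path.suffix
# Lone surrogates pass through pathlib untouched, so the bytes come back.
round_trip = str(path).encode("utf-8", "surrogateescape")
```

The name cannot be printed meaningfully, but path operations work and
the original bytes are recoverable.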

> Second, a bare low level program cannot do _much_ more than pass them
> around.  It certainly can do things like compute their basename, or other
> path related operations.
>

only if you assume that pesky slash == 47 thing -- it's not much, but it's
not raw bytes either.

> The "bytes in some arbitrary encoding where at least the slash character
> (and maybe a couple others) is ascii compatible" notion is completely
> bogus. There's only one special byte, the slash (code 47). There's no
> OS-level need that it or anything else be ASCII compatible. I think
> characterizations such as the one quoted are actively misleading.
>
>

code 47 == "slash" is ascii compatible -- where else did the 47 value come
from?


> I think we'd all agree it is nice to have a system where filenames are all
> Unicode, but since POSIX/UNIX predates it by decades it is a bit late to
> ignore the reality for such systems.


well, the community could have gone to "if you want anything other than
ascii, make it utf-8" -- but alas, we're all a bunch of independent
thinkers.

But none of this is relevant -- systems in the wild do what they do --
clearly we all want Python to work with them as best it can.


> There's no _external_ "filesystem encoding" in the sense of something
> recorded in the filesystem that anyone can inspect. But there is the
> expressed locale settings, available at runtime to any program that cares
> to pay attention. It is a workable situation.
>

I haven't run into it, but it seems the folks that have don't think relying
on the locale setting is the least bit workable. If it were, we wouldn't be
having this discussion -- use the locale setting to decide how to decode
filenames -- done.

> Oh, and I reject Nick's characterisation of POSIX as "broken". It's
> perfectly internally consistent. It just doesn't match what he wants.
> (Indeed, what I want, and I'm a long time UNIX fanboy.)
>

bug or feature? you decide. Internal consistency is a good start, but it
punts the whole encoding issue to the client software, without giving it
the tools to do it right. I call that "really hard to work with" if not
broken.

-Chris


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Bytes path support

2014-08-21 Thread Cameron Simpson

On 21Aug2014 09:20, Antoine Pitrou  wrote:

> On 21/08/2014 00:52, Cameron Simpson wrote:
>> The "bytes in some arbitrary encoding where at least the slash character
>> (and maybe a couple others) is ascii compatible" notion is completely bogus.
>> There's only one special byte, the slash (code 47). There's no OS-level
>> need that it or anything else be ASCII compatible.
>
> Of course there is. Try to split a UTF-16-encoded file path on the
> byte 47 and you'll get a lot of garbage. So, yes, POSIX implicitly
> mandates an ASCII-compatible encoding for file paths.


[Rolls eyes.] Looking at the UTF-16 encoding, it looks like it also embeds NUL
bytes for various codes below 32768. How are they handled? As remarked, codes 0
(NUL) and 47 (ASCII slash code) _are_ special to UNIX filename byte strings.


If you imagine you can embed bare UTF-16 freely even excluding code 47, I think 
one of us is missing something.


That's not "ASCII compatible". That's "not all byte codes can be freely used 
without thought", and any multibyte coding will have to consider such things 
when embedding itself in another coding scheme.
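
A small sketch of the byte-level point being argued here, using only the
standard codecs. The sample character U+012F is an arbitrary choice for
illustration and does not come from the thread: its UTF-16 form happens to
contain the byte 47, and ASCII text in UTF-16 carries NUL bytes, whereas
UTF-8 never produces bytes 0 or 47 inside a multibyte sequence.

```python
# U+012F (LATIN SMALL LETTER I WITH OGONEK): an arbitrary sample character
# that contains no slash as far as a human reader is concerned.
name = "\u012f"

utf16 = name.encode("utf-16-le")   # two bytes: 0x2F 0x01
utf8 = name.encode("utf-8")        # two bytes: 0xC4 0xAF

# Does the encoded form contain the path separator byte (47)?
slash_in_utf16 = 47 in utf16
slash_in_utf8 = 47 in utf8

# Does plain ASCII text pick up NUL bytes when encoded?
nul_in_utf16 = 0 in "a".encode("utf-16-le")
nul_in_utf8 = 0 in "a".encode("utf-8")

print(slash_in_utf16, slash_in_utf8, nul_in_utf16, nul_in_utf8)
```

Whether one describes the resulting constraint as "ASCII compatible" or as
"bytes 0 and 47 are reserved" is exactly the disagreement in this exchange;
the sketch only shows which encodings collide with those two byte values.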


Cheers,
Cameron Simpson 

Microsoft:  Committed to putting the "backward" into "backward compatibility."


Re: [Python-Dev] Bytes path support

2014-08-21 Thread Stephen J. Turnbull
Marko Rauhamaa writes:

 > My point is that the poor programmer cannot ignore the possibility of
 > "funny" character sets.

*Poor* programmers do it all the time.  That's why Python codecs raise
when they encounter bytes they can't handle.

 > If Python tried to protect the programmer from that possibility,

I don't understand your point.  The existing interfaces aren't going
anywhere, and they're enough to do anything you need to do.  Although
there are a few radicals (like me in a past life :-) who might like to
see them go away in favor of opt-in to binary encoding via
surrogateescape error handling, nobody in their right mind supports
us.

The question here is not about going backward, it's about whether to
add new bytes APIs, and which ones.


Re: [Python-Dev] Bytes path support

2014-08-21 Thread Nick Coghlan
On 22 August 2014 00:12, Nick Coghlan  wrote:
> On 21 August 2014 23:58, Marko Rauhamaa  wrote:
>>
>> My point is that the poor programmer cannot ignore the possibility of
>> "funny" character sets. If Python tried to protect the programmer from
>> that possibility, the result might be even more intractable: how to act
>> on a file with a non-UTF-8 filename if you are unable to express it as
>> a text string?
>
> That's what the "surrogateescape" codec is for

Oops, that should say "codec error handler" (I got it right later in the post).

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] Bytes path support

2014-08-21 Thread Nick Coghlan
On 21 August 2014 23:58, Marko Rauhamaa  wrote:
>
> My point is that the poor programmer cannot ignore the possibility of
> "funny" character sets. If Python tried to protect the programmer from
> that possibility, the result might be even more intractable: how to act
> on a file with a non-UTF-8 filename if you are unable to express it as
> a text string?

That's what the "surrogateescape" codec is for - we use it by default
on most OS interfaces, and it's implicit in the use of "os.fsencode"
and "os.fsdecode". Starting with Python 3.5, it's also enabled on
sys.stdout by default, so that "print(os.listdir(dirname))" will pass
the original raw bytes through to the terminal the same way Python 2
does.
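
As a rough illustration of the round trip described above, here is a minimal
sketch that applies the "surrogateescape" error handler directly with the
UTF-8 codec. The sample filename is made up, and UTF-8 is assumed here only
so that the result does not depend on the locale; os.fsdecode and
os.fsencode apply the same idea with the filesystem encoding.

```python
# A made-up filename in Latin-1: the byte 0xE9 is not valid UTF-8 here.
raw = b"caf\xe9.txt"

# Decoding maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler turns the surrogate back into the byte.
restored = text.encode("utf-8", errors="surrogateescape")

print(ascii(text))       # shows the escaped code point
print(restored == raw)   # the original bytes survive the round trip
```

The str in the middle is not printable text in the usual sense; it is only a
carrier that lets the original bytes pass through str-based APIs.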

The docs could use additional details as to which interfaces do and
don't have surrogateescape enabled by default, but for the time being,
the description of the codec error handler just links out to the
original definition in PEP 383.

It may also be useful to have some tools for detecting and cleaning
strings containing surrogate escaped data, but there hasn't been a
concrete proposal along those lines as yet. Personally, I'm currently
waiting to see if the Fedora or OpenStack folks indicate a need for
such tools before proposing any additions.
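
To make the idea of "detecting" escaped data concrete, here is one possible
shape such a helper could take. Both function names are hypothetical; no
such functions exist in the standard library, and this is only a sketch
that relies on surrogateescape mapping bytes 0x80-0xFF to U+DC80-U+DCFF.

```python
def contains_escaped_bytes(text):
    """Return True if text holds code points produced by surrogateescape."""
    return any("\udc80" <= ch <= "\udcff" for ch in text)


def escaped_bytes(text):
    """Return the raw byte values smuggled into text, in order."""
    return bytes(ord(ch) - 0xDC00 for ch in text if "\udc80" <= ch <= "\udcff")


clean = "report.txt"
dirty = b"caf\xe9.txt".decode("utf-8", errors="surrogateescape")

print(contains_escaped_bytes(clean))
print(contains_escaped_bytes(dirty), escaped_bytes(dirty))
```

Detection like this is lossless; what to do with a name once it is flagged
is left to the application.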

Regards,
Nick.

>
>
> Marko



-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] Bytes path support

2014-08-21 Thread Marko Rauhamaa
"Martin v. Löwis" :

> I think the people defending the "Unix file names are just bytes" side
> often miss an important detail: displaying file names to the user, and
> allowing the user to enter file names.

The user interface is a real issue and needs to be addressed. It is
separate from the OS interface, though.

> A script that just needs to traverse a directory tree and look at
> files by certain criteria can easily do so without worrying about a
> text interpretation of the file names.

A single system often has file names that have been encoded with
different schemes. Only today, I have had to deal with the JIS character
table (<http://i.msdn.microsoft.com/cc305152.932%28en-us,MSDN.10%29.gif>) -- you
will notice that it doesn't have a backslash character. A coworker uses
ISO-8859-1.

I use UTF-8. UTF-8, of course, will refuse to deal with some byte
sequences.
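
A short sketch of what "different schemes on one system" looks like at the
byte level. The sample name is invented for illustration; the codecs used
(shift_jis, latin-1, utf-8) are the standard Python ones.

```python
# Bytes as a Shift-JIS user might have created them (invented example name).
raw = "テスト.txt".encode("shift_jis")

as_shift_jis = raw.decode("shift_jis")   # the name its creator intended
as_latin1 = raw.decode("latin-1")        # never fails, but shows other characters

# Strict UTF-8 refuses these bytes outright.
try:
    raw.decode("utf-8")
    utf8_accepts = True
except UnicodeDecodeError:
    utf8_accepts = False

print(as_shift_jis, ascii(as_latin1), utf8_accepts)
```

Nothing in the bytes themselves says which of these readings is the right
one, which is why the program cannot simply ignore the question.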

My point is that the poor programmer cannot ignore the possibility of
"funny" character sets. If Python tried to protect the programmer from
that possibility, the result might be even more intractable: how to act
on a file with a non-UTF-8 filename if you are unable to express it as
a text string?


Marko


Re: [Python-Dev] Bytes path support

2014-08-21 Thread Antoine Pitrou


On 21/08/2014 00:52, Cameron Simpson wrote:
> The "bytes in some arbitrary encoding where at least the slash character
> (and maybe a couple others) is ascii compatible" notion is completely bogus.
> There's only one special byte, the slash (code 47). There's no OS-level
> need that it or anything else be ASCII compatible.

Of course there is. Try to split a UTF-16-encoded file path on the byte
47 and you'll get a lot of garbage. So, yes, POSIX implicitly mandates
an ASCII-compatible encoding for file paths.


Regards

Antoine.




Re: [Python-Dev] Bytes path support

2014-08-21 Thread Nick Coghlan
On 21 August 2014 14:52, Cameron Simpson  wrote:
>
> Oh, and I reject Nick's characterisation of POSIX as "broken". It's
> perfectly internally consistent. It just doesn't match what he wants.
> (Indeed, what I want, and I'm a long time UNIX fanboy.)

The part that is broken is the idea that locale encodings are a viable
solution to conveying the appropriate encoding to use to talk to the
operating system. We've tried trusting them with Python 3, and they're
reliably wrong in certain situations. systemd is apparently better
than upstart at setting them correctly (e.g. for cron jobs), but even
it can't defend against an erroneous (or deliberate!) "LANG=C", or ssh
environment forwarding pushing a client's locale to the server. It's
worth looking through some of Armin Ronacher's complaints about Python
3 being broken on Linux, and seeing how many of them boil down to
"trusting the locale is wrong, Python 3 should just assume UTF-8 on
every POSIX system, the same way it does on Mac OS X". (I suspect
ShiftJIS, ISO-2022, et al users might object to that approach, but
it's at least a more viable choice now than it was back in 2008)

I still think we made the right call at least *trying* the idea of
trusting the locale encoding (since that's the officially supported
way of getting this information from the OS), and in many, many
situations it works fine. But I suspect we may eventually need to
resolve the technical issues currently preventing us from deciding to
ignore the environmental locale during interpreter startup and try
something different (such as always assuming UTF-8, or trying to force
C.UTF-8 if we detect the C locale, or looking for the systemd config
files and using those to set the OS encoding, rather than the
environmental locale).
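
For readers who want to see what the interpreter actually picked up from the
environment, a minimal sketch using standard library calls. The values
printed depend on the platform, the Python version and variables such as
LANG and LC_ALL, so no particular output is implied;
sys.getfilesystemencodeerrors() requires Python 3.6 or later.

```python
import locale
import sys

# Encoding and error handler used by os.fsencode() / os.fsdecode().
fs_encoding = sys.getfilesystemencoding()
fs_errors = sys.getfilesystemencodeerrors()

# Encoding the locale machinery reports, without altering the locale.
locale_encoding = locale.getpreferredencoding(False)

print("filesystem encoding:", fs_encoding)
print("filesystem error handler:", fs_errors)
print("locale preferred encoding:", locale_encoding)
```

Running this once normally and once with LANG=C set in the environment is a
quick way to observe the kind of discrepancy described above.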

Regards,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] Bytes path support

2014-08-21 Thread Martin v. Löwis
On 19.08.14 19:43, Ben Hoyt wrote:
>>>> The official policy is that we want them [support for bytes paths in
>>>> stdlib functions] to go away, but reality so far has not budged. We will
>>>> continue to hold our breath though. :-)
>>>
>>> Does that mean that new APIs should explicitly not support bytes? I'm
>>> thinking of os.scandir() (PEP 471), which I'm implementing at the
>>> moment. I was originally going to make it support bytes so it was
>>> compatible with listdir, but maybe that's a bad idea. Bytes paths are
>>> essentially broken on Windows.
>>
>> Bytes paths are "essential" on Unix, though, so I don't think we should
>> create new low-level APIs that don't support bytes.
> 
> Fair enough. I don't quite understand, though -- why is the "official
> policy" to kill something that's "essential" on *nix?

I think the people defending the "Unix file names are just bytes" side
often miss an important detail: displaying file names to the user, and
allowing the user to enter file names.

A script that just needs to traverse a directory tree and look at files
by certain criteria can easily do so without worrying about a text
interpretation of the file names.

When it comes to user interaction, it becomes apparent that, even on
Unix, file names are not just bytes. If you do "ls -l" in your shell,
the "system" (not just the kernel - but ultimately the terminal program,
which might be the console driver, or an X11 application) will interpret
the file name as having an encoding, and render them with a font.

So for Python, the question is: which of the use cases (processing
all files, vs. showing them to the user) should be better supported?
Python 3 took the latter as an answer, under the assumption that this
is the more common case.

Regards,
Martin




Re: [Python-Dev] Bytes path support

2014-08-21 Thread Nick Coghlan
On 21 August 2014 12:16, Stephen J. Turnbull  wrote:
> Nick Coghlan writes:
>
>  > One idea I had along those lines is a surrogatereplace error handler (
>  > http://bugs.python.org/issue22016) that emitted an ASCII question mark for
>  > each smuggled byte, rather than propagating the encoding problem.
>
> Please, don't.
>
> "Smuggled bytes" are not independent events.  They tend to be
> correlated *within* file names, and this handler would generate names
> whose human semantics get lost (and there *are* human semantics,
> otherwise the name would be str(some_counter)).  They tend to be
> correlated across file names, and this handler will generate multiple
> files with the same munged name (and again, the differentiating human
> semantics get lost).
>
> If you don't know the semantics of the intended file names, you can't
> generate good replacement names.  This has to be an application-level
> function, and often requires user intervention to get good names.
>
> If you want to provide helper functions that applications can use to
> clean names explicitly, that might be OK.

Yeah, I was thinking in the context of reproducing sys.stdout's
behaviour in Python 2, but that reproduces the bytes faithfully, so
'surrogateescape' already offers exactly the behaviour we want
(sys.stdout will have surrogateescape enabled by default in 3.5).

I'll keep pondering the question of possible helper functions in the
"string" module.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] Bytes path support

2014-08-21 Thread Oleg Broytman
Hi!

On Thu, Aug 21, 2014 at 02:52:19PM +1000, Cameron Simpson  
wrote:
> Oh, and I reject Nick's characterisation of POSIX as "broken". It's
> perfectly internally consistent. It just doesn't match what he
> wants. (Indeed, what I want, and I'm a long time UNIX fanboy.)
> 
> Cheers,
> Cameron Simpson 

   +1 from another Unix fanboy. Like an old wine, Unix becomes better
with years! ;-)

Oleg.
-- 
     Oleg Broytman            http://phdru.name/            p...@phdru.name
   Programmers don't die, they just GOSUB without RETURN.


Re: [Python-Dev] Bytes path support

2014-08-20 Thread Cameron Simpson

On 20Aug2014 16:04, Chris Barker - NOAA Federal  wrote:

>> but disallowing them in higher level
>> explicitly cross platform abstractions like pathlib.
>
> I think the trick here is that posix-using folks claim that filenames are
> just bytes, and indeed they can be passed around with a char*, so they seem
> to be.
>
> but you can't possibly do anything other than pass them around if you
> REALLY think they are just bytes.
>
> So really, people treat them as
> "bytes-in-some-arbitrary-encoding-where-at-least the-slash-character-(and
> maybe a couple others)-is-ascii-compatible"


As someone who fought long and hard in the surrogate-escape listdir() wars, and 
was won over once the scheme was thoroughly explained to me, I take issue with 
these assertions: they are bogus or misleading.


Firstly, POSIX filenames _are_ just byte strings. The only forbidden character 
is the NUL byte, which terminates a C string, and the only special character is 
the slash, which separates pathname components.


Second, a bare low level program cannot do _much_ more than pass them around.  
It certainly can do things like compute their basename, or other path related 
operations.
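
A minimal sketch of such path operations done purely on bytes, with no
decoding step anywhere. The sample path is invented, and posixpath is used
(rather than os.path) only so the example behaves the same on every
platform.

```python
import posixpath

# An invented path whose last component is not valid UTF-8.
raw = b"/srv/data/caf\xe9.txt"

# The stdlib path functions accept bytes and return bytes.
directory, base = posixpath.split(raw)
stem, extension = posixpath.splitext(base)

# The same basename computed by hand, knowing only that byte 47 separates components.
manual_base = raw.rsplit(b"/", 1)[-1]

print(directory, base, stem, extension, manual_base)
```

Note that splitext additionally relies on byte 46 meaning a dot, so it leans
on one more byte value than the split on the separator does.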


The "bytes in some arbitrary encoding where at least the slash character (and
maybe a couple others) is ascii compatible" notion is completely bogus. There's 
only one special byte, the slash (code 47). There's no OS-level need that it or 
anything else be ASCII compatible. I think characterisations such as the one 
quoted are actively misleading.


The way you get UTF-8 (or some other encoding, fortunately getting less and 
less common) is by convention: you decide in your environment to work in some 
encoding (say utf-8) via the locale variables, and all your user-facing text 
gets used in UTF-8 encoding form when turned into bytes for the filename calls 
because your text<->bytes methods say to do so.


I think we'd all agree it is nice to have a system where filenames are all 
Unicode, but since POSIX/UNIX predates it by decades it is a bit late to ignore 
the reality for such systems. I certainly think the Windows-side Babel of code
pages and multiple code systems is far far worse. (Disclaimer: not a Windows 
programmer, just based on hearing them complain.)


I'm +1000 on systems where the filesystem enforces Unicode (eg Plan 9 or Mac 
OSX, which forces a specific UTF-8 encoding in the bytes POSIX APIs - the 
underlying filesystems reject invalid byte sequences).


[...]

>> Antoine Pitrou wrote:
>> To elaborate specifically about pathlib, it doesn't handle bytes paths
>> but allows you to generate them if desired:
>> https://docs.python.org/3/library/pathlib.html#operators
>
> but that uses
> os.fsencode:  Encode filename to the filesystem encoding
>
> As I understand it, the whole problem with some posix systems is that there
> is NO filesystem encoding -- i.e. you can't know for sure what encoding a
> filename is in. So you need to be able to pass the bytes through as they
> are.


Yes and no. I made that argument too.

There's no _external_ "filesystem encoding" in the sense of something recorded 
in the filesystem that anyone can inspect. But there is the expressed locale 
settings, available at runtime to any program that cares to pay attention. It 
is a workable situation.


Oh, and I reject Nick's characterisation of POSIX as "broken". It's perfectly 
internally consistent. It just doesn't match what he wants. (Indeed, what I 
want, and I'm a long time UNIX fanboy.)


Cheers,
Cameron Simpson 

God is real, unless declared integer.   - Johan Montald, jo...@ingres.com


Re: [Python-Dev] Bytes path support

2014-08-20 Thread Stephen J. Turnbull
Nick Coghlan writes:

 > One idea I had along those lines is a surrogatereplace error handler (
 > http://bugs.python.org/issue22016) that emitted an ASCII question mark for
 > each smuggled byte, rather than propagating the encoding problem.

Please, don't.

"Smuggled bytes" are not independent events.  They tend to be
correlated *within* file names, and this handler would generate names
whose human semantics get lost (and there *are* human semantics,
otherwise the name would be str(some_counter)).  They tend to be
correlated across file names, and this handler will generate multiple
files with the same munged name (and again, the differentiating human
semantics get lost).

If you don't know the semantics of the intended file names, you can't
generate good replacement names.  This has to be an application-level
function, and often requires user intervention to get good names.

If you want to provide helper functions that applications can use to
clean names explicitly, that might be OK.
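
The collision concern above can be reproduced in a few lines. The function
below is a hypothetical stand-in for the proposed handler (it is not the
code from the tracker issue), and the two file names are invented.

```python
def replace_escaped_bytes(text):
    """Hypothetical cleanup: emit one '?' per surrogate-escaped byte."""
    return "".join("?" if "\udc80" <= ch <= "\udcff" else ch for ch in text)


# Two different invented names, neither of which is valid UTF-8.
raw_names = [b"r\xe9sum\xe9.txt", b"r\xe8sum\xe8.txt"]
decoded = [raw.decode("utf-8", errors="surrogateescape") for raw in raw_names]
munged = [replace_escaped_bytes(name) for name in decoded]

distinct_before = len(set(decoded))
distinct_after = len(set(munged))

print(munged, distinct_before, distinct_after)
```

Two distinct names go in and one munged name comes out, which is the loss
of differentiating information described above.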



Re: [Python-Dev] Bytes path support

2014-08-20 Thread Ben Hoyt
>> If scandir is low-level, and the low-level API's are the ones that should
>> support bytes paths, then scandir should support bytes paths.
>>
>> Is that what you meant to say?
>
> Yes. The discussions around PEP 471 *deferred* discussions of bytes
> and file descriptor support to their own RFEs (not needing a PEP),
> they didn't decide definitively not to support them. So Serhiy's
> thread is entirely pertinent to that question.
>
> Note that adding bytes support still *should not* hold up the initial
> PEP 471 implementation - it should be done as a follow on RFE.

I agree with this (that scandir is low level and should support
bytes). As it happens, I'm implementing bytes support as well -- what
with the path_t support in posixmodule.c and the listdir
implementation to go on, it's not really any harder. So I think we'll
have it right off the bat.
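
For reference, a small sketch of what bytes support looks like from the
caller's side, assuming the behaviour of os.scandir() as it later shipped in
the standard library (the with-statement form needs Python 3.6 or later).
The directory and file name are created just for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create one file so there is something to list.
    open(os.path.join(tmp, "example.txt"), "w").close()

    # A str path yields entries whose names are str ...
    with os.scandir(tmp) as entries:
        str_names = sorted(entry.name for entry in entries)

    # ... and a bytes path yields entries whose names are bytes, as with listdir.
    with os.scandir(os.fsencode(tmp)) as entries:
        bytes_names = sorted(entry.name for entry in entries)

print(str_names, bytes_names)
```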

BTW, the Windows implementation of PEP 471 is basically done, and the
POSIX implementation is written but not working yet. And then there's
tests and docs.

-Ben


Re: [Python-Dev] Bytes path support

2014-08-20 Thread Ethan Furman

On 08/20/2014 05:15 PM, Nick Coghlan wrote:
> On 21 August 2014 09:33, Ethan Furman wrote:
>> On 08/20/2014 03:31 PM, Nick Coghlan wrote:
>>> scandir is low level (the entire os module is low level). In fact, aside
>>> from pathlib, I'd consider pretty much every
>>> API we have that deals with paths to be low level - that's a large part of
>>> the reason we needed pathlib!
>>
>> If scandir is low-level, and the low-level API's are the ones that should
>> support bytes paths, then scandir should support bytes paths.
>>
>> Is that what you meant to say?
>
> Yes. The discussions around PEP 471 *deferred* discussions of bytes
> and file descriptor support to their own RFEs (not needing a PEP),
> they didn't decide definitively not to support them. So Serhiy's
> thread is entirely pertinent to that question.

Thanks for clearing that up.  I hate feeling confused.  ;)

> Note that adding bytes support still *should not* hold up the initial
> PEP 471 implementation - it should be done as a follow on RFE.

Agreed.

--
~Ethan~


Re: [Python-Dev] Bytes path support

2014-08-20 Thread Nick Coghlan
On 21 August 2014 09:33, Ethan Furman  wrote:
> On 08/20/2014 03:31 PM, Nick Coghlan wrote:
>> On 21 Aug 2014 08:19, "Greg Ewing" wrote:
>>>
>>>
>>> Antoine Pitrou wrote:


>>>> I think if you want low-level features (such as unconverted bytes paths
>>>> under POSIX), it is reasonable to point you to low-level APIs.
>>>
>>>
>>>
>>> The problem with scandir() in particular is that there is
>>> currently *no* low-level API exposed that gives the same
>>> functionality.
>>>
>>> If scandir() is not to support bytes paths, I'd suggest
>>> exposing the opendir() and readdir() system calls with
>>> bytes path support.
>>
>>
>> scandir is low level (the entire os module is low level). In fact, aside
>> from pathlib, I'd consider pretty much every
>> API we have that deals with paths to be low level - that's a large part of
>> the reason we needed pathlib!
>
>
> If scandir is low-level, and the low-level API's are the ones that should
> support bytes paths, then scandir should support bytes paths.
>
> Is that what you meant to say?

Yes. The discussions around PEP 471 *deferred* discussions of bytes
and file descriptor support to their own RFEs (not needing a PEP),
they didn't decide definitively not to support them. So Serhiy's
thread is entirely pertinent to that question.

Note that adding bytes support still *should not* hold up the initial
PEP 471 implementation - it should be done as a follow on RFE.

Cheers,
Nick.


-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] Bytes path support

2014-08-20 Thread Ethan Furman

On 08/20/2014 03:31 PM, Nick Coghlan wrote:
> On 21 Aug 2014 08:19, "Greg Ewing" <greg.ew...@canterbury.ac.nz> wrote:
>> Antoine Pitrou wrote:
>>> I think if you want low-level features (such as unconverted bytes paths
>>> under POSIX), it is reasonable to point you to low-level APIs.
>>
>> The problem with scandir() in particular is that there is
>> currently *no* low-level API exposed that gives the same
>> functionality.
>>
>> If scandir() is not to support bytes paths, I'd suggest
>> exposing the opendir() and readdir() system calls with
>> bytes path support.
>
> scandir is low level (the entire os module is low level). In fact, aside from
> pathlib, I'd consider pretty much every
> API we have that deals with paths to be low level - that's a large part of the
> reason we needed pathlib!

If scandir is low-level, and the low-level API's are the ones that should
support bytes paths, then scandir should support bytes paths.

Is that what you meant to say?

--
~Ethan~


Re: [Python-Dev] Bytes path support

2014-08-20 Thread Nick Coghlan
On 21 Aug 2014 09:06, "Chris Barker"  wrote:

>
> As I understand it, the whole problem with some posix systems is that
> there is NO filesystem encoding -- i.e. you can't know for sure what
> encoding a filename is in. So you need to be able to pass the bytes through
> as they are.
>
> (At least as I read Armin Ronacher's blog)

Armin lets his astonishment at the idea we'd expect Linux vendors to fix
their broken OS get the better of him at times - he thinks the
responsibility lies entirely with us to work around its quirks and
limitations :)

The "surrogateescape" codec error handler is our main answer to the
unreliability of the POSIX encoding model - fsdecode will squirrel away
undecodable bytes as lone surrogate code points (U+DC80 to U+DCFF), and
then fsencode will restore them again later. That
works for the simple round tripping case, but we currently lack good
default tools for "cleaning" strings that may contain surrogates (or even
scanning a string to see if surrogates are present).

One idea I had along those lines is a surrogatereplace error handler (
http://bugs.python.org/issue22016) that emitted an ASCII question mark for
each smuggled byte, rather than propagating the encoding problem.

Cheers,
Nick.

>
> -Chris
>
>
> --
>
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R(206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115   (206) 526-6317   main reception
>
> chris.bar...@noaa.gov
>


Re: [Python-Dev] Bytes path support

2014-08-20 Thread Chris Barker
>
>  but disallowing them in higher level
>> > explicitly cross platform abstractions like pathlib.
>>
>
I think the trick here is that posix-using folks claim that filenames are
just bytes, and indeed they can be passed around with a char*, so they seem
to be.

but you can't possibly do anything other than pass them around if you
REALLY think they are just bytes.

So really, people treat them as
"bytes-in-some-arbitrary-encoding-where-at-least the-slash-character-(and
maybe a couple others)-is-ascii-compatible"

If you assume that, then you could write a pathlib that would work. And in
practice, I expect a lot of code designed only for posix works that way.
But of course, this gets ugly if you go to a platform where filenames are
not "bytes-in-some-arbitrary-encoding-where-at-least
the-slash-character-(and maybe a couple others)-is-ascii-compatible", like
windows.

I'm not sure if it's worth having a pathlib, etc. that uses this assumption
-- but it could help us all write code that actually works with this screwy
lack of specification.

 Antoine Pitrou wrote:

> To elaborate specifically about pathlib, it doesn't handle bytes paths
> but allows you to generate them if desired:
> https://docs.python.org/3/library/pathlib.html#operators


but that uses

os.fsencode:  Encode filename to the filesystem encoding

As I understand it, the whole problem with some posix systems is that there
is NO filesystem encoding -- i.e. you can't know for sure what encoding a
filename is in. So you need to be able to pass the bytes through as they
are.
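
A minimal sketch of the pathlib behaviour under discussion: converting a
path object to bytes goes through os.fsencode. The path is invented and kept
ASCII-only so that the result does not depend on the filesystem encoding
(assuming, as is usual, that the encoding is ASCII compatible).

```python
import os
from pathlib import PurePosixPath

# An invented, ASCII-only path built with the "/" operator.
path = PurePosixPath("/tmp") / "report.txt"

# bytes(path) produces the bytes form of the path ...
as_bytes = bytes(path)

# ... which matches what os.fsencode gives for the same text.
same_as_fsencode = as_bytes == os.fsencode(str(path))

print(as_bytes, same_as_fsencode)
```

For names that are not ASCII, the bytes produced depend on the filesystem
encoding in effect, which is the concern raised above.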

(At least as I read Armin Ronacher's blog)

-Chris


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Python-Dev] Bytes path support

2014-08-20 Thread Nick Coghlan
On 21 Aug 2014 08:19, "Greg Ewing"  wrote:
>
> Antoine Pitrou wrote:
>>
>> I think if you want low-level features (such as unconverted bytes paths
>> under POSIX), it is reasonable to point you to low-level APIs.
>
>
> The problem with scandir() in particular is that there is
> currently *no* low-level API exposed that gives the same
> functionality.
>
> If scandir() is not to support bytes paths, I'd suggest
> exposing the opendir() and readdir() system calls with
> bytes path support.

scandir is low level (the entire os module is low level). In fact, aside
from pathlib, I'd consider pretty much every API we have that deals with
paths to be low level - that's a large part of the reason we needed pathlib!

Cheers,
Nick.

>
> --
> Greg
>


Re: [Python-Dev] Bytes path support

2014-08-20 Thread Greg Ewing

Antoine Pitrou wrote:
> I think if you want low-level features (such as unconverted bytes paths
> under POSIX), it is reasonable to point you to low-level APIs.


The problem with scandir() in particular is that there is
currently *no* low-level API exposed that gives the same
functionality.

If scandir() is not to support bytes paths, I'd suggest
exposing the opendir() and readdir() system calls with
bytes path support.

--
Greg


Re: [Python-Dev] Bytes path support

2014-08-20 Thread Terry Reedy

On 8/20/2014 9:01 AM, Antoine Pitrou wrote:

> On 20/08/2014 07:08, Nick Coghlan wrote:
>>
>> It's not just the JVM that says text and binary APIs should be separate
>> - it's every widely used operating system services layer except POSIX.
>> The POSIX way works well *if* everyone reliably encodes things as UTF-8
>> or always uses encoding detection, but its failure mode is unfortunately
>> silent data corruption.
>>
>> That said, there's a lot of Python software that is POSIX specific,
>> where bytes paths would be the least of the barriers to porting to
>> Windows or Jython. I'm personally +1 on consistently allowing binary
>> paths in lower level APIs, but disallowing them in higher level
>> explicitly cross platform abstractions like pathlib.
>
> I fully agree with Nick's position here.
>
> To elaborate specifically about pathlib, it doesn't handle bytes paths
> but allows you to generate them if desired:
> https://docs.python.org/3/library/pathlib.html#operators
>
> Adding full bytes support to pathlib would have added a lot of
> complication and fragility in the implementation *and* in the API (is it
> allowed to combine str and bytes paths? should they have separate
> classes?), for arguably little benefit.

I am glad you did not recreate the madness of pre 3.0 Python in that regard.

> I think if you want low-level features (such as unconverted bytes paths
> under POSIX), it is reasonable to point you to low-level APIs.


Do our docs somewhere explain the idea that file names are conceptually
*names*, not arbitrary bytes; explain the concept of low-level versus
high-level APIs; and point to the two types of APIs in Python?


--
Terry Jan Reedy




Re: [Python-Dev] Bytes path support

2014-08-20 Thread Brett Cannon
On Wed Aug 20 2014 at 9:02:25 AM Antoine Pitrou  wrote:

> On 20/08/2014 07:08, Nick Coghlan wrote:
> >
> > It's not just the JVM that says text and binary APIs should be separate
> > - it's every widely used operating system services layer except POSIX.
> > The POSIX way works well *if* everyone reliably encodes things as UTF-8
> > or always uses encoding detection, but its failure mode is unfortunately
> > silent data corruption.
> >
> > That said, there's a lot of Python software that is POSIX specific,
> > where bytes paths would be the least of the barriers to porting to
> > Windows or Jython. I'm personally +1 on consistently allowing binary
> > paths in lower level APIs, but disallowing them in higher level
> > explicitly cross platform abstractions like pathlib.
>
> I fully agree with Nick's position here.
>
> To elaborate specifically about pathlib, it doesn't handle bytes paths
> but allows you to generate them if desired:
> https://docs.python.org/3/library/pathlib.html#operators
>
> Adding full bytes support to pathlib would have added a lot of
> complication and fragility in the implementation *and* in the API (is it
> allowed to combine str and bytes paths? should they have separate
> classes?), for arguably little benefit.
>
> I think if you want low-level features (such as unconverted bytes paths
> under POSIX), it is reasonable to point you to low-level APIs.
>

+1 from me as well. Allowing the low-level stuff to work on bytes while
keeping the high-level APIs actually high-level fits our consenting adults
policy and makes things possible, but not to the detriment of the common
case.


Re: [Python-Dev] Bytes path support

2014-08-20 Thread Antoine Pitrou

On 20/08/2014 07:08, Nick Coghlan wrote:

> It's not just the JVM that says text and binary APIs should be separate
> - it's every widely used operating system services layer except POSIX.
> The POSIX way works well *if* everyone reliably encodes things as UTF-8
> or always uses encoding detection, but its failure mode is unfortunately
> silent data corruption.
>
> That said, there's a lot of Python software that is POSIX specific,
> where bytes paths would be the least of the barriers to porting to
> Windows or Jython. I'm personally +1 on consistently allowing binary
> paths in lower level APIs, but disallowing them in higher level
> explicitly cross platform abstractions like pathlib.


I fully agree with Nick's position here.

To elaborate specifically about pathlib, it doesn't handle bytes paths 
but allows you to generate them if desired:

https://docs.python.org/3/library/pathlib.html#operators

Adding full bytes support to pathlib would have added a lot of 
complication and fragility in the implementation *and* in the API (is it 
allowed to combine str and bytes paths? should they have separate 
classes?), for arguably little benefit.
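
As a small illustration of "doesn't handle bytes paths but allows you to
generate them", here is a sketch using PurePosixPath so that it behaves the
same on any platform. The path components are arbitrary examples, and the
accepts_bytes flag is only there to record the outcome.

```python
import os
from pathlib import PurePosixPath

# Paths are built from str components only.
path = PurePosixPath("/tmp") / "data" / "report.txt"

# bytes() hands back the encoded form for use with low-level APIs;
# it is equivalent to os.fsencode(str(path)).
as_bytes = bytes(path)

# Going the other way is rejected: pathlib does not accept bytes input.
try:
    PurePosixPath(b"/tmp")
    accepts_bytes = True
except TypeError:
    accepts_bytes = False
```

The conversion is one-way by design: str inside pathlib, bytes only at the
boundary where a low-level call needs them.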


I think if you want low-level features (such as unconverted bytes paths 
under POSIX), it is reasonable to point you to low-level APIs.


Regards

Antoine.




Re: [Python-Dev] Bytes path support

2014-08-20 Thread Nick Coghlan
On 20 Aug 2014 04:18, "Marko Rauhamaa"  wrote:
>
> Tres Seaver :
>
> > On 08/19/2014 01:43 PM, Ben Hoyt wrote:
> >> Fair enough. I don't quite understand, though -- why is the "official
> >> policy" to kill something that's "essential" on *nix?
> >
> > ISTM that the policy is based on a fantasy that "it looks like text to
> > me in my use cases, so therefore it must be text for everyone."
>
> What I like about Python is that it allows me to write native linux code
> without having to make portability compromises that plague, say, Java. I
> have select.epoll(). I have os.fork(). I have socket.TCP_CORK. The
> "textualization" of Python3 seems part of a conscious effort to make
> Python more Java-esque.

It's not just the JVM that says text and binary APIs should be separate -
it's every widely used operating system services layer except POSIX. The
POSIX way works well *if* everyone reliably encodes things as UTF-8 or
always uses encoding detection, but its failure mode is unfortunately
silent data corruption.
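
A short sketch of what "silent data corruption" can look like in practice,
assuming a writer that encodes a name as UTF-8 and a reader that guesses
Latin-1. The file name is an arbitrary example.

```python
original = "caf\u00e9.txt"           # "café.txt"

# One program writes the name as UTF-8 bytes...
on_disk = original.encode("utf-8")

# ...and another reads those bytes assuming Latin-1. No exception is
# raised, because every byte value is valid Latin-1.
misread = on_disk.decode("latin-1")

corrupted = misread != original      # the damage goes unnoticed at decode time
```

Nothing fails at the point of the mistake; the wrong text simply propagates
to whatever consumes it next.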

That said, there's a lot of Python software that is POSIX specific, where
bytes paths would be the least of the barriers to porting to Windows or
Jython. I'm personally +1 on consistently allowing binary paths in lower
level APIs, but disallowing them in higher level explicitly cross platform
abstractions like pathlib.

Regards,
Nick.

>
>
> Marko


Re: [Python-Dev] Bytes path support

2014-08-20 Thread Paul Moore
On 20 August 2014 07:53, Ben Finney  wrote:
> "Stephen J. Turnbull"  writes:
>
>> Marko Rauhamaa writes:
>>  > Unix programmers, though, shouldn't be shielded from bytes.
>>
>> Nobody's trying to do that.  But Python users should be shielded from
>> Unix programmers.
>
> +1 QotW

That quote is actually almost a "hidden extra Zen of Python" IMO :-)
Both parts of it.

Paul


Re: [Python-Dev] Bytes path support

2014-08-19 Thread Ben Finney
"Stephen J. Turnbull"  writes:

> Marko Rauhamaa writes:
>  > Unix programmers, though, shouldn't be shielded from bytes.
>
> Nobody's trying to do that.  But Python users should be shielded from
> Unix programmers.

+1 QotW

-- 
 \“Intellectual property is to the 21st century what the slave |
  `\  trade was to the 16th.” —David Mertz |
_o__)  |
Ben Finney



Re: [Python-Dev] Bytes path support

2014-08-19 Thread Stephen J. Turnbull
Marko Rauhamaa writes:

 > Unix programmers, though, shouldn't be shielded from bytes.

Nobody's trying to do that.  But Python users should be shielded from
Unix programmers.


Re: [Python-Dev] Bytes path support

2014-08-19 Thread Stephen J. Turnbull
Guido van Rossum writes:
 > On Tuesday, August 19, 2014, Stephen J. Turnbull  wrote:
 > > Greg Ewing writes:

 > >  > So maybe the way to make bytes paths go away is to always
 > >  > use surrogateescape for paths on unix?
 > >
 > > Backward compatibility rules that out, I think.  I certainly would
 > > recommend that for new code, but even for new code there are many
 > > users who vehemently object to using Unicode as an intermediate
 > > representation of things they think of as binary blobs.  Not worth the
 > > hassle to even seriously propose removing those APIs IMO.
 > 
 > But maybe we don't have to add new ones?

IMO, we should avoid it.

There may be some use cases.  Sergiy mentions two bug reports.

http://bugs.python.org/issue19997 imghdr.what doesn't accept bytes paths
http://bugs.python.org/issue20797 zipfile.extractall should accept bytes path as parameter

I'm very unsympathetic to these.  In both cases the bytes are coming
from outside of module in question.  Why are they in bytes?  That
question should scare you, because from the point of view of end users
there are no good answers: they all mean that the end user is going to
end up with uninterpretable bytes in their directories, for the
convenience of the programmer.

In the case of issue20797, I'd be a *little* sympathetic if the RFE
were for the *members* argument.  zipfiles evidently have no way to
specify the encodings of the name(s) of their members (and the zipfile
module doesn't have APIs for it!), so the programmer is kind of stuck,
especially if the requirement is that the extraction require no user
intervention.  But again, this is rarely what the user wants.

I would be sympathetic to an internal, bytes-based, "kids, these stunts
are performed by trained professionals, do NOT try this at home" API,
with a sane user-oriented str-based API for ordinary use for this
module.  I suppose it might be useful for such a multi-type API to be
polymorphic, but it would have to be an "if there are bytes anywhere,
everything must be bytes and return values will be bytes" and
similarly for str kind of polymorphism.  No mixing bytes and strings,
period.
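
A minimal sketch of that "all bytes or all str" kind of polymorphism. The
helper join_path is hypothetical and is not part of zipfile or any other
stdlib module; it only shows the type rule.

```python
def join_path(directory, name):
    """Join two path components of the same type (hypothetical helper).

    bytes + bytes returns bytes, str + str returns str, and any mixture
    raises TypeError instead of guessing an encoding.
    """
    if isinstance(directory, bytes) and isinstance(name, bytes):
        sep = b"/"
    elif isinstance(directory, str) and isinstance(name, str):
        sep = "/"
    else:
        raise TypeError("cannot mix str and bytes path components")
    return directory.rstrip(sep) + sep + name


text_result = join_path("/tmp", "archive.zip")
bytes_result = join_path(b"/tmp", b"archive.zip")

try:
    join_path("/tmp", b"archive.zip")
    mixing_allowed = True
except TypeError:
    mixing_allowed = False
```

Refusing the mixed case keeps the encoding decision with the caller, which
is the same rule os.path.join applies.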





Re: [Python-Dev] Bytes path support

2014-08-19 Thread Marko Rauhamaa
Guido van Rossum :

> With my serious hat on, I would like to claim that *conceptually*
> filenames are most definitely text. Due to various historical
> accidents the UNIX system calls often encoded text as arguments, and
> we sometimes need to control that encoding.

Due to historical accidents, text (in the Python sense) is not a
first-class data type in Unix. Text, machine language, XML, Python etc
are interpretations of bytes. Bytes are the first-class data type
recognized by the kernel. That reality cannot be wished away.

> Hence the occasional need for bytes arguments. But most of the time
> you don't have to think about that, and forcing users to worry about
> it is mostly as counter-productive as forcing them to think about the
> encoding of every text file.

The users of Python programs can often be given higher-level facades.
Unix programmers, though, shouldn't be shielded from bytes.


Marko


Re: [Python-Dev] Bytes path support

2014-08-19 Thread Guido van Rossum
On Tuesday, August 19, 2014, Stephen J. Turnbull  wrote:

> Greg Ewing writes:
>  > Stephen J. Turnbull wrote:
>  >
>  > > This case can be handled now using the surrogateescape
>  > > error handler,
>  >
>  > So maybe the way to make bytes paths go away is to always
>  > use surrogateescape for paths on unix?
>
> Backward compatibility rules that out, I think.  I certainly would
> recommend that for new code, but even for new code there are many
> users who vehemently object to using Unicode as an intermediate
> representation of things they think of as binary blobs.  Not worth the
> hassle to even seriously propose removing those APIs IMO.


But maybe we don't have to add new ones?

--Guido


-- 
--Guido van Rossum (on iPad)


Re: [Python-Dev] Bytes path support

2014-08-19 Thread Stephen J. Turnbull
Greg Ewing writes:
 > Stephen J. Turnbull wrote:
 > 
 > > This case can be handled now using the surrogateescape
 > > error handler,
 > 
 > So maybe the way to make bytes paths go away is to always
 > use surrogateescape for paths on unix?

Backward compatibility rules that out, I think.  I certainly would
recommend that for new code, but even for new code there are many
users who vehemently object to using Unicode as an intermediate
representation of things they think of as binary blobs.  Not worth the
hassle to even seriously propose removing those APIs IMO.


Re: [Python-Dev] Bytes path support

2014-08-19 Thread Guido van Rossum
I'm sorry my moment of levity was taken so seriously.

With my serious hat on, I would like to claim that *conceptually* filenames
are most definitely text. Due to various historical accidents the UNIX
system calls often encoded text as arguments, and we sometimes need to
control that encoding. Hence the occasional need for bytes arguments. But
most of the time you don't have to think about that, and forcing users to
worry about it is mostly as counter-productive as forcing them to think
about the encoding of every text file.

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Bytes path support

2014-08-19 Thread Greg Ewing

Stephen J. Turnbull wrote:

> This case can be handled now using the surrogateescape
> error handler,


So maybe the way to make bytes paths go away is to always
use surrogateescape for paths on unix?

--
Greg


Re: [Python-Dev] Bytes path support

2014-08-19 Thread Greg Ewing

Ben Hoyt wrote:

> Does that mean that new APIs should explicitly not support bytes?
> ... Bytes paths are essentially broken on Windows.

But on Unix, paths are essentially bytes. What's the
official policy for dealing with that?

--
Greg


Re: [Python-Dev] Bytes path support

2014-08-19 Thread Stephen J. Turnbull
Ben Hoyt writes:

 > Fair enough. I don't quite understand, though -- why is the "official
 > policy" to kill something that's "essential" on *nix?

They're not essential on *nix.  Unix paths at the OS level are "just
bytes" (even on Mac, although the most common Mac filesystem does
enforce UTF-8 Unicode NFD).  This use case is now perfectly well
served by codecs.

However, there are a lot of applications that involve reading a file
name from a directory, and passing it verbatim to another OS
function.  This case can be handled now using the surrogateescape
error handler, but when these APIs were introduced we didn't even have
a reliable way to roundtrip filenames because a Unix filename doesn't
need to be a string of characters from *any* character set.

And there's the undeniable convenience of treating file names as
opaque objects in those applications.
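
A minimal sketch of that roundtrip, written with explicit codec calls so it
does not depend on the platform's filesystem encoding. The byte string is a
made-up Latin-1 file name that is not valid UTF-8.

```python
raw_name = b"caf\xe9.txt"  # 0xE9 alone is not valid UTF-8

# Undecodable bytes become lone surrogates (0xE9 -> U+DCE9) instead of
# raising UnicodeDecodeError.
text_name = raw_name.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text_name.encode("utf-8", errors="surrogateescape")

# A strict encode refuses the escaped string, so it cannot leak out
# as if it were ordinary text.
try:
    text_name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

On POSIX, os.fsdecode() and os.fsencode() apply this same error handler with
the filesystem encoding, which is how a str name read from a directory can
be passed verbatim to another OS function.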

Regards,



Re: [Python-Dev] Bytes path support

2014-08-19 Thread Marko Rauhamaa
Tres Seaver :

> On 08/19/2014 01:43 PM, Ben Hoyt wrote:
>> Fair enough. I don't quite understand, though -- why is the "official
>> policy" to kill something that's "essential" on *nix?
>
> ISTM that the policy is based on a fantasy that "it looks like text to
> me in my use cases, so therefore it must be text for everyone."

What I like about Python is that it allows me to write native linux code
without having to make portability compromises that plague, say, Java. I
have select.epoll(). I have os.fork(). I have socket.TCP_CORK. The
"textualization" of Python3 seems part of a conscious effort to make
Python more Java-esque.


Marko


Re: [Python-Dev] Bytes path support

2014-08-19 Thread Antoine Pitrou

On 19/08/2014 13:43, Ben Hoyt wrote:
>>>> The official policy is that we want them [support for bytes paths in
>>>> stdlib functions] to go away, but reality so far has not budged. We
>>>> will continue to hold our breath though. :-)
>>>
>>> Does that mean that new APIs should explicitly not support bytes? I'm
>>> thinking of os.scandir() (PEP 471), which I'm implementing at the
>>> moment. I was originally going to make it support bytes so it was
>>> compatible with listdir, but maybe that's a bad idea. Bytes paths are
>>> essentially broken on Windows.
>>
>> Bytes paths are "essential" on Unix, though, so I don't think we should
>> create new low-level APIs that don't support bytes.
>
> Fair enough. I don't quite understand, though -- why is the "official
> policy" to kill something that's "essential" on *nix?


PEP 383 should actually work on Unix quite well, AFAIR.

Regards

Antoine.



