Re: [Twisted-Python] Twisted 16.6.0rc1 Release Candidate Announcement

Glyph Lefkowitz Thu, 17 Nov 2016 11:02:06 -0800

> On Nov 17, 2016, at 6:43 AM, Mark Williams <[email protected]> wrote:
> 
> On Wed, Nov 16, 2016 at 11:22:49PM -0800, Glyph Lefkowitz wrote:
>> However, is it really a regression to have py3 support for Words that just 
>> doesn't support other encodings yet?  It strikes me that this is just a bug, 
>> and that we should just fall back from UTF-8 to latin-1 in this scenario.  
>> But adding that fallback is a small additional fix (perhaps one that should 
>> be slated for 16.6.0 if you want to make it).
> 
> Falling back to latin-1 will address the most obvious issue exposed by
> the client in the re-opened ticket.  It will not fix the general issue.


This doesn't appear to be an answer to the "is it a regression" question though 
;-).  I'm still curious what you think there.

The _general_ issue is unfixable, except to use chardet upon encoding errors.  
As far as I'm aware, IRC simply doesn't have the ability to specify an encoding.
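
A minimal sketch of the UTF-8-then-latin-1 fallback discussed above, in plain 
Python.  The name decode_irc_line is hypothetical and not part of Twisted; the 
sketch assumes the line has already been split off the wire as bytes.

```python
def decode_irc_line(raw: bytes) -> str:
    """Decode one IRC line, preferring UTF-8 and falling back to latin-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # latin-1 maps every byte to a code point, so this cannot fail,
        # but the result may be mojibake if the sender used another encoding.
        return raw.decode("latin-1")


print(decode_irc_line(b"caf\xc3\xa9"))  # valid UTF-8
print(decode_irc_line(b"caf\xe9"))      # invalid UTF-8, decoded as latin-1
```

Note that a windows-1251 or Big5 message would also "succeed" here, just with 
the wrong characters, which is why this only addresses the most obvious case.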

More importantly, IRC doesn't specify an encoding and it is also responsible 
for transmitting textual data intended to be input and consumed by humans.  If 
you can't decode it, faithfully replicating the on-the-wire encoding is of 
limited utility.  You can't write any code to process the data.

> Note that my sample was heavily biased towards European servers.
> Other IRC servers in other regions might prefer a different 8-bit
> encoding, like windows-1251 or Big5.  And often a single server will
> see a long tail (or at least a tail) of different 8-bit encodings.
> Listing all channels on a server, as the example script does, cannot
> be done with an implementation that decodes input as text prior to
> parsing it.  It's even possible to use chardet to detect encodings.

If chardet is installed, can it be specified as an encoding itself?  Like, 
b"garbage garbage".decode("chardet")?  This would make it possible to use 
without binding to the library; you just specify an encoding.  (The library is 
LGPL2.1 which makes it a problematic dependency for Twisted, even optionally.)
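
For illustration, here is roughly what such a pseudo-encoding could look like 
using the standard library's codecs.register.  This is a sketch under 
assumptions: the codec name "chardet" and all helper names are made up here, 
the sketch does not rely on chardet registering anything itself, and it 
degrades to UTF-8 then latin-1 when chardet is missing or guesses badly.

```python
import codecs


def _guess_encoding(data: bytes) -> str:
    """Return a best-guess encoding name for data."""
    try:
        import chardet  # optional; only used if the user installed it
    except ImportError:
        return "utf-8"
    return chardet.detect(data).get("encoding") or "utf-8"


def _guessing_decode(data, errors="strict"):
    data = bytes(data)
    try:
        text = data.decode(_guess_encoding(data), errors)
    except (UnicodeDecodeError, LookupError):
        # A wrong or unknown guess: latin-1 accepts any byte sequence.
        text = data.decode("latin-1")
    return text, len(data)


def _no_encode(text, errors="strict"):
    raise NotImplementedError("the 'chardet' pseudo-codec is decode-only")


def _search(name):
    # Called by the codecs machinery with the normalized encoding name.
    if name == "chardet":
        return codecs.CodecInfo(
            name="chardet", encode=_no_encode, decode=_guessing_decode
        )
    return None


codecs.register(_search)

print(b"garbage garbage".decode("chardet"))
```

With something like this registered by the application, a library would only 
need to accept an encoding name and never import the detector itself.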

> IRC's encoding situation mirrors file systems' one on POSIX.  A given
> path's components can be in multiple encodings.  I believe at least
> part of the reason FilePath's paths are bytes, even when
> surrogateescape exists, is that Unicode paths on POSIX systems would
> make FilePath unusable for perfectly valid use cases.  We can pretend
> that IRC has a defined encoding, but doing so will make it unusable for
> perfectly valid use cases.

Here we go :-).

POSIX has an internally inconsistent model of how encodings work; it cannot 
possibly function correctly.

First off, let me put to rest the lie that paths are "really" bytes.  Paths are 
text.  They must be text because they have to transit through text-processing 
systems, such as windowing systems and terminal programs.  Users must be 
able to visually identify and select them, as text.

This is significant because certain operations on paths-as-bytes will 
inevitably fail.  You can't type an invalidly-encoded pathname in your shell.  
If two paths differ by an incorrectly-encoded character you won't be able to 
visually distinguish between them without inspecting their contents.  This is 
why OS X forces all paths to be UTF-8, and why paths are "really" unicode 
(UTF-16, historically UCS-2) on Windows.

There's POSIX metadata which allows you to select an encoding: the locale.  But 
the locale is per-process state, and, because you can have multiple 
filesystems mounted simultaneously, it's impossible for this metadata to fully 
describe the state of any arbitrary path.  The standard metadata is 
insufficient.  This is why UI toolkits like GTK+ have adopted the policy of 
"ignore the locale, paths are UTF-8, deal with it 🕶".  As far back as GTK2, 
non-utf-8 path selection has been deprecated: 
<https://developer.gnome.org/gtk2/stable/GtkFileSelection.html#gtk-file-selection-set-filename>.

While a mis-encoded path is a failure, there are ways to treat paths as a data 
structure to allow for only partial failure.  They're a data structure because 
they must be in an encoding that contains no NUL bytes and encodes SOLIDUS as 
the octet 0x2F, so you can fail on each individual path component; if you're 
lucky you don't need to present all the components in the path to manipulate it.

We don't do this in Twisted right now (as I was somewhat disappointed to 
discover while writing this), but we should, and more importantly we could; 
FilePath(b"\xff").child("valid").asTextMode().basename() could return u"valid" 
rather than returning an encoding error.
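
The per-component idea can be sketched without FilePath at all.  The function 
below is hypothetical (it is not a Twisted API) and assumes a POSIX-style 
bytes path split on the octet 0x2F; each component is decoded independently, 
so one bad component does not poison the rest.

```python
from typing import List, Optional


def decode_components(path: bytes, encoding: str = "utf-8") -> List[Optional[str]]:
    """Decode each '/'-separated component on its own; None marks a failure."""
    decoded = []
    for component in path.split(b"/"):
        try:
            decoded.append(component.decode(encoding))
        except UnicodeDecodeError:
            decoded.append(None)
    return decoded


parts = decode_components(b"\xff/valid")
print(parts)
```

Here the first component fails to decode while the second is still available 
as text, which is the partial failure described above.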

To bring all this back to IRC though:

Mis-encoded IRC messages are not data structures; they're just strings.  
There's no opportunity for partial recovery beyond chardet and mojibake.  In 
most cases, partial recovery requires configuration: per-channel or per-user 
encodings, for example, which have to be agreed upon out of band, in ways 
that IRC does not expose as metadata.
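
As a rough illustration of what that configuration amounts to, a client would 
have to carry a table like the one below.  The channel names, the table, and 
decode_for_channel are all invented for this sketch; none of it comes from 
IRC itself or from Twisted.

```python
from typing import Dict

# Hypothetical out-of-band agreements: channel name -> encoding.
CHANNEL_ENCODINGS: Dict[bytes, str] = {
    b"#moscow": "windows-1251",
    b"#taipei": "big5",
}


def decode_for_channel(channel: bytes, payload: bytes) -> str:
    """Decode payload using the channel's configured encoding, else UTF-8."""
    encoding = CHANNEL_ENCODINGS.get(channel, "utf-8")
    # errors="replace" keeps the message readable when the agreement is broken.
    return payload.decode(encoding, errors="replace")


print(decode_for_channel(b"#moscow", b"\xcf\xf0\xe8\xe2\xe5\xf2"))
```

Nothing on the wire tells the client that this table is right; it only works 
as long as every participant honors the same agreement.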

Given this situation, the only reasonable way forward as a community is to tell 
users that using anything other than UTF-8 is a misconfiguration, and to get 
all those out-of-band agreements switched over to it.

-glyph
