Re: [Twisted-Python] Twisted 16.6.0rc1 Release Candidate Announcement

2016-11-18 Thread Glyph Lefkowitz

> On Nov 18, 2016, at 12:13 AM, Mark Williams wrote:
> 
> On Thu, Nov 17, 2016 at 11:00:13AM -0800, Glyph Lefkowitz wrote:
>> 
>> This doesn't appear to be an answer to the "is it a regression" question 
>> though ;-).  I'm still curious what you think there.
> 
> It's not a shipped feature so it can't be a regression.  But if the
> feature doesn't work it shouldn't be shipped.

"doesn't work" is a pretty black-and-white assessment.  Are you anticipating a 
problem with the way the interface is specified, such that it can't easily be 
changed later?

I should say up front here that I think I was being too emphatic in my support 
for UTF-8.  We absolutely must support the ability to decode other encodings.  
I don't think that means we need support for access to raw bytes.

> I did consult the policy manual before opening the revert PR.  Here's what
> seemed most relevant:
> 
> https://twistedmatrix.com/trac/wiki/ReviewProcess#Revertingachange
> 
> This, and the other revert documents, focus on test regressions.  But
> I opened the PR because of the above link's mention of "undesirable."
> Is there a better resource that explains when a revert is appropriate?

Test regressions are listed because they're unambiguously cause for a revert; 
"undesirable" is intentionally vague because we might decide to revert a thing 
for no reason.  I guess opening a PR for a discussion like this is reasonable.

This could be considered an incompatible interface change; I'm honestly not 
sure about the exact type signatures of various methods to say whether it is or 
not.

>> The _general_ issue is unfixable, except to use chardet upon encoding 
>> errors.  As far as I'm aware, IRC simply doesn't have the ability to specify 
>> an encoding.
> 
> IRCv3 (http://ircv3.net/) is attempting to mandate utf-8 for certain
> protocol elements (usernames and metadata).  But it needs to be
> backwards compatible, so it can't mandate it for all messages.  And it
> is not IRC as specified by RFC1459.  So no, no defined encoding.

Not only "no defined encoding" but also no mechanism like HTTP headers to say 
what the encoding is.

>> More importantly, IRC doesn't specify an encoding and it is also responsible 
>> for transmitting textual data intended to be input and consumed by humans.  
>> If you can't decode it, faithfully replicating the on-the-wire encoding is 
>> of limited utility.  You can't write any code to process the data.
> 
> I can write code that uses the encoding that makes sense for my use
> case.  I can't if we mandate utf-8, even when I receive perfectly
> valid IRC messages.

Sorry, I haven't been separating out my lines of reasoning clearly enough here.

My points are, separately:

1. IRC is text. It's nonsensical to process it as bytes, because you can't
   process it as bytes.  This is separate from the question of "what encoding
   is IRC".
2. UTF-8 is good. There should be gradual social pressure to use UTF-8
   everywhere (I'm a fan of http://utf8everywhere.org).  This is especially
   true in protocols like IRC and filenames where there's no mechanism to
   specify an encoding so that it can be correctly decoded.  Therefore:
   2.1. an initial release which features UTF-8 only is fine; therefore
        there's no need to do a revert.
   2.2. defaulting to UTF-8 is reasonable for the foreseeable future; users
        should only change this if they know that they want something unusual.
3. IRC is an incompatible and broken wasteland; thanks to your quantitative
   research we know exactly how broken.  Therefore:
   3.1. "support alternate encodings" is a valuable feature.  Supporting point
        2.1, this feature can be added on at any later point, making a revert
        of the present implementation unnecessary.
   3.2. We can, and should, just go ahead and add support for alternate
        (per-server, per-channel, per-user) default and fallback encodings.
   3.3. We should always have a fallback encoding, since blowing up on
        "invalid" data on a protocol where there's no standard to say what is
        or isn't valid doesn't seem very helpful.
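
To make the default-plus-fallback idea concrete, here is a minimal sketch. The
function name and its parameters are hypothetical, chosen only for
illustration; nothing here is Twisted's actual API. It assumes UTF-8 as the
default and latin-1 as the fallback, since latin-1 assigns a code point to
every byte and so can never fail to decode.

```python
def decode_irc_text(raw, preferred="utf-8", fallback="latin-1"):
    """Decode raw bytes with the preferred encoding, else the fallback.

    Hypothetical helper: the names and defaults are assumptions, not part
    of any existing library.
    """
    try:
        return raw.decode(preferred)
    except UnicodeDecodeError:
        # latin-1 maps all 256 byte values, so this decode cannot raise.
        return raw.decode(fallback)


# Valid UTF-8 is decoded as UTF-8.
utf8_result = decode_irc_text("h\u00e9llo".encode("utf-8"))
# A lone 0xE9 byte is not valid UTF-8, so the fallback is used instead.
legacy_result = decode_irc_text(b"h\xe9llo")
```

A per-server, per-channel or per-user variant could look the two encoding
names up in a mapping before calling a helper like this one.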

>> If chardet is installed, can it be specified as an encoding itself?  Like, 
>> b"garbage garbage".decode("chardet")?  This would make it possible to use 
>> without binding to the library; you just specify an encoding.  (The library 
>> is LGPL2.1 which makes it a problematic dependency for Twisted, even 
>> optionally.)
> 
> It does not, but if that makes it more generally usable you've given a
> great idea for my next PyPI package :)

Let me know :).

>> POSIX has an internally inconsistent model of how encodings work; they 
>> cannot possibly function correctly.
>> 
>> First off, let me put to rest the lie that paths are "really" bytes.  Paths 
>> are text.  They must be text because they have to transit through 
>> text-processing systems, such as windowing systems and terminal 
>> programs.  Users must be able to visually identify and select them, as text.
>> 
>> This is significant because certain operations on paths-as-bytes will 
>> inevita

Re: [Twisted-Python] Twisted 16.6.0rc1 Release Candidate Announcement

2016-11-18 Thread Mark Williams
On Thu, Nov 17, 2016 at 11:00:13AM -0800, Glyph Lefkowitz wrote:
>
> This doesn't appear to be an answer to the "is it a regression" question 
> though ;-).  I'm still curious what you think there.

It's not a shipped feature so it can't be a regression.  But if the
feature doesn't work it shouldn't be shipped.

I did consult the policy manual before opening the revert PR.  Here's what
seemed most relevant:

https://twistedmatrix.com/trac/wiki/ReviewProcess#Revertingachange

This, and the other revert documents, focus on test regressions.  But
I opened the PR because of the above link's mention of "undesirable."
Is there a better resource that explains when a revert is appropriate?

> The _general_ issue is unfixable, except to use chardet upon encoding errors. 
>  As far as I'm aware, IRC simply doesn't have the ability to specify an 
> encoding.

IRCv3 (http://ircv3.net/) is attempting to mandate utf-8 for certain
protocol elements (usernames and metadata).  But it needs to be
backwards compatible, so it can't mandate it for all messages.  And it
is not IRC as specified by RFC1459.  So no, no defined encoding.

> More importantly, IRC doesn't specify an encoding and it is also responsible 
> for transmitting textual data intended to be input and consumed by humans.  
> If you can't decode it, faithfully replicating the on-the-wire encoding is of 
> limited utility.  You can't write any code to process the data.

I can write code that uses the encoding that makes sense for my use
case.  I can't if we mandate utf-8, even when I receive perfectly
valid IRC messages.

>
> If chardet is installed, can it be specified as an encoding itself?  Like, 
> b"garbage garbage".decode("chardet")?  This would make it possible to use 
> without binding to the library; you just specify an encoding.  (The library 
> is LGPL2.1 which makes it a problematic dependency for Twisted, even 
> optionally.)
>

It does not, but if that makes it more generally usable you've given a
great idea for my next PyPI package :)
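
A rough sketch of how such a package might hook into Python's codec registry
follows. To keep it runnable with the standard library alone, the guessing
step is a hypothetical stand-in that just tries a fixed list of encodings; a
real package would presumably call chardet.detect() at that point. The codec
name "guess" and all function names are assumptions made for illustration.

```python
import codecs

# Assumed order of guesses; a real detector would inspect the data instead.
CANDIDATES = ("utf-8", "cp1252")


def _guess_decode(data, errors="strict"):
    raw = bytes(data)
    for name in CANDIDATES:
        try:
            return raw.decode(name), len(raw)
        except UnicodeDecodeError:
            continue
    # latin-1 accepts every byte, so decoding always yields something.
    return raw.decode("latin-1"), len(raw)


def _guess_encode(text, errors="strict"):
    # Encoding needs no guessing; this sketch always writes UTF-8.
    return text.encode("utf-8", errors), len(text)


def _search(name):
    # Codec search functions return None for names they do not handle.
    if name != "guess":
        return None
    return codecs.CodecInfo(
        encode=_guess_encode, decode=_guess_decode, name="guess"
    )


codecs.register(_search)

decoded_legacy = b"caf\xe9".decode("guess")
decoded_utf8 = "caf\u00e9".encode("utf-8").decode("guess")
```

With the search function registered, callers only ever name an encoding
string, so they never import the detection library themselves.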

> POSIX has an internally inconsistent model of how encodings work; they cannot 
> possibly function correctly.
>
> First off, let me put to rest the lie that paths are "really" bytes.  Paths 
> are text.  They must be text because they have to transit through 
> text-processing systems, such as windowing systems and terminal 
> programs.  Users must be able to visually identify and select them, as text.
>
> This is significant because certain operations on paths-as-bytes will 
> inevitably fail.  You can't type an invalidly-encoded pathname in your shell. 
>  If two paths differ by an incorrectly-encoded character you won't be able to 
> visually distinguish between them without inspecting their contents.  This is 
> why OS X forces all paths to be UTF-8, and why paths are "really" unicode 
> (UCS-2, precisely) on Windows.
>
> There's POSIX metadata which allows you to select an encoding; locale.  But, 
> locale is per-process state, and, due to the fact that you can have multiple 
> filesystems mounted simultaneously, it's impossible for this metadata to 
> fully describe the state of any arbitrary path.  The standard metadata is 
> insufficient.  This is why UI toolkits like GTK+ have adopted the policy of 
> "ignore the locale, paths are UTF-8, deal with it 🕶".  As far back as GTK2, 
> non-utf-8 path selection has been deprecated: 
>   
> >.
>

When I received Arabic PDFs on a FAT16 USB drive with filenames in
CP1256, I had to switch mlterm to that particular code page to read
the directory listings so I could use convmv to convert them to UTF-8.
I'll note that this was impossible to do with a GTK-based tool.
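
For illustration, the conversion step can be sketched in Python, assuming the
source encoding is already known to be CP1256. This is not how convmv itself
is implemented; the function names are hypothetical. The point is that the
names are read and renamed as bytes, which only works because the layers
underneath hand the bytes over untouched.

```python
import os


def reencode_name(raw_name, source_encoding="cp1256"):
    """Return the UTF-8 bytes for a filename stored in source_encoding."""
    return raw_name.decode(source_encoding).encode("utf-8")


def convert_directory(path, source_encoding="cp1256"):
    """Rename every entry in path whose name is not valid UTF-8."""
    # Passing bytes to os.listdir returns the names as raw bytes.
    root = os.fsencode(path)
    for raw_name in os.listdir(root):
        try:
            raw_name.decode("utf-8")
            continue  # already valid UTF-8, leave it alone
        except UnicodeDecodeError:
            pass
        new_name = reencode_name(raw_name, source_encoding)
        os.rename(os.path.join(root, raw_name), os.path.join(root, new_name))


# Arabic letters meem, lam, feh plus an extension, as stored in CP1256.
legacy_bytes = "\u0645\u0644\u0641.pdf".encode("cp1256")
converted = reencode_name(legacy_bytes)
```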

Opinionated software is fine when it operates at the point of user
interpretation.

mlterm had to decode the stuff as unicode so X could display the
graphemes.  But if Linux's FAT16 implementation decided that we should
all quit whining and use UTF-8, even though no other FAT16
implementation requires this, it wouldn't have mattered what mlterm
could or couldn't do and I would have lost those files.  And it would
have been incredibly confounding to me, because everything would have
agreed that I had a FAT16 partition, but only Linux would have
mysteriously failed to read it.

Similarly, Twisted provides an IRC *library*.  It's a Python API, not
irssi or Textual.  The ultimate consumer of what passes through it may
be a human, but the next consumer might not be.  What if I want to
write a bot that bridges two IRC networks?  What if I want to
dump the raw IRC data to a file so I can train a tensorflow version of
chardet?  There's nothing in the IRC specification that prevents me
from doing this, but there will be something in Twisted's
implementation that does.
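
A small sketch of the bridging case, with a made-up message and hypothetical
function names (none of this is Twisted's API): once a line has been decoded
as UTF-8 with replacement characters, the original bytes cannot be recovered
for the second network. Forwarding the bytes untouched keeps them, and so
does Python's "surrogateescape" error handler, which is shown only as one
possible middle ground.

```python
def relay_lossy(raw):
    """Decode as UTF-8, replacing undecodable bytes, then re-encode."""
    return raw.decode("utf-8", errors="replace").encode("utf-8")


def relay_raw(raw):
    """Forward the bytes exactly as they arrived."""
    return raw


def relay_surrogateescape(raw):
    """Decode to text, smuggling undecodable bytes through as surrogates."""
    text = raw.decode("utf-8", errors="surrogateescape")
    return text.encode("utf-8", errors="surrogateescape")


# 0xE9 is latin-1 for an accented e; on its own it is not valid UTF-8.
line = b"PRIVMSG #chan :caf\xe9\r\n"

lossy = relay_lossy(line)
exact = relay_raw(line)
escaped = relay_surrogateescape(line)
```

Only the first function changes what the other network receives.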

> While a mis-encoded path is a failure, there are w