Re: [Twisted-Python] Twisted 16.6.0rc1 Release Candidate Announcement

2016-11-22 Thread Glyph Lefkowitz

> On Nov 22, 2016, at 21:36, Mark Williams  wrote:
> 
> 
> 
> On Tuesday, November 22, 2016, Glyph Lefkowitz wrote:
> 
> Okay.  So.
> 
> The rule for reverts like this is: if you do something today, which is 
> correct usage of the API and produces an observably correct result, will that 
> be broken in the future if we fix it?  If so, then we need to revert because 
> the interface as released is unsupportable.
> 
> As it stands, we have a matrix of 4 behaviors:
> 
> 
>          bytes     text(ascii)   text(nonascii)
> py2      works     works         UnicodeDecodeError
> py3      garbage   works         works
> 
> This... is actually... fine, surprisingly.
>  
> Given that matrix, how would this work on Python 2 and 3:
> 
> https://github.com/buildbot/buildbot/blob/40d5dd3d101704aa8db582e306b3c6cf7921c23c/master/buildbot/reporters/irc.py#L67-L68
>  
> 
It wouldn't work on Python 3 yet.  But that's fine: the point is that it 
wouldn't work!  Buildbot can just block porting on that bug.

> And how would that code not have to change if a future release accommodates 
> Unicode on Python 2 or bytes on Python 3?

Because it will get broken / undefined behavior on the current implementation.  
We can always fix broken behavior!  What we can't do is fix broken behavior 
that also breaks other correct behavior or workarounds.  But in this case, 
there's a broken behavior (which we have on trunk) and a correct behavior 
(which we can implement in the future) and no way to coerce the broken behavior 
to do something valid via public API.

-glyph
___
Twisted-Python mailing list
Twisted-Python@twistedmatrix.com
http://twistedmatrix.com/cgi-bin/mailman/listinfo/twisted-python


Re: [Twisted-Python] Twisted 16.6.0rc1 Release Candidate Announcement

2016-11-22 Thread Glyph Lefkowitz

> On Nov 22, 2016, at 21:35, John Santos  wrote:
> 
> Shouldn't this be "if you pass non-ascii text on py2, you'll get ..." ?

Yes. Thanks for that catch :).

-g


Re: [Twisted-Python] Twisted 16.6.0rc1 Release Candidate Announcement

2016-11-22 Thread John Santos


Been lurking here, no cows in the fire, no irons in the race, or 
whatever, except wanting Twisted to be perfect and easy to use and being 
perennially confused by text encoding, but I did notice this:


On 11/22/2016 9:03 PM, Glyph Lefkowitz wrote:

[...]


Okay.  So.

The rule for reverts like this is: if you do something today, which is 
correct usage of the API and produces an observably correct result, 
will that be broken in the future if we fix it?  If so, then we need 
to revert because the interface as released is unsupportable.


As it stands, we have a matrix of 4 behaviors:



         bytes     text(ascii)   text(nonascii)
py2      works     works         UnicodeDecodeError
py3      garbage   works         works


This... is actually... fine, surprisingly.

The /right/ thing to do is to write code that passes text all the 
time.  If you do that right now, it'll work on py3 and raise an 
exception on py2, unless it /happens/ to be ASCII, in which case it'll 
work.


If you write code that passes bytes on py3, it'll just be garbage. 
 But, we want to deprecate that anyway, and you can't get correct, 
usable behavior out of it, no matter what workarounds you stuff in; so 
it's a bug, and can be fixed like any bug.


Similarly if you pass non-ascii text on py3, you'll get a 
UnicodeDecodeError.


Shouldn't this be "if you pass non-ascii text on *py2*, you'll get ..." ?

[...]

-glyph



Pedantically yours,

--
John Santos
Evans Griffiths & Hart, Inc.
781-861-0670 ext 539



Re: [Twisted-Python] Twisted 16.6.0rc1 Release Candidate Announcement

2016-11-22 Thread Mark Williams
On Tuesday, November 22, 2016, Glyph Lefkowitz 
wrote:
>
>
> Okay.  So.
>
> The rule for reverts like this is: if you do something today, which is
> correct usage of the API and produces an observably correct result, will
> that be broken in the future if we fix it?  If so, then we need to revert
> because the interface as released is unsupportable.
>
> As it stands, we have a matrix of 4 behaviors:
>
>
>          bytes     text(ascii)   text(nonascii)
> py2      works     works         UnicodeDecodeError
> py3      garbage   works         works
>
> This... is actually... fine, surprisingly.
>

Given that matrix, how would this work on Python 2 and 3:

https://github.com/buildbot/buildbot/blob/40d5dd3d101704aa8db582e306b3c6cf7921c23c/master/buildbot/reporters/irc.py#L67-L68

And how would that code not have to change if a future release accommodates
Unicode on Python 2 or bytes on Python 3?


Re: [Twisted-Python] Twisted 16.6.0rc1 Release Candidate Announcement

2016-11-22 Thread Glyph Lefkowitz

> On Nov 22, 2016, at 20:27, Mark Williams  wrote:
> 
> On Tue, Nov 22, 2016 at 06:31:45PM -0500, Glyph Lefkowitz wrote:
>> 
>> OK, this whole time I thought we were talking about a sensible application 
>> of text_type to the API, perhaps with some leniency for bytes-ish-ness on 
>> python 2.  I haven't reviewed the PR, I was just responding to the concerns 
>> as raised on the list.
> 
> Sorry - I didn't mean to steer this towards API bike shedding.
> 
>> If it's just randomly encoding on one version and not the other, and correct 
>> usage of the API depends on *users* doing 'if PY2:' in their own code, then 
>> perhaps Mark's concern is indeed well-founded and we should roll it back 
>> before 16.6.
>> 
> 
> Tristan's exactly right.  Furthermore, if we decide to make IRCClient
> call its various command methods with unicode strings on Python 2,
> we'll be breaking backwards compatibility.  This is what I meant when
> I wrote:
> 
> On Nov 20, 2016, at 19:35, Mark Williams  wrote:
>> 
>> Yes.  Here's the lede: IRCClient should deal in bytes and we should
>> introduce a ProtocolWrapper-like thing that encodes and decodes
>> command prefixes and parameters.  It should implement an interface,
>> and we can start with an implementation that only knows about UTF-8.
>> The obvious advantage of this is that you can more easily write
>> IRCClients that work on both Python 2 and 3.
>> 
> 
> But it totally wasn't clear - sorry!
> 
> Of course, I also want IRC client implementation that lets me get at
> bytes, but that's a discussion I'll move to a new thread.
> 
> Given the inconsistency between Python 2 and Python 3, do we proceed
> with the revert?

Okay.  So.

The rule for reverts like this is: if you do something today, which is correct 
usage of the API and produces an observably correct result, will that be broken 
in the future if we fix it?  If so, then we need to revert because the 
interface as released is unsupportable.

As it stands, we have a matrix of 4 behaviors:


         bytes     text(ascii)   text(nonascii)
py2      works     works         UnicodeDecodeError
py3      garbage   works         works

This... is actually... fine, surprisingly.

The right thing to do is to write code that passes text all the time.  If you 
do that right now, it'll work on py3 and raise an exception on py2, unless it 
happens to be ASCII, in which case it'll work.

If you write code that passes bytes on py3, it'll just be garbage.  But, we 
want to deprecate that anyway, and you can't get correct, usable behavior out 
of it, no matter what workarounds you stuff in; so it's a bug, and can be fixed 
like any bug.

Similarly if you pass non-ascii text on py3, you'll get a UnicodeDecodeError.

This is not a good situation, but it's totally fixable without breaking the 
interface.  We just fix the py2 version to accept text_type as well, and if 
Mark sneaks in a patch that makes py3 do the right thing with bytes, well, I 
don't know that I can stop him.

More importantly, it would probably be a smaller change to fix the methods (we 
could even fix them one at a time; say, action, join, etc) than to un-port and 
re-port the whole thing.
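
A minimal sketch of what "accept text_type as well" could look like for a single method. The helper names to_text and format_join are hypothetical and are not Twisted's actual code; the sketch also assumes UTF-8, which the thread treats as the initial default.

```python
# Hypothetical helpers, not part of Twisted: normalize an argument to text
# before it is formatted into an IRC line.
text_type = type(u"")


def to_text(value, encoding="utf-8"):
    """Return value as text, decoding bytes with the given encoding."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    if isinstance(value, text_type):
        return value
    raise TypeError("expected bytes or text, got %r" % (type(value),))


def format_join(channel):
    # Build the line as text; encoding happens once, at the transport.
    return u"JOIN %s" % (to_text(channel),)


line = format_join(u"#t\xebst")
wire = line.encode("utf-8")
```

With a helper like this, a caller passing bytes and a caller passing text end up with the same line, which is the property the matrix above is missing.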

So: yes, it's broken, and in a worse way than I thought.  To get it to the 
point where we can actually implement logic consistently between two versions, 
we need to add a flag to IRCClient's constructor which is default-false on py2 
and default-true on py3 which says "give me text", so that callbacks like 
privmsg and joined can start receiving text_type on py2 as well as py3; right 
now it has to receive str because they've previously received str.  But that's 
a separate issue.

I am open to the idea that I have evaluated this incorrectly though, since this 
has been possibly the most confusing change since 
https://twistedmatrix.com/trac/ticket/411.  But as of right now I still think 
we shouldn't revert.

-glyph



Re: [Twisted-Python] Twisted 16.6.0rc1 Release Candidate Announcement

2016-11-22 Thread Mark Williams
On Tue, Nov 22, 2016 at 06:31:45PM -0500, Glyph Lefkowitz wrote:
>
> OK, this whole time I thought we were talking about a sensible application of 
> text_type to the API, perhaps with some leniency for bytes-ish-ness on python 
> 2.  I haven't reviewed the PR, I was just responding to the concerns as 
> raised on the list.

Sorry - I didn't mean to steer this towards API bike shedding.

> If it's just randomly encoding on one version and not the other, and correct 
> usage of the API depends on *users* doing 'if PY2:' in their own code, then 
> perhaps Mark's concern is indeed well-founded and we should roll it back 
> before 16.6.
>

Tristan's exactly right.  Furthermore, if we decide to make IRCClient
call its various command methods with unicode strings on Python 2,
we'll be breaking backwards compatibility.  This is what I meant when
I wrote:

On Nov 20, 2016, at 19:35, Mark Williams  wrote:
>
> Yes.  Here's the lede: IRCClient should deal in bytes and we should
> introduce a ProtocolWrapper-like thing that encodes and decodes
> command prefixes and parameters.  It should implement an interface,
> and we can start with an implementation that only knows about UTF-8.
> The obvious advantage of this is that you can more easily write
> IRCClients that work on both Python 2 and 3.
>

But it totally wasn't clear - sorry!

Of course, I also want IRC client implementation that lets me get at
bytes, but that's a discussion I'll move to a new thread.

Given the inconsistency between Python 2 and Python 3, do we proceed
with the revert?

-Mark



Re: [Twisted-Python] Twisted 16.6.0rc1 Release Candidate Announcement

2016-11-22 Thread Glyph Lefkowitz

> On Nov 22, 2016, at 18:27, Tristan Seligmann  wrote:
> 
> On Wed, 23 Nov 2016 at 01:26 Tristan Seligmann wrote:
> if PY3:
> 
> Argh, the above should be if PY2 of course.

OK, this whole time I thought we were talking about a sensible application of 
text_type to the API, perhaps with some leniency for bytes-ish-ness on python 
2.  I haven't reviewed the PR, I was just responding to the concerns as raised 
on the list.

If it's just randomly encoding on one version and not the other, and correct 
usage of the API depends on *users* doing 'if PY2:' in their own code, then 
perhaps Mark's concern is indeed well-founded and we should roll it back before 
16.6.

-glyph



Re: [Twisted-Python] Twisted 16.6.0rc1 Release Candidate Announcement

2016-11-22 Thread Tristan Seligmann
On Wed, 23 Nov 2016 at 01:14 Tristan Seligmann 
wrote:

> On Tue, 22 Nov 2016 at 23:37 Glyph Lefkowitz 
> wrote:
>
>
> This is the part that I'm worried about.  It kinda seems like we're moving
> toward "native string" being the type used in IRCClient, and *that* is
> capital-W Wrong.  Native strings are for Python-native types only, i.e.
> docstrings and method names.
>
>
> Unless I'm misunderstanding, we're not "moving towards" it, we have *already
> arrived*: IRCClient deals in str (bytes) on Python 2, and str (unicode)
> on Python 3. Even if we want a unicode API, having it only exist on Python
> 3 seems incredibly confusing from a user standpoint, and would appear to
> require some absurd contortions to write client code that behaves
> approximately the same on both Python 2 and 3.
>

For example, as far as I can tell, the only way to write code to join a
channel named #tëst (UTF-8 encoded) is:

channel = u'#tëst'
if PY3:
    channel = channel.encode('utf-8')
client.join(channel)

On Python 3, client.join(b'#t\xc3\xabst') will try to send
JOIN b'#t\xc3\xabst', which is garbage, whereas on Python 2,
client.join(u'#t\xebst') will produce a UnicodeEncodeError.
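
The "garbage" case can be reproduced without Twisted. This is only a sketch: it assumes the command line is built by interpolating the argument into a text template, which is a guess about the implementation rather than something confirmed in this thread.

```python
channel_text = u"#t\xebst"
channel_bytes = channel_text.encode("utf-8")

# Interpolating text into a text template gives a line that can be encoded
# and sent.
good = u"JOIN %s" % (channel_text,)

# On Python 3, interpolating bytes into a text template embeds the repr of
# the bytes object, including the b'' wrapper and the escaped octets.
bad = u"JOIN %s" % (channel_bytes,)
```

On Python 3 the second line contains the literal characters b'...' instead of the channel name, so the server sees a channel that nobody asked for.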


Re: [Twisted-Python] Twisted 16.6.0rc1 Release Candidate Announcement

2016-11-22 Thread Tristan Seligmann
On Wed, 23 Nov 2016 at 01:26 Tristan Seligmann 
wrote:

> if PY3:
>

Argh, the above should be if PY2 of course.


Re: [Twisted-Python] Twisted 16.6.0rc1 Release Candidate Announcement

2016-11-22 Thread Tristan Seligmann
On Tue, 22 Nov 2016 at 23:37 Glyph Lefkowitz 
wrote:

>
> This is the part that I'm worried about.  It kinda seems like we're moving
> toward "native string" being the type used in IRCClient, and *that* is
> capital-W Wrong.  Native strings are for Python-native types only, i.e.
> docstrings and method names.
>

Unless I'm misunderstanding, we're not "moving towards" it, we have *already
arrived*: IRCClient deals in str (bytes) on Python 2, and str (unicode) on
Python 3. Even if we want a unicode API, having it only exist on Python 3
seems incredibly confusing from a user standpoint, and would appear to
require some absurd contortions to write client code that behaves
approximately the same on both Python 2 and 3.


Re: [Twisted-Python] Twisted 16.6.0rc1 Release Candidate Announcement

2016-11-22 Thread Glyph Lefkowitz

> On Nov 20, 2016, at 19:35, Mark Williams  wrote:
> 
> On Fri, Nov 18, 2016 at 05:36:16PM -0800, Glyph Lefkowitz wrote:
>> 
>> "doesn't work" is a pretty black-and-white assessment.  Are you anticipating 
>> a problem with the way the interface is specified that it can't be easily 
>> changed?
>> 
> 
> Yes.  Here's the lede:

Thank you for summarizing!  Point by point, here's my position:

> IRCClient should deal in bytes and we should introduce a ProtocolWrapper-like 
> thing that encodes and decodes
> command prefixes and parameters.

I disagree.  Any user-facing API should deal in unicode objects.  (There is one 
caveat here; there really should be a separate layer for dealing with text; 
IRCClient being a subclassing-based API pollutes the whole issue.  But that API 
shouldn't be public, so this is largely minutiae; the "right" answer here has 
nothing to do with bytes or text and everything to do with adopting .)

> It should implement an interface, and we can start with an implementation 
> that only knows about UTF-8.

We should have the implementation initially know about UTF-8, yes.

> The obvious advantage of this is that you can more easily write IRCClients 
> that work on both Python 2 and 3.

This is the part that I'm worried about.  It kinda seems like we're moving 
toward "native string" being the type used in IRCClient, and that is capital-W 
Wrong.  Native strings are for Python-native types only, i.e. docstrings and 
method names.


> I'm also not entirely sure of the consequences of this interface
> change.  I think it deserves more thought before it becomes an API
> that we have to support.  This is the primary reason I opened the
> revert PR.

One of the things that's informing my decision is that IRCClient is already an 
incredibly ill-defined API that probably needs to be deprecated and overhauled 
at some point.  However, in the intervening (what will almost certainly be a) 
decade, I'd like it to work on Python 3.

> I'm more precisely worried about the fact that the implementation
> raises a decoding exception that cannot be handled in user code when
> it receives non-UTF-8 messages,

The right way to deal with this is twofold:

1. Add the ability to specify both the "encoding" and the "errors" of the 
   relevant codec, so that we can choose error handling strategies.
2. (potentially, if you have very nuanced requirements for dealing with weird 
   encodings) write a codec that logs and handles its own errors.  (We probably 
   shouldn't be logging a traceback for encoding problems regardless, if it's 
   UnicodeDecodeError.  But that's something that can easily be fixed in 
   subsequent releases as well.)
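
Both halves of that can be sketched with the standard library's codecs module. The handler name "log_and_replace", the logger name, and the decode_line function are made up for this example; none of them is an existing Twisted API.

```python
import codecs
import logging

log = logging.getLogger("irc.decoding")


def log_and_replace(exc):
    """Error handler: log the bad bytes, substitute U+FFFD, keep going."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    log.warning("undecodable bytes %r at offset %d", bad, exc.start)
    # Return the replacement text and the position to resume decoding at.
    return (u"\ufffd", exc.end)


# "log_and_replace" is a made-up handler name for this sketch.
codecs.register_error("log_and_replace", log_and_replace)


def decode_line(raw, encoding="utf-8", errors="log_and_replace"):
    # "encoding" and "errors" are exactly the two knobs bytes.decode() takes.
    return raw.decode(encoding, errors)


decoded = decode_line(b"caf\xe9 time")
```

The same input decodes cleanly if the caller says it is latin-1, which is why exposing both parameters matters more than picking one perfect default.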

> and the fact that the line length checks occur prior to encoding, ensuring 
> mid-codepoint truncation. These issues also contributed to my revert.

Line length checks are a super interesting example because I think they also 
illustrate my concerns as well.

To properly do message-splitting (which is why we're checking line length), you 
have to:

1. check the length in octets (because it's actually a message-length limit in 
   octets, not a line-length limit in characters)
2. split the textual representation - ideally somewhere relevant like a word 
   break, which you can only detect in text!
3. try encoding again and ensure that the encoded representation is the correct 
   length, repeating if necessary.
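
Those three steps can be sketched as follows. split_message and max_octets are hypothetical names, the limit in the usage line is arbitrary, and the sketch assumes max_octets is large enough to hold any single encoded character.

```python
def split_message(text, max_octets, encoding="utf-8"):
    """Split text into chunks whose encoded form fits in max_octets."""
    chunks = []
    remaining = text
    while remaining:
        # Step 1: measure the encoded length, not the character count.
        if len(remaining.encode(encoding)) <= max_octets:
            chunks.append(remaining)
            break
        # Step 3: shrink the candidate (as text) until its encoding fits.
        cut = min(len(remaining), max_octets)
        while cut > 1 and len(remaining[:cut].encode(encoding)) > max_octets:
            cut -= 1
        # Step 2: prefer a word break inside (or right after) the candidate.
        space = remaining.rfind(u" ", 0, cut + 1)
        if space > 0:
            chunks.append(remaining[:space])
            remaining = remaining[space + 1:]
        else:
            chunks.append(remaining[:cut])
            remaining = remaining[cut:]
    return chunks


chunks = split_message(u"h\xe9llo w\xf6rld \xe9\xe9\xe9\xe9\xe9\xe9", 8)
```

Because every cut is made on the text and only then re-encoded, no chunk can end in the middle of a code point.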

This is an implementation-level bug though, not an interface-level one, so I'm 
also comfortable fixing this bug in the future.

>> My points are, separately:
>> 
>> IRC is text. It's nonsensical to process it as bytes, because you can't 
>> process it as bytes.  This is separate from the question of "what encoding 
>> is IRC".
> 
> It's nonsensical that it be finally presented to a human as raw bytes.
> I'm advocating for the decision to be made as late as possible.  That
> doesn't mean we can't provide an easy-to-use recoding client that we
> encourage people to turn to first.

You can't process it as bytes either, though.  In some cases you think you can, 
but then you get mid-codepoint truncation :-).

> But we can't have *a* fallback encoding.  My encoding detector program
> indicates that latin-1 is the second most popular encoding for
> European IRC servers, but Russian servers I sampled (not in
> netsplit.de's top 10) used a variety of Cyrillic encodings.

If you really want to do something this sophisticated (and, I should note: no 
other IRC clients or bots I'm aware of do, so I think you've got an 
unrealistically tight set of requirements) then you can just write your own 
single codec that composes a bunch of others, and install it.  Python's 
encoding system is extensible for exactly this reason :).
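
A rough sketch of such a composed codec, using only the standard library. The codec name "irc_fallback" and the candidate list are invented for illustration; a real survey-driven client would want a smarter choice than "UTF-8, then latin-1".

```python
import codecs

# Encodings to try in order.  latin-1 maps every byte to a character, so it
# never fails and acts as the last resort.
CANDIDATES = ("utf-8", "latin-1")


def _decode(data, errors="strict"):
    raw = bytes(data)
    for name in CANDIDATES[:-1]:
        try:
            return raw.decode(name, "strict"), len(raw)
        except UnicodeDecodeError:
            continue
    return raw.decode(CANDIDATES[-1], errors), len(raw)


def _encode(text, errors="strict"):
    # Always send the preferred encoding.
    return text.encode(CANDIDATES[0], errors), len(text)


def _search(name):
    # "irc_fallback" is a made-up codec name for this sketch.
    if name == "irc_fallback":
        return codecs.CodecInfo(encode=_encode, decode=_decode,
                                name="irc_fallback")
    return None


codecs.register(_search)

decoded_utf8 = b"caf\xc3\xa9".decode("irc_fallback")
decoded_latin1 = b"caf\xe9".decode("irc_fallback")
```

Once registered, the composed codec is selected by name like any other, so code that only accepts an "encoding" string can use it unchanged.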

> I also want to enable arbitrary recovery strategies for bad encodings.

This is totally not an IRC-specific thing though :-).

> For 

Re: [Twisted-Python] Twisted 16.6.0rc1 Release Candidate Announcement

2016-11-20 Thread Mark Williams
On Fri, Nov 18, 2016 at 05:36:16PM -0800, Glyph Lefkowitz wrote:
>
> "doesn't work" is a pretty black-and-white assessment.  Are you anticipating 
> a problem with the way the interface is specified that it can't be easily 
> changed?
>

Yes.  Here's the lede: IRCClient should deal in bytes and we should
introduce a ProtocolWrapper-like thing that encodes and decodes
command prefixes and parameters.  It should implement an interface,
and we can start with an implementation that only knows about UTF-8.
The obvious advantage of this is that you can more easily write
IRCClients that work on both Python 2 and 3.  I'll attempt to explain
others in the rest of this email.

>
> I should say up front here that I think I was being too emphatic in my 
> support for UTF-8.
>

Phew!

>
> Test regressions are listed because they're unambiguously cause for a revert; 
> "undesirable" is intentionally vague because we might decide to revert a 
> thing for no reason.  I guess opening a PR for a discussion like this is 
> reasonable.
>

Good to know!

> This could be considered an incompatible interface change; I'm honestly not 
> sure about the exact type signatures of various methods to say whether it is 
> or not.
>

I'm also not entirely sure of the consequences of this interface
change.  I think it deserves more thought before it becomes an API
that we have to support.  This is the primary reason I opened the
revert PR.

I'm more precisely worried about the fact that the implementation
raises a decoding exception that cannot be handled in user code when
it receives non-UTF-8 messages, and the fact that the line length
checks occur prior to encoding, ensuring mid-codepoint truncation.
These issues also contributed to my revert.

>
> My points are, separately:
>
> 1. IRC is text. It's nonsensical to process it as bytes, because you can't 
>    process it as bytes.  This is separate from the question of "what encoding 
>    is IRC".

It's nonsensical that it be finally presented to a human as raw bytes.
I'm advocating for the decision to be made as late as possible.  That
doesn't mean we can't provide an easy-to-use recoding client that we
encourage people to turn to first.

> 2. UTF-8 is good. There should be gradual social pressure to use UTF-8 
>    everywhere (I'm a fan of http://utf8everywhere.org).  This is especially 
>    true in protocols like IRC and filenames where there's no mechanism to 
>    specify an encoding so that it can be correctly decoded.  Therefore:
>    1. an initial release which features UTF-8 only is fine; therefore 
>       there's no need to do a revert.
>    2. defaulting to UTF-8 is reasonable for the foreseeable future; users 
>       should only change this if they know that they want something unusual.
> 3. IRC is an incompatible and broken wasteland; thanks to your quantitative 
>    research we know exactly how broken.  Therefore:
>    1. "support alternate encodings" is a valuable feature.  Supporting point 
>       2.1, this feature can be added on at any later point, making a revert 
>       of the present implementation unnecessary.
>    2. We can, and should, just go ahead and add support for alternate 
>       (per-server, per-channel, per-user) default and fallback encodings.
>    3. We should always have a fallback encoding, since blowing up on 
>       "invalid" data on a protocol where there's no standard to say what is 
>       or isn't valid doesn't seem very helpful.
>
>

I appreciate the consistency of this, and agree the documented
preference should be a client implementation that assumes UTF-8.  But
we can't have *a* fallback encoding.  My encoding detector program
indicates that latin-1 is the second most popular encoding for
European IRC servers, but Russian servers I sampled (not in
netsplit.de's top 10) used a variety of Cyrillic encodings.

I also want to enable arbitrary recovery strategies for bad encodings.
For instance, in the case that an IRC client or server truncates a
code point at a line boundary, it might be the right idea to binary
search until the invalid byte sequence is found, and then exclude it.
It might be the right idea to buffer the message for a time in the
hopes that the codepoint got split over two lines.
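
The "buffer and wait" strategy maps onto the standard library's incremental decoders. This sketch assumes the two byte strings are the payloads of consecutive messages with the IRC framing already stripped, and that the sender really did continue the code point in the next message.

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-8")()

# The first payload ends with the lead byte of a two-byte sequence; the
# decoder holds it back instead of raising.
first = decoder.decode(b"caf\xc3")

# The second payload starts with the continuation byte, which completes
# the buffered code point.
second = decoder.decode(b"\xa9 ok")
```

Calling decoder.decode(b"", final=True) when no continuation arrives is where a dangling byte would finally surface as an error, which gives the caller a single place to apply whatever recovery policy it wants.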

And what if somebody wants to run another encoding survey?

I don't expect most users to do any of that, but *I* certainly want to
without having to copy and paste a bunch of code.

> >
> > When I received Arabic PDFs on a FAT16 USB drive with filenames in
> > CP1256, I had to switch mlterm to that particular code page to read
> > the directory listings so I could use convmv to convert them to UTF-8.
>
> There is no question that your life has been hard, and that a wide array of 
> people have made bad decisions that contribute to your difficulties. :-)
>

My real point was that dealing with bad encodings is not theoretical.
Nobody knew the encoding, by the way; they just knew the USB drive
worked for some of them and not others, and were resorting to printing
things out or taking screen shots.

That's the 

Re: [Twisted-Python] Twisted 16.6.0rc1 Release Candidate Announcement

2016-11-18 Thread Glyph Lefkowitz

> On Nov 18, 2016, at 12:13 AM, Mark Williams  wrote:
> 
> On Thu, Nov 17, 2016 at 11:00:13AM -0800, Glyph Lefkowitz wrote:
>> 
>> This doesn't appear to be an answer to the "is it a regression" question 
>> though ;-).  I'm still curious what you think there.
> 
> It's not a shipped feature so it can't be a regression.  But if the
> feature doesn't work it shouldn't be shipped.

"doesn't work" is a pretty black-and-white assessment.  Are you anticipating a 
problem with the way the interface is specified that it can't be easily changed?

I should say up front here that I think I was being too emphatic in my support 
for UTF-8.  We absolutely must support the ability to decode other encodings.  
I don't think that means we need support for access to raw bytes.

> I did consult the policy manual before opening revert PR.  Here's what
> seemed most relevant:
> 
> https://twistedmatrix.com/trac/wiki/ReviewProcess#Revertingachange
> 
> This, and the other revert documents, focus on test regressions.  But
> I opened the PR because of the above link's mention of "undesirable."
> Is there a better resource that explains when a revert is appropriate?

Test regressions are listed because they're unambiguously cause for a revert; 
"undesirable" is intentionally vague because we might decide to revert a thing 
for no reason.  I guess opening a PR for a discussion like this is reasonable.

This could be considered an incompatible interface change; I'm honestly not 
sure about the exact type signatures of various methods to say whether it is or 
not.

>> The _general_ issue is unfixable, except to use chardet upon encoding 
>> errors.  As far as I'm aware, IRC simply doesn't have the ability to specify 
>> an encoding.
> 
> IRCv3 (http://ircv3.net/) is attempting to mandate utf-8 for certain
> protocol elements (usernames and metadata).  But it needs to be
> backwards compatible, so it can't mandate it for all messages.  And it
> is not IRC as specified by RFC1459.  So no, no defined encoding.

Not only "no defined encoding" but also no mechanism like HTTP headers to say 
what the encoding is.

>> More importantly, IRC doesn't specify an encoding and it is also responsible 
>> for transmitting textual data intended to be input and consumed by humans.  
>> If you can't decode it, faithfully replicating the on-the-wire encoding is 
>> of limited utility.  You can't write any code to process the data.
> 
> I can write code that uses the encoding that makes sense for my use
> case.  I can't if we mandate utf-8, even when I receive perfectly
> valid IRC messages.

Sorry, I haven't been separating out my lines of reasoning clearly enough here.

My points are, separately:

1. IRC is text. It's nonsensical to process it as bytes, because you can't 
   process it as bytes.  This is separate from the question of "what encoding 
   is IRC".
2. UTF-8 is good. There should be gradual social pressure to use UTF-8 
   everywhere (I'm a fan of http://utf8everywhere.org).  This is especially 
   true in protocols like IRC and filenames where there's no mechanism to 
   specify an encoding so that it can be correctly decoded.  Therefore:
   1. an initial release which features UTF-8 only is fine; therefore there's 
      no need to do a revert.
   2. defaulting to UTF-8 is reasonable for the foreseeable future; users 
      should only change this if they know that they want something unusual.
3. IRC is an incompatible and broken wasteland; thanks to your quantitative 
   research we know exactly how broken.  Therefore:
   1. "support alternate encodings" is a valuable feature.  Supporting point 
      2.1, this feature can be added on at any later point, making a revert of 
      the present implementation unnecessary.
   2. We can, and should, just go ahead and add support for alternate 
      (per-server, per-channel, per-user) default and fallback encodings.
   3. We should always have a fallback encoding, since blowing up on "invalid" 
      data on a protocol where there's no standard to say what is or isn't 
      valid doesn't seem very helpful.

>> If chardet is installed, can it be specified as an encoding itself?  Like, 
>> b"garbage garbage".decode("chardet")?  This would make it possible to use 
>> without binding to the library; you just specify an encoding.  (The library 
>> is LGPL2.1 which makes it a problematic dependency for Twisted, even 
>> optionally.)
> 
> It does not, but if that makes it more generally usable you've given a
> great idea for my next PyPI package :)

Let me know :).

>> POSIX has an internally inconsistent model of how encodings work; they 
>> cannot possibly function correctly.
>> 
>> First off, let me put to rest the lie that paths are "really" bytes.  Paths 
>> are text.  They must be text because they have to transit through 
>> text-processing systems, such as windowing systems and terminal 
>> programs.  Users must be able to visually identify and select them, as text.
>> 
>> This is significant because certain operations on 

Re: [Twisted-Python] Twisted 16.6.0rc1 Release Candidate Announcement

2016-11-18 Thread Mark Williams
On Thu, Nov 17, 2016 at 11:00:13AM -0800, Glyph Lefkowitz wrote:
>
> This doesn't appear to be an answer to the "is it a regression" question 
> though ;-).  I'm still curious what you think there.

It's not a shipped feature so it can't be a regression.  But if the
feature doesn't work it shouldn't be shipped.

I did consult the policy manual before opening revert PR.  Here's what
seemed most relevant:

https://twistedmatrix.com/trac/wiki/ReviewProcess#Revertingachange

This, and the other revert documents, focus on test regressions.  But
I opened the PR because of the above link's mention of "undesirable."
Is there a better resource that explains when a revert is appropriate?

> The _general_ issue is unfixable, except to use chardet upon encoding errors. 
>  As far as I'm aware, IRC simply doesn't have the ability to specify an 
> encoding.

IRCv3 (http://ircv3.net/) is attempting to mandate utf-8 for certain
protocol elements (usernames and metadata).  But it needs to be
backwards compatible, so it can't mandate it for all messages.  And it
is not IRC as specified by RFC1459.  So no, no defined encoding.

> More importantly, IRC doesn't specify an encoding and it is also responsible 
> for transmitting textual data intended to be input and consumed by humans.  
> If you can't decode it, faithfully replicating the on-the-wire encoding is of 
> limited utility.  You can't write any code to process the data.

I can write code that uses the encoding that makes sense for my use
case.  I can't if we mandate utf-8, even when I receive perfectly
valid IRC messages.

>
> If chardet is installed, can it be specified as an encoding itself?  Like, 
> b"garbage garbage".decode("chardet")?  This would make it possible to use 
> without binding to the library; you just specify an encoding.  (The library 
> is LGPL2.1 which makes it a problematic dependency for Twisted, even 
> optionally.)
>

It does not, but if that makes it more generally usable you've given a
great idea for my next PyPI package :)

> POSIX has an internally inconsistent model of how encodings work; they cannot 
> possibly function correctly.
>
> First off, let me put to rest the lie that paths are "really" bytes.  Paths 
> are text.  They must be text because they have to transit through 
> text-processing systems, such as windowing systems and terminal 
> programs.  Users must be able to visually identify and select them, as text.
>
> This is significant because certain operations on paths-as-bytes will 
> inevitably fail.  You can't type an invalidly-encoded pathname in your shell.
> If two paths differ by an incorrectly-encoded character you won't be able to 
> visually distinguish between them without inspecting their contents.  This is 
> why OS X forces all paths to be UTF-8, and why paths are "really" unicode 
> (UCS-2, precisely) on Windows.
>
> There's POSIX metadata which allows you to select an encoding; locale.  But, 
> locale is per-process state, and, due to the fact that you can have multiple 
> filesystems mounted simultaneously, it's impossible for this metadata to 
> fully describe the state of any arbitrary path.  The standard metadata is 
> insufficient.  This is why UI toolkits like GTK+ have adopted the policy of 
> "ignore the locale, paths are UTF-8, deal with it ".  As far back as GTK2, 
> non-utf-8 path selection has been deprecated: 
>   
> >.
>

When I received Arabic PDFs on a FAT16 USB drive with filenames in
CP1256, I had to switch mlterm to that particular code page to read
the directory listings so I could use convmv to convert them to UTF-8.
I'll note that this was impossible to do with a GTK-based tool.

Opinionated software is fine when it operates at the point of user
interpretation.

mlterm had to decode the stuff as unicode so X could display the
graphemes.  But if Linux's FAT16 implementation decided that we should
all quit whining and use UTF-8, even though no other FAT16
implementation requires this, it wouldn't have mattered what mlterm
could or couldn't do and I would have lost those files.  And it would
have been incredibly confounding to me, because everything would have
agreed that I had a FAT16 partition, but only Linux would have
mysteriously failed to read it.

Similarly, Twisted provides an IRC *library*.  It's a Python API, not
irssi or Textual.  The ultimate consumer of what passes through it may
be a human, but the next consumer might not be.  What if I want to
write a bot that bridges two IRC networks?  What if I want to
dump the raw IRC data to a file so I can train a tensorflow version of
chardet?  There's nothing in the IRC specification that prevents me
from doing this, but there will be something in Twisted's
implementation that does.

> While a mis-encoded path is a failure, there are 

Re: [Twisted-Python] Twisted 16.6.0rc1 Release Candidate Announcement

2016-11-17 Thread Glyph Lefkowitz

> On Nov 17, 2016, at 6:43 AM, Mark Williams wrote:
> 
> On Wed, Nov 16, 2016 at 11:22:49PM -0800, Glyph Lefkowitz wrote:
>> However; is it really a regression to have py3 support for Words that just 
>> doesn't support other encodings yet?  It strikes me that this is just a bug, 
>> and that we should just fall back from UTF-8 to latin-1 in this scenario.  
>> But adding that fallback is a small additional fix (perhaps one that should 
>> be slated for 16.6.0 if you want to make it).
> 
> Falling back to latin-1 will address the most obvious issue exposed by
> the client in the re-opened ticket.  It will not fix the general issue.

This doesn't appear to be an answer to the "is it a regression" question though 
;-).  I'm still curious what you think there.

The _general_ issue is unfixable, except to use chardet upon encoding errors.  
As far as I'm aware, IRC simply doesn't have the ability to specify an encoding.

More importantly, IRC doesn't specify an encoding and it is also responsible 
for transmitting textual data intended to be input and consumed by humans.  If 
you can't decode it, faithfully replicating the on-the-wire encoding is of 
limited utility.  You can't write any code to process the data.

> Note that my sample was heavily biased towards European servers.
> Other IRC servers in other regions might prefer a different 8-bit
> encoding, like windows-1251 or Big5.  And often a single server will
> see a long tail (or at least a tail) of different 8-bit encodings.
> Listing all channels on a server, as the example script does, cannot
> be done with an implementation that decodes input as text prior to
> parsing it.  It's even possible to use chardet to detect encodings.

If chardet is installed, can it be specified as an encoding itself?  Like, 
b"garbage garbage".decode("chardet")?  This would make it possible to use 
without binding to the library; you just specify an encoding.  (The library is 
LGPL2.1 which makes it a problematic dependency for Twisted, even optionally.)
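To make the "specify it as an encoding" idea concrete, here is a rough sketch of how a detector could be exposed through Python's codec registry.  The codec name "guess" and the `guess_encoding` function are made up for illustration; `guess_encoding` is a trivial stand-in that could be swapped for a real detector such as chardet, which as far as I know does not register a codec like this itself.

```python
import codecs


def guess_encoding(data):
    """Hypothetical stand-in for a real detector such as chardet."""
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # latin-1 accepts every byte value, so it is a safe last resort.
        return "latin-1"


def _decode(data, errors="strict"):
    # The codec machinery may hand us a memoryview, so normalise to bytes.
    data = bytes(data)
    return data.decode(guess_encoding(data), errors), len(data)


def _encode(text, errors="strict"):
    # Detection only makes sense for decoding; encode as UTF-8 (assumption).
    return text.encode("utf-8", errors), len(text)


def _search(name):
    # Called by the codec registry with the normalised codec name.
    if name == "guess":
        return codecs.CodecInfo(encode=_encode, decode=_decode, name="guess")
    return None


codecs.register(_search)

decoded = b"caf\xe9".decode("guess")
```

With that registered, callers only name an encoding and never import the detection library directly.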

> IRC's encoding situation mirrors that of file systems on POSIX.  A given
> path's components can be in multiple encodings.  I believe at least
> part of the reason FilePath's paths are bytes, even when
> surrogateescape exists, is that Unicode paths on POSIX systems would
> make FilePath unusable for perfectly valid use cases.  We can pretend
> that IRC has a defined encoding, but doing so will make it unusable for
> perfectly valid use cases.

Here we go :-).

POSIX has an internally inconsistent model of how encodings work; they cannot 
possibly function correctly.

First off, let me put to rest the lie that paths are "really" bytes.  Paths are 
text.  They must be text because they have to transit through text-processing 
systems, such as windowing systems and terminal programs.  Users must be 
able to visually identify and select them, as text.

This is significant because certain operations on paths-as-bytes will 
inevitably fail.  You can't type an invalidly-encoded pathname in your shell.  
If two paths differ by an incorrectly-encoded character you won't be able to 
visually distinguish between them without inspecting their contents.  This is 
why OS X forces all paths to be UTF-8, and why paths are "really" unicode 
(UCS-2, precisely) on Windows.

There's POSIX metadata which allows you to select an encoding; locale.  But, 
locale is per-process state, and, due to the fact that you can have multiple 
filesystems mounted simultaneously, it's impossible for this metadata to fully 
describe the state of any arbitrary path.  The standard metadata is 
insufficient.  This is why UI toolkits like GTK+ have adopted the policy of 
"ignore the locale, paths are UTF-8, deal with it ".  As far back as GTK2, 
non-utf-8 path selection has been deprecated: 
>.

While a mis-encoded path is a failure, there are ways to treat paths as a data 
structure to allow for only partial failure.  They're a data structure because 
they must be in an encoding with no NULLs, which encode SOLIDUS as the octet 
0x2F, and so you can fail on each individual path component; if you're lucky 
you don't need to present all the components in the path to manipulate it.

We don't do this in Twisted right now (as I was somewhat disappointed to 
discover while writing this), but we should, and more importantly we could; 
FilePath(b"\xff").child("valid").asTextMode().basename() could return u"valid" 
rather than returning an encoding error.
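As a rough illustration of that per-component failure (this is not FilePath code, and `decode_components` is a name invented for the sketch), splitting on the 0x2F octet first and decoding afterwards lets one bad component fail without taking the rest of the path with it:

```python
def decode_components(path, encoding="utf-8"):
    """Decode each component of a bytes path on its own.

    Components that are not valid in `encoding` come back as None
    instead of making the whole path fail.  Assumes an encoding in
    which SOLIDUS is the single octet 0x2F.
    """
    decoded = []
    for component in path.split(b"/"):
        try:
            decoded.append(component.decode(encoding))
        except UnicodeDecodeError:
            decoded.append(None)
    return decoded


parts = decode_components(b"\xff/valid")
```

Here the last component is still usable as text even though the first one is not.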

To bring all this back to IRC though:

Mis-encoded IRC messages are not data structures; they're just strings.  
There's no opportunity for partial recovery beyond chardet and mojibake.  In 
most cases, partial recovery requires configuration.  Per-channel encodings, 
for example, or per-user, which 

Re: [Twisted-Python] Twisted 16.6.0rc1 Release Candidate Announcement

2016-11-17 Thread Mark Williams
On Wed, Nov 16, 2016 at 11:22:49PM -0800, Glyph Lefkowitz wrote:
> However; is it really a regression to have py3 support for Words that just 
> doesn't support other encodings yet?  It strikes me that this is just a bug, 
> and that we should just fall back from UTF-8 to latin-1 in this scenario.  
> But adding that fallback is a small additional fix (perhaps one that should 
> be slated for 16.6.0 if you want to make it).

Falling back to latin-1 will address the most obvious issue exposed by
the client in the re-opened ticket.  It will not fix the general issue.

Note that my sample was heavily biased towards European servers.
Other IRC servers in other regions might prefer a different 8-bit
encoding, like windows-1251 or Big5.  And often a single server will
see a long tail (or at least a tail) of different 8-bit encodings.
Listing all channels on a server, as the example script does, cannot
be done with an implementation that decodes input as text prior to
parsing it.  It's even possible to use chardet to detect encodings.
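To sketch what I mean by parsing before decoding (a simplified parser written for this mail, not the twisted.words one; `parse_line` is a made-up name), the message can be split on bytes and each parameter decoded later with whatever encoding suits it:

```python
def parse_line(line):
    """Split a raw IRC line (without CR LF) into prefix, command, params.

    Everything stays bytes; callers decode individual fields afterwards.
    """
    prefix = None
    if line.startswith(b":"):
        prefix, _, line = line[1:].partition(b" ")
    # A " :" introduces the trailing parameter, which may contain spaces.
    head, sep, trailing = line.partition(b" :")
    parts = head.split()
    command, params = parts[0], parts[1:]
    if sep:
        params.append(trailing)
    return prefix, command, params


# A channel-list reply whose channel name and topic are not valid UTF-8.
prefix, command, params = parse_line(
    b":irc.example.net 322 me #caf\xe9 3 :topic \xff")
channel = params[1].decode("latin-1")
```

The parse succeeds regardless of encoding, and the choice of "latin-1" for the channel name is made by the caller, per field.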

IRC's encoding situation mirrors that of file systems on POSIX.  A given
path's components can be in multiple encodings.  I believe at least
part of the reason FilePath's paths are bytes, even when
surrogateescape exists, is that Unicode paths on POSIX systems would
make FilePath unusable for perfectly valid use cases.  We can pretend
that IRC has a defined encoding, but doing so will make it unusable for
perfectly valid use cases.
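For anyone unfamiliar with surrogateescape, a minimal sketch of what it does and doesn't buy you (the `roundtrip` helper is just for illustration): undecodable bytes survive a round trip, but the intermediate string contains lone surrogates and is not something you can treat as ordinary text.

```python
def roundtrip(raw):
    """Decode with surrogateescape, then re-encode the same way."""
    text = raw.decode("utf-8", "surrogateescape")
    return text, text.encode("utf-8", "surrogateescape")


raw = b"caf\xe9"  # latin-1 bytes, not valid UTF-8
text, back = roundtrip(raw)

# The escaped string cannot be encoded strictly, so it is not real text.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```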

> -glyph



Re: [Twisted-Python] Twisted 16.6.0rc1 Release Candidate Announcement

2016-11-16 Thread Amber "Hawkie" Brown

> On 17 Nov. 2016, at 18:50, Amber Hawkie Brown wrote:
> 
>> 
>> On 17 Nov. 2016, at 18:22, Glyph Lefkowitz wrote:
>> 
>>> 
>>> On Nov 16, 2016, at 11:15 PM, Mark Williams wrote:
>>> 
>>> On Thu, Nov 10, 2016 at 07:56:52PM +1100, Amber "Hawkie" Brown wrote:
>>>> - Python 3 support for Words' IRC support and twisted.protocols.sip among 
>>>> some smaller modules,
>>> 
>>> I have opened a PR to revert this:
>>> 
>>> https://github.com/twisted/twisted/pull/593 
>>> 
>>> 
>>> A full explanation is here:
>>> 
>>> https://twistedmatrix.com/trac/ticket/6320#comment:16
>>> 
>>> In summary: a valid IRC message will cause a UnicodeDecodeError within
>>> the event loop that a user cannot handle or avoid, and all length
>>> checks on line sizes are wrong because they occur prior to encoding to
>>> utf-8.
>> 
>> Reverts should be commits that go straight to trunk and reopen tickets, per 
>> the current process.
>> 
>> However; is it really a regression to have py3 support for Words that just 
>> doesn't support other encodings yet?  It strikes me that this is just a bug, 
>> and that we should just fall back from UTF-8 to latin-1 in this scenario.  
>> But adding that fallback is a small additional fix (perhaps one that should 
>> be slated for 16.6.0 if you want to make it).
>> 
>> -glyph
> 
> Yeah, this is just a plain old bug. Bugs in new features (where a module 
> being on Python 3 counts as one to me) aren't regressions; we sometimes fix 
> them in pre if there's time/other stuff is getting fixed, but this one will 
> just be a known bug until 16.7 in December.
> 
> - Amber

(or a 16.6.1)


Re: [Twisted-Python] Twisted 16.6.0rc1 Release Candidate Announcement

2016-11-16 Thread Amber "Hawkie" Brown

> On 17 Nov. 2016, at 18:22, Glyph Lefkowitz wrote:
> 
>> 
>> On Nov 16, 2016, at 11:15 PM, Mark Williams wrote:
>> 
>> On Thu, Nov 10, 2016 at 07:56:52PM +1100, Amber "Hawkie" Brown wrote:
>>> - Python 3 support for Words' IRC support and twisted.protocols.sip among 
>>> some smaller modules,
>> 
>> I have opened a PR to revert this:
>> 
>> https://github.com/twisted/twisted/pull/593
>> 
>> A full explanation is here:
>> 
>> https://twistedmatrix.com/trac/ticket/6320#comment:16
>> 
>> In summary: a valid IRC message will cause a UnicodeDecodeError within
>> the event loop that a user cannot handle or avoid, and all length
>> checks on line sizes are wrong because they occur prior to encoding to
>> utf-8.
> 
> Reverts should be commits that go straight to trunk and reopen tickets, per 
> the current process.
> 
> However; is it really a regression to have py3 support for Words that just 
> doesn't support other encodings yet?  It strikes me that this is just a bug, 
> and that we should just fall back from UTF-8 to latin-1 in this scenario.  
> But adding that fallback is a small additional fix (perhaps one that should 
> be slated for 16.6.0 if you want to make it).
> 
> -glyph

Yeah, this is just a plain old bug. Bugs in new features (where a module being 
on Python 3 counts as one to me) aren't regressions; we sometimes fix them in 
pre if there's time/other stuff is getting fixed, but this one will just be a 
known bug until 16.7 in December.

- Amber


Re: [Twisted-Python] Twisted 16.6.0rc1 Release Candidate Announcement

2016-11-16 Thread Glyph Lefkowitz

> On Nov 16, 2016, at 11:15 PM, Mark Williams wrote:
> 
> On Thu, Nov 10, 2016 at 07:56:52PM +1100, Amber "Hawkie" Brown wrote:
>> - Python 3 support for Words' IRC support and twisted.protocols.sip among 
>> some smaller modules,
> 
> I have opened a PR to revert this:
> 
> https://github.com/twisted/twisted/pull/593
> 
> A full explanation is here:
> 
> https://twistedmatrix.com/trac/ticket/6320#comment:16
> 
> In summary: a valid IRC message will cause a UnicodeDecodeError within
> the event loop that a user cannot handle or avoid, and all length
> checks on line sizes are wrong because they occur prior to encoding to
> utf-8.

Reverts should be commits that go straight to trunk and reopen tickets, per the 
current process.

However; is it really a regression to have py3 support for Words that just 
doesn't support other encodings yet?  It strikes me that this is just a bug, 
and that we should just fall back from UTF-8 to latin-1 in this scenario.  But 
adding that fallback is a small additional fix (perhaps one that should be 
slated for 16.6.0 if you want to make it).
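For concreteness, the fallback I have in mind is roughly the sketch below.  The helper name `decode_irc_bytes` is made up for illustration and is not part of twisted.words.

```python
def decode_irc_bytes(data):
    """Decode as UTF-8, falling back to latin-1 (hypothetical helper)."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # latin-1 assigns a character to every byte value, so this
        # always succeeds, though the result may be mojibake.
        return data.decode("latin-1")


utf8_text = decode_irc_bytes(b"caf\xc3\xa9")
latin1_text = decode_irc_bytes(b"caf\xe9")
```

Because the second decode cannot raise, no UnicodeDecodeError escapes into the event loop; the cost is that input in some other 8-bit encoding is silently misread.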

-glyph




Re: [Twisted-Python] Twisted 16.6.0rc1 Release Candidate Announcement

2016-11-16 Thread Mark Williams
On Thu, Nov 10, 2016 at 07:56:52PM +1100, Amber "Hawkie" Brown wrote:
> - Python 3 support for Words' IRC support and twisted.protocols.sip among 
> some smaller modules,

I have opened a PR to revert this:

https://github.com/twisted/twisted/pull/593

A full explanation is here:

https://twistedmatrix.com/trac/ticket/6320#comment:16

In summary: a valid IRC message will cause a UnicodeDecodeError within
the event loop that a user cannot handle or avoid, and all length
checks on line sizes are wrong because they occur prior to encoding to
utf-8.
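To illustrate the length problem (a sketch, not the actual twisted.words code; `fits_on_wire` is a name I made up), counting characters before encoding undercounts whenever the text is not pure ASCII:

```python
MAX_LINE = 512  # RFC 1459 limit, which includes the trailing CR LF


def fits_on_wire(text, encoding="utf-8"):
    """Check the encoded size, since that is what the server sees."""
    return len(text.encode(encoding)) + 2 <= MAX_LINE


message = "\u00e9" * 300          # 300 characters
char_count = len(message)          # looks short enough
byte_count = len(message.encode("utf-8"))  # two bytes per character
ok = fits_on_wire(message)
```

A check on `char_count` would let this message through even though its encoded form is over the limit.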



[Twisted-Python] Twisted 16.6.0rc1 Release Candidate Announcement

2016-11-10 Thread Amber "Hawkie" Brown
Hi everyone, here's a Twisted release to hopefully lift your spirits a little. 
It's not a big one, but it's got some goodies regardless.

It features:

- The ability to use "python -m twisted" to call the new `twist` runner,
- More reliable tests from a more reliable implementation of some things, like 
IOCP,
- Fixes for async/await & twisted.internet.defer.ensureDeferred, meaning it's 
getting closer to prime time!
- ECDSA support in Conch & ckeygen (which has also been ported to Python 3),
- Python 3 support for Words' IRC support and twisted.protocols.sip among some 
smaller modules,
- Some HTTP/2 server optimisations,
- and a few bugfixes to boot!

You can get the tarball and the NEWS file at
https://twistedmatrix.com/Releases/rc/16.6.0rc1/, or you can try it out from
PyPI:

python -m pip install Twisted==16.6.0rc1

Please test it, and let me know how your applications fare, good or bad! If 
nothing comes up, I will release 16.6.0 next week.

- Amber