Re: [Twisted-Python] Twisted 16.6.0rc1 Release Candidate Announcement

2016-11-22 Thread Glyph Lefkowitz

> On Nov 22, 2016, at 21:36, Mark Williams wrote:
> 
> On Tuesday, November 22, 2016, Glyph Lefkowitz wrote:
> 
> Okay.  So.
> 
> The rule for reverts like this is: if you do something today, which is 
> correct usage of the API and produces an observably correct result, will that 
> be broken in the future if we fix it?  If so, then we need to revert because 
> the interface as released is unsupportable.
> 
> As it stands, we have a matrix of 4 behaviors:
> 
> 
>          bytes     text(ascii)   text(nonascii)
>   py2    works     works         UnicodeDecodeError
>   py3    garbage   works         works
> 
> This... is actually... fine, surprisingly.
>  
> Given that matrix, how would this work on Python 2 and 3:
> 
> https://github.com/buildbot/buildbot/blob/40d5dd3d101704aa8db582e306b3c6cf7921c23c/master/buildbot/reporters/irc.py#L67-L68
>  
> 
It wouldn't work on Python 3 yet.  But that's fine: the point is that it 
wouldn't work!  Buildbot can just block porting on that bug.

> And how would that code not have to change if a future release accommodates 
> Unicode on Python 2 or bytes on Python 3?

Because it will get broken / undefined behavior on the current implementation.  
We can always fix broken behavior!  What we can't do is fix broken behavior 
that also breaks other correct behavior or workarounds.  But in this case, 
there's a broken behavior (which we have on trunk) and a correct behavior 
(which we can implement in the future) and no way to coerce the broken behavior 
to do something valid via public API.

-glyph
___
Twisted-Python mailing list
Twisted-Python@twistedmatrix.com
http://twistedmatrix.com/cgi-bin/mailman/listinfo/twisted-python


Re: [Twisted-Python] Twisted 16.6.0rc1 Release Candidate Announcement

2016-11-22 Thread Glyph Lefkowitz

> On Nov 22, 2016, at 21:35, John Santos  wrote:
> 
> Shouldn't this be "if you pass non-ascii text on py2, you'll get ..." ?

Yes. Thanks for that catch :).

-g


Re: [Twisted-Python] Twisted 16.6.0rc1 Release Candidate Announcement

2016-11-22 Thread John Santos


Been lurking here, no cows in the fire, no irons in the race, or 
whatever, except wanting Twisted to be perfect and easy to use and being 
perennially confused by text encoding, but I did notice this:


On 11/22/2016 9:03 PM, Glyph Lefkowitz wrote:

[...]


Okay.  So.

The rule for reverts like this is: if you do something today, which is 
correct usage of the API and produces an observably correct result, 
will that be broken in the future if we fix it?  If so, then we need 
to revert because the interface as released is unsupportable.


As it stands, we have a matrix of 4 behaviors:



         bytes     text(ascii)   text(nonascii)
  py2    works     works         UnicodeDecodeError
  py3    garbage   works         works


This... is actually... fine, surprisingly.

The /right/ thing to do is to write code that passes text all the 
time.  If you do that right now, it'll work on py3 and raise an 
exception on py2, unless it /happens/ to be ASCII, in which case it'll 
work.


If you write code that passes bytes on py3, it'll just be garbage.
But, we want to deprecate that anyway, and you can't get correct, 
usable behavior out of it, no matter what workarounds you stuff in; so 
it's a bug, and can be fixed like any bug.


Similarly if you pass non-ascii text on py3, you'll get a 
UnicodeDecodeError.


Shouldn't this be "if you pass non-ascii text on *py2*, you'll get ..." ?

[...]

-glyph



Pedantically yours,

--
John Santos
Evans Griffiths & Hart, Inc.
781-861-0670 ext 539



Re: [Twisted-Python] Twisted 16.6.0rc1 Release Candidate Announcement

2016-11-22 Thread Mark Williams
On Tuesday, November 22, 2016, Glyph Lefkowitz wrote:
>
>
> Okay.  So.
>
> The rule for reverts like this is: if you do something today, which is
> correct usage of the API and produces an observably correct result, will
> that be broken in the future if we fix it?  If so, then we need to revert
> because the interface as released is unsupportable.
>
> As it stands, we have a matrix of 4 behaviors:
>
>
>          bytes     text(ascii)   text(nonascii)
>   py2    works     works         UnicodeDecodeError
>   py3    garbage   works         works
>
> This... is actually... fine, surprisingly.
>

Given that matrix, how would this work on Python 2 and 3:

https://github.com/buildbot/buildbot/blob/40d5dd3d101704aa8db582e306b3c6cf7921c23c/master/buildbot/reporters/irc.py#L67-L68

And how would that code not have to change if a future release accommodates
Unicode on Python 2 or bytes on Python 3?


Re: [Twisted-Python] Twisted 16.6.0rc1 Release Candidate Announcement

2016-11-22 Thread Glyph Lefkowitz

> On Nov 22, 2016, at 20:27, Mark Williams  wrote:
> 
> On Tue, Nov 22, 2016 at 06:31:45PM -0500, Glyph Lefkowitz wrote:
>> 
>> OK, this whole time I thought we were talking about a sensible application 
>> of text_type to the API, perhaps with some leniency for bytes-ish-ness on 
>> python 2.  I haven't reviewed the PR, I was just responding to the concerns 
>> as raised on the list.
> 
> Sorry - I didn't mean to steer this towards API bike shedding.
> 
>> If it's just randomly encoding on one version and not the other, and correct 
>> usage of the API depends on *users* doing 'if PY2:' in their own code, then 
>> perhaps Mark's concern is indeed well-founded and we should roll it back 
>> before 16.6.
>> 
> 
> Tristan's exactly right.  Furthermore, if we decide to make IRCClient
> call its various command methods with unicode strings on Python 2,
> we'll be breaking backwards compatibility.  This is what I meant when
> I wrote:
> 
> On Nov 20, 2016, at 19:35, Mark Williams  wrote:
>> 
>> Yes.  Here's the lede: IRCClient should deal in bytes and we should
>> introduce a ProtocolWrapper-like thing that encodes and decodes
>> command prefixes and parameters.  It should implement an interface,
>> and we can start with an implementation that only knows about UTF-8.
>> The obvious advantage of this is that you can more easily write
>> IRCClients that work on both Python 2 and 3.
>> 
> 
> But it totally wasn't clear - sorry!
> 
> Of course, I also want IRC client implementation that lets me get at
> bytes, but that's a discussion I'll move to a new thread.
> 
> Given the inconsistency between Python 2 and Python 3, do we proceed
> with the revert?

Okay.  So.

The rule for reverts like this is: if you do something today, which is correct 
usage of the API and produces an observably correct result, will that be broken 
in the future if we fix it?  If so, then we need to revert because the 
interface as released is unsupportable.

As it stands, we have a matrix of 4 behaviors:


         bytes     text(ascii)   text(nonascii)
  py2    works     works         UnicodeDecodeError
  py3    garbage   works         works

This... is actually... fine, surprisingly.

The right thing to do is to write code that passes text all the time.  If you 
do that right now, it'll work on py3 and raise an exception on py2, unless it 
happens to be ASCII, in which case it'll work.

If you write code that passes bytes on py3, it'll just be garbage.  But, we 
want to deprecate that anyway, and you can't get correct, usable behavior out 
of it, no matter what workarounds you stuff in; so it's a bug, and can be fixed 
like any bug.

Similarly if you pass non-ascii text on py3, you'll get a UnicodeDecodeError.

This is not a good situation, but it's totally fixable without breaking the 
interface.  We just fix the py2 version to accept text_type as well, and if 
Mark sneaks in a patch that makes py3 do the right thing with bytes, well, I 
don't know that I can stop him.
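
A minimal sketch of what "accept text_type as well" could look like, written for Python 3 only. The helper names `to_wire` and `join_line` and the UTF-8 default are assumptions made for illustration; this is not Twisted's implementation.

```python
def to_wire(value, encoding="utf-8"):
    """Return the bytes to send for one command parameter.

    Hypothetical helper: text is encoded, bytes pass through unchanged.
    """
    if isinstance(value, bytes):
        return value
    if isinstance(value, str):
        return value.encode(encoding)
    raise TypeError("expected bytes or text, got %s" % (type(value).__name__,))


def join_line(channel):
    # Build the raw JOIN line from either representation of the channel name.
    return b"JOIN " + to_wire(channel) + b"\r\n"
```

With a helper like this, both cells of the matrix that currently misbehave would produce the same octets as the cells that work.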

More importantly, it would probably be a smaller change to fix the methods (we 
could even fix them one at a time; say, action, join, etc) than to un-port and 
re-port the whole thing.

So: yes, it's broken, and in a worse way than I thought.  To get it to the 
point where we can actually implement logic consistently between two versions, 
we need to add a flag to IRCClient's constructor which is default-false on py2 
and default-true on py3 which says "give me text", so that callbacks like 
privmsg and joined can start receiving text_type on py2 as well as py3; right 
now it has to receive str because they've previously received str.  But that's 
a separate issue.
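
The "give me text" flag could be sketched roughly as follows. `TextOptInClient`, `dispatch_privmsg` and the `text` and `encoding` parameters are hypothetical names, not part of IRCClient; the only point is that the default tracks the Python version, so existing callbacks keep receiving what they received before.

```python
import sys


class TextOptInClient(object):
    """Hypothetical stand-in for a client with a "give me text" flag."""

    def __init__(self, text=None, encoding="utf-8"):
        if text is None:
            # Default-false on py2, default-true on py3.
            text = sys.version_info[0] >= 3
        self.text = text
        self.encoding = encoding
        self.received = []

    def dispatch_privmsg(self, user, channel, message):
        # Arguments arrive from the wire as bytes.
        args = (user, channel, message)
        if self.text:
            args = tuple(a.decode(self.encoding) for a in args)
        self.privmsg(*args)

    def privmsg(self, user, channel, message):
        # Callback a subclass would override; here it only records the call.
        self.received.append((user, channel, message))
```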

I am open to the idea that I have evaluated this incorrectly though, since this 
has been possibly the most confusing change since 
https://twistedmatrix.com/trac/ticket/411.  But as of right now I still think 
we shouldn't revert.

-glyph



Re: [Twisted-Python] Twisted 16.6.0rc1 Release Candidate Announcement

2016-11-22 Thread Mark Williams
On Tue, Nov 22, 2016 at 06:31:45PM -0500, Glyph Lefkowitz wrote:
>
> OK, this whole time I thought we were talking about a sensible application of 
> text_type to the API, perhaps with some leniency for bytes-ish-ness on python 
> 2.  I haven't reviewed the PR, I was just responding to the concerns as 
> raised on the list.

Sorry - I didn't mean to steer this towards API bike shedding.

> If it's just randomly encoding on one version and not the other, and correct 
> usage of the API depends on *users* doing 'if PY2:' in their own code, then 
> perhaps Mark's concern is indeed well-founded and we should roll it back 
> before 16.6.
>

Tristan's exactly right.  Furthermore, if we decide to make IRCClient
call its various command methods with unicode strings on Python 2,
we'll be breaking backwards compatibility.  This is what I meant when
I wrote:

On Nov 20, 2016, at 19:35, Mark Williams  wrote:
>
> Yes.  Here's the lede: IRCClient should deal in bytes and we should
> introduce a ProtocolWrapper-like thing that encodes and decodes
> command prefixes and parameters.  It should implement an interface,
> and we can start with an implementation that only knows about UTF-8.
> The obvious advantage of this is that you can more easily write
> IRCClients that work on both Python 2 and 3.
>

But it totally wasn't clear - sorry!

Of course, I also want IRC client implementation that lets me get at
bytes, but that's a discussion I'll move to a new thread.
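
The recoding layer proposed in the quoted message could be sketched like this. `UTF8ParameterCodec` and its two methods are hypothetical names chosen for illustration; the proposal only says the layer should implement an interface and start with UTF-8.

```python
class UTF8ParameterCodec(object):
    """Hypothetical recoding layer: text for the application, bytes on the wire."""

    encoding = "utf-8"

    def encode(self, command, params):
        # Text from the application becomes bytes for a bytes-only client.
        return command.encode("ascii"), [p.encode(self.encoding) for p in params]

    def decode(self, prefix, params):
        # Bytes from a bytes-only client become text for the application.
        return prefix.decode(self.encoding), [p.decode(self.encoding) for p in params]
```

Another encoding would then be a second class with the same two methods, which is what makes the layer swappable.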

Given the inconsistency between Python 2 and Python 3, do we proceed
with the revert?

-Mark



Re: [Twisted-Python] Twisted 16.6.0rc1 Release Candidate Announcement

2016-11-22 Thread Glyph Lefkowitz

> On Nov 22, 2016, at 18:27, Tristan Seligmann wrote:
> 
> On Wed, 23 Nov 2016 at 01:26 Tristan Seligmann wrote:
> if PY3:
> 
> Argh, the above should be if PY2 of course.

OK, this whole time I thought we were talking about a sensible application of 
text_type to the API, perhaps with some leniency for bytes-ish-ness on python 
2.  I haven't reviewed the PR, I was just responding to the concerns as raised 
on the list.

If it's just randomly encoding on one version and not the other, and correct 
usage of the API depends on *users* doing 'if PY2:' in their own code, then 
perhaps Mark's concern is indeed well-founded and we should roll it back before 
16.6.

-glyph



Re: [Twisted-Python] Twisted 16.6.0rc1 Release Candidate Announcement

2016-11-22 Thread Tristan Seligmann
On Wed, 23 Nov 2016 at 01:14 Tristan Seligmann wrote:

> On Tue, 22 Nov 2016 at 23:37 Glyph Lefkowitz wrote:
>
>
> This is the part that I'm worried about.  It kinda seems like we're moving
> toward "native string" being the type used in IRCClient, and *that* is
> capital-W Wrong.  Native strings are for Python-native types only, i.e.
> docstrings and method names.
>
>
> Unless I'm misunderstanding, we're not "moving towards" it, we have *already
> arrived*: IRCClient deals in str (bytes) on Python 2, and str (unicode)
> on Python 3. Even if we want a unicode API, having it only exist on Python
> 3 seems incredibly confusing from a user standpoint, and would appear to
> require some absurd contortions to write client code that behaves
> approximately the same on both Python 2 and 3.
>

For example, as far as I can tell, the only way to write code to join a
channel named #tëst (UTF-8 encoded) is:

channel = u'#tëst'
if PY3:
    channel = channel.encode('utf-8')
client.join(channel)

On Python 3, client.join(b'#t\xc3\xabst') will try to send JOIN b'#t\xc3\xabst',
which is garbage, whereas on Python 2, client.join(u'#t\xebst') will
produce a UnicodeEncodeError.
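
A short illustration of where the Python 3 garbage comes from, assuming the command line is assembled with text formatting (which the description above suggests, but which is not checked against the implementation here). Python 3 only.

```python
channel_bytes = "#t\xebst".encode("utf-8")

# Formatting bytes into text on Python 3 embeds the repr, b'' prefix included,
# so the channel name that goes out is not the one that was intended.
line = "JOIN %s" % (channel_bytes,)
```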


Re: [Twisted-Python] Twisted 16.6.0rc1 Release Candidate Announcement

2016-11-22 Thread Tristan Seligmann
On Wed, 23 Nov 2016 at 01:26 Tristan Seligmann wrote:

> if PY3:
>

Argh, the above should be if PY2 of course.


Re: [Twisted-Python] Twisted 16.6.0rc1 Release Candidate Announcement

2016-11-22 Thread Tristan Seligmann
On Tue, 22 Nov 2016 at 23:37 Glyph Lefkowitz wrote:

>
> This is the part that I'm worried about.  It kinda seems like we're moving
> toward "native string" being the type used in IRCClient, and *that* is
> capital-W Wrong.  Native strings are for Python-native types only, i.e.
> docstrings and method names.
>

Unless I'm misunderstanding, we're not "moving towards" it, we have *already
arrived*: IRCClient deals in str (bytes) on Python 2, and str (unicode) on
Python 3. Even if we want a unicode API, having it only exist on Python 3
seems incredibly confusing from a user standpoint, and would appear to
require some absurd contortions to write client code that behaves
approximately the same on both Python 2 and 3.


Re: [Twisted-Python] Twisted 16.6.0rc1 Release Candidate Announcement

2016-11-22 Thread Glyph Lefkowitz

> On Nov 20, 2016, at 19:35, Mark Williams  wrote:
> 
> On Fri, Nov 18, 2016 at 05:36:16PM -0800, Glyph Lefkowitz wrote:
>> 
>> "doesn't work" is a pretty black-and-white assessment.  Are you anticipating 
>> a problem with the way the interface is specified that it can't be easily 
>> changed?
>> 
> 
> Yes.  Here's the lede:

Thank you for summarizing!  Point by point, here's my position:

> IRCClient should deal in bytes and we should introduce a ProtocolWrapper-like 
> thing that encodes and decodes
> command prefixes and parameters.

I disagree.  Any user-facing API should deal in unicode objects.  (There is one 
caveat here; there really should be a separate layer for dealing with text; 
IRCClient being a subclassing-based API pollutes the whole issue.  But that API 
shouldn't be public, so this is largely minutiae; the "right" answer here has 
nothing to do with bytes or text and everything to do with adopting .)

> It should implement an interface, and we can start with an implementation 
> that only knows about UTF-8.

We should have the implementation initially know about UTF-8, yes.

> The obvious advantage of this is that you can more easily write IRCClients 
> that work on both Python 2 and 3.

This is the part that I'm worried about.  It kinda seems like we're moving 
toward "native string" being the type used in IRCClient, and that is capital-W 
Wrong.  Native strings are for Python-native types only, i.e. docstrings and 
method names.


> I'm also not entirely sure of the consequences of this interface
> change.  I think it deserves more thought before it becomes an API
> that we have to support.  This is the primary reason I opened the
> revert PR.

One of the things that's informing my decision is that IRCClient is already an 
incredibly ill-defined API that probably needs to be deprecated and overhauled 
at some point.  However, in the intervening (what will almost certainly be a) 
decade, I'd like it to work on Python 3.

> I'm more precisely worried about the fact that the implementation
> raises a decoding exception that cannot be handled in user code when
> it receives non-UTF-8 messages,

The right way to deal with this is twofold:

1. Add the ability to specify both the "encoding" and the "errors" of the
   relevant codec, so that we can choose error handling strategies.
2. (Potentially, if you have very nuanced requirements for dealing with weird
   encodings) write a codec that logs and handles its own errors.  (We
   probably shouldn't be logging a traceback for encoding problems regardless,
   if it's UnicodeDecodeError.  But that's something that can easily be fixed
   in subsequent releases as well.)
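
For reference, this is how "encoding" plus "errors" behaves with the standard library alone on Python 3. The handler name `latin1-fallback` and the sample bytes are made up for the example; `codecs.register_error` is the stock extension point.

```python
import codecs

raw = b"caf\xe9 \xe2\x98\x83"  # a latin-1 e-acute followed by a UTF-8 snowman

# "replace" substitutes U+FFFD for the undecodable byte instead of raising.
replaced = raw.decode("utf-8", errors="replace")


def latin1_fallback(error):
    """Hypothetical handler: reinterpret the offending bytes as latin-1."""
    if not isinstance(error, UnicodeDecodeError):
        raise error
    bad = error.object[error.start:error.end]
    # Return the replacement text and the position to resume decoding at.
    return bad.decode("latin-1"), error.end


codecs.register_error("latin1-fallback", latin1_fallback)
recovered = raw.decode("utf-8", errors="latin1-fallback")
```

A handler is also a natural place to log the problem once, rather than letting a traceback escape.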

> and the fact that the line length checks occur prior to encoding, ensuring 
> mid-codepoint truncation. These issues also contributed to my revert.

Line length checks are a super interesting example because I think they also 
illustrate my concerns as well.

To properly do message-splitting (which is why we're checking line length), you 
have to:

1. Check the length in octets (because it's actually a message-length limit
   in octets, not a line-length limit in characters).
2. Split the textual representation - ideally somewhere relevant like a word
   break, which you can only detect in text!
3. Try encoding again and ensure that the encoded representation is the
   correct length, repeating if necessary.

This is an implementation-level bug though, not an interface-level one, so I'm 
also comfortable fixing this bug in the future.
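
The splitting steps above can be sketched as follows for Python 3. `split_message` and its parameters are hypothetical, and the sketch ignores the octets taken up by the command and target, which a real implementation would have to subtract from the limit.

```python
def split_message(text, max_octets, encoding="utf-8"):
    """Split text into chunks whose encoded form fits in max_octets.

    Hypothetical helper: prefers breaking at a space and falls back to a
    character boundary when a single word is too long.
    """
    chunks = []
    remaining = text
    while remaining:
        end = len(remaining)
        # Measure in octets, shrink in text, and re-encode until it fits.
        while len(remaining[:end].encode(encoding)) > max_octets:
            space = remaining.rfind(" ", 0, end)
            end = space if space > 0 else end - 1
        if end == 0:
            raise ValueError("max_octets is too small for one character")
        chunks.append(remaining[:end])
        remaining = remaining[end:].lstrip(" ")
    return chunks
```

Because every cut is made on the text and only the measurement is done on the encoded form, no chunk can end in the middle of a code point.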

>> My points are, separately:
>> 
>> IRC is text. It's nonsensical to process it as bytes, because you can't 
>> process it as bytes.  This is separate from the question of "what encoding 
>> is IRC".
> 
> It's nonsensical that it be finally presented to a human as raw bytes.
> I'm advocating for the decision to be made as late as possible.  That
> doesn't mean we can't provide an easy-to-use recoding client that we
> encourage people to turn to first.

You can't process it as bytes either, though.  In some cases you think you can, 
but then you get mid-codepoint truncation :-).
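
Mid-codepoint truncation in a few lines of Python 3; the sample word and cut position are arbitrary.

```python
encoded = "na\xefve".encode("utf-8")  # the i-diaeresis occupies two octets
truncated = encoded[:3]               # the cut lands between those two octets

try:
    truncated.decode("utf-8")
    outcome = "decoded"
except UnicodeDecodeError:
    outcome = "broken"
```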

> But we can't have *a* fallback encoding.  My encoding detector program
> indicates that latin-1 is the second most popular encoding for
> European IRC servers, but Russian servers I sampled (not in
> netsplit.de's top 10) used a variety of Cyrillic encodings.

If you really want to do something this sophisticated (and, I should note: no 
other IRC clients or bots I'm aware of do, so I think you've got an 
unrealistically tight set of requirements) then you can just write your own 
single codec that composes a bunch of others, and install it.  Python's 
encoding system is extensible for exactly this reason :).
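
A minimal sketch of such a composite codec on Python 3, using `codecs.register` and `codecs.CodecInfo` from the standard library. The codec name `irc_fallback` and the candidate order are assumptions; a real one would pick candidates to suit the servers it talks to.

```python
import codecs

CANDIDATES = ("utf-8", "latin-1")  # assumed order; latin-1 accepts any bytes


def _decode(data, errors="strict"):
    raw = bytes(data)  # the codec machinery may hand over a memoryview
    for name in CANDIDATES[:-1]:
        try:
            return raw.decode(name), len(raw)
        except UnicodeDecodeError:
            continue
    # Last resort: the final candidate, honouring the caller's error policy.
    return raw.decode(CANDIDATES[-1], errors), len(raw)


def _encode(text, errors="strict"):
    # Always send UTF-8; only decoding needs to guess.
    return text.encode("utf-8", errors), len(text)


def _search(name):
    if name == "irc_fallback":
        return codecs.CodecInfo(name="irc_fallback", encode=_encode, decode=_decode)
    return None


codecs.register(_search)
```

Once registered, `b"caf\xe9".decode("irc_fallback")` and `b"caf\xc3\xa9".decode("irc_fallback")` both go through the ordinary `decode` call, so nothing IRC-specific is needed at the call site.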

> I also want to enable arbitrary recovery strategies for bad encodings.

This is totally not an IRC-specific thing though :-).

> For