On Fri, Nov 18, 2016 at 05:36:16PM -0800, Glyph Lefkowitz wrote:
>
> "doesn't work" is a pretty black-and-white assessment.  Are you anticipating 
> a problem with the way the interface is specified that it can't be easily 
> changed?
>

Yes.  Here's the lede: IRCClient should deal in bytes and we should
introduce a ProtocolWrapper-like thing that encodes and decodes
command prefixes and parameters.  It should implement an interface,
and we can start with an implementation that only knows about UTF-8.
The obvious advantage of this is that you can more easily write
IRCClients that work on both Python 2 and 3.  I'll attempt to explain
others in the rest of this email.

>
> I should say up front here that I think I was being too emphatic in my 
> support for UTF-8.
>

Phew!

>
> Test regressions are listed because they're unambiguously cause for a revert; 
> "undesirable" is intentionally vague because we might decide to revert a 
> thing for no reason.  I guess opening a PR for a discussion like this is 
> reasonable.
>

Good to know!

> This could be considered an incompatible interface change; I'm honestly not 
> sure about the exact type signatures of various methods to say whether it is 
> or not.
>

I'm also not entirely sure of the consequences of this interface
change.  I think it deserves more thought before it becomes an API
that we have to support.  This is the primary reason I opened the
revert PR.

More precisely, I'm worried about the fact that the implementation
raises a decoding exception that cannot be handled in user code when
it receives non-UTF-8 messages, and the fact that the line length
checks occur prior to encoding, so a line that fits when counted in
characters can exceed the limit in bytes and be truncated
mid-codepoint.  These issues also contributed to my revert.
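
To make the line length concern concrete, here is a minimal sketch.
It is not Twisted code: MAX_LINE and the sample message are
assumptions chosen only so the effect is visible.

```python
# Illustrative only: MAX_LINE and the message are assumptions,
# not values taken from IRCClient.
MAX_LINE = 15  # pretend per-line budget

# Cyrillic letters take 2 bytes each in UTF-8.
message = "\u043f\u0440\u0438\u0432\u0435\u0442, \u043c\u0438\u0440! \u043f\u0440\u0438\u0432\u0435\u0442"

# The length check is applied to text (code points), before encoding.
chunk = message[:MAX_LINE]
encoded = chunk.encode("utf-8")

chars = len(chunk)     # what the check counted
octets = len(encoded)  # what actually goes on the wire

# A peer that enforces the same limit in bytes cuts the line here.
truncated = encoded[:MAX_LINE]

try:
    truncated.decode("utf-8")
    decodes_cleanly = True
except UnicodeDecodeError:
    # The cut landed in the middle of a 2-byte sequence.
    decodes_cleanly = False
```

The chunk passes a 15-character check but occupies more than 15
bytes, and cutting it at 15 bytes leaves a dangling lead byte.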

>
> My points are, separately:
>
> IRC is text. It's nonsensical to process it as bytes, because you can't 
> process it as bytes.  This is separate from the question of "what encoding is 
> IRC".

It's nonsensical that it be finally presented to a human as raw bytes.
I'm advocating for the decision to be made as late as possible.  That
doesn't mean we can't provide an easy-to-use recoding client that we
encourage people to turn to first.

> UTF-8 is good. There should be gradual social pressure to use UTF-8 
> everywhere (I'm a fan of http://utf8everywhere.org 
> <http://utf8everywhere.org/>).  This is especially true in protocols like IRC 
> and filenames where there's no mechanism to specify an encoding so that it 
> can be correctly decoded.  Therefore:
> an initial release which features UTF-8 only is fine; therefore there's no 
> need to do a revert.
> defaulting to UTF-8 is reasonable for the forseeable future; users should 
> only change this if they know that they want something unusual.
> IRC is an incompatible and broken wasteland; thanks to your quantitative 
> research we know exactly how broken.  Therefore:
> "support alternate encodings" is a valuable feature.  Supporting point 2.1, 
> this feature can be added on at any later point, making a revert of the 
> present implementation unnecessary.
> We can, and should, just go ahead and add support for alternate (per-server, 
> per-channel, per-user) default and fallback encodings.
> We should always have a fallback encoding, since blowing up on "invalid" data 
> on a protocol where there's no standard to say what is or isn't valid doesn't 
> seem very helpful.
>

I appreciate the consistency of this, and agree the documented
preference should be a client implementation that assumes UTF-8.  But
we can't have *a* fallback encoding.  My encoding detector program
indicates that latin-1 is the second most popular encoding for
European IRC servers, but Russian servers I sampled (not in
netsplit.de's top 10) used a variety of Cyrillic encodings.

I also want to enable arbitrary recovery strategies for bad encodings.
For instance, in the case that an IRC client or server truncates a
code point at a line boundary, it might be the right idea to binary
search until the invalid byte sequence is found, and then exclude it.
It might be the right idea to buffer the message for a time in the
hopes that the codepoint got split over two lines.
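
As a rough sketch of those two strategies (assuming UTF-8, and
assuming the damage is only an incomplete sequence at the end of a
parameter), the helper name below is hypothetical and a short scan
from the end stands in for the binary search:

```python
import codecs


def trim_incomplete_tail(data, encoding="utf-8", max_drop=3):
    """Hypothetical helper: drop up to max_drop trailing bytes until
    the rest decodes.  Returns (text, dropped_bytes)."""
    for drop in range(max_drop + 1):
        cut = len(data) - drop
        if cut < 0:
            break
        try:
            return data[:cut].decode(encoding), data[cut:]
        except UnicodeDecodeError:
            continue
    raise ValueError("damage is not limited to a truncated tail")


# Strategy 1: exclude the invalid trailing bytes.
text, dropped = trim_incomplete_tail(b"caf\xc3")

# Strategy 2: buffer, hoping the next line carries the rest.  The
# stdlib incremental decoder holds back an incomplete sequence.
decoder = codecs.getincrementaldecoder("utf-8")()
first = decoder.decode(b"caf\xc3")          # lead byte is held back
second = decoder.decode(b"\xa9 au lait")    # sequence is completed
```

Neither is possible if the client has already raised on the first
line before user code sees the bytes.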

And what if somebody wants to run another encoding survey?

I don't expect most users to do any of that, but *I* certainly want to
without having to copy and paste a bunch of code.

> >
> > When I received Arabic PDFs on a FAT16 USB drive with filenames in
> > CP1256, I had to switch mlterm to that particular code page to read
> > the directory listings so I could use convmv to convert them to UTF-8.
>
> There is no question that your life has been hard, and that a wide array of 
> people have made bad decisions that contribute to your difficulties. :-)
>

My real point was that dealing with bad encodings is not theoretical.
Nobody knew the encoding, by the way; they just knew the USB drive
worked for some of them and not others, and were resorting to printing
things out or taking screen shots.

That's the situation opinionated software with monolithic abstractions
creates.  People *will* find workarounds that are terrible for a bunch
of reasons.  I can vouch for the utility of tools that decide on
encodings as late as possible.

Note that I'm not asking that we be everything to all people, but
rather that we allow people the option of dealing with the IRC
encoding disaster the way they see fit.

>
> >
>
> But, Linux's FAT16 driver has decided that.
>
> The correct way to solve your problem with current Linux (I don't know if 
> this was possible at the time) would be to address it with mount, not special 
> user-space software.  Specifically, I think it would be something like:
>
> mount -t fat -o fat=16,iocharset=utf-8,codepage=1256 
> /dev/disk/by-label/arabic.msdos /media/arabic.msdos
>
> Now all your GTK+ software works, too, because you're not trying to reconcile 
> your legacy format support at the application level.
>

I don't remember either.  But, now the driver *allows* me to do that
without requiring it, and also allows me to mount the file system so
that the paths are exposed as bytes.  Since nobody knew the encoding,
that was essential to letting me use mlterm to determine it.  Nowadays
I'd probably use chardet but would still need the raw bytes.

And as far as I know code point sequence truncation can also occur on
FAT16/32 partitions.  In the event of such truncation the automatic
decoding would only prevent me from mounting the partition.  I'm
thankful that the implementation allows me to choose a recovery
strategy in a very real edge case.  If it didn't, I'd have to look up
the file system's on-disk format and reimplement 99% of a FAT16 driver
to get at the data.

So it's the case that raw bytes weren't useful to me when I tried to
actually read the paths, but they were super useful to me when a
perfectly reasonable assumption was wrong.  And when no encoding is
mandated, perfectly reasonable assumptions do fail and fail often.

> >  What if I want to
> > write a bot that bridges two IRC networks?
> >
> In the current release, yes.  But in a future release: no, you can't just 
> bridge arbitrary bytes between two networks and expect them to work.  Those 
> networks (or channels, or users) might have different implicit encoding 
> rules; which, by default and only by default, should be utf-8.  In a 
> multi-encoding world, you may need to transcode between them to properly 
> bridge; this is a consequence of the fact that eventually you're presenting 
> this data as text to human eyeballs.

It's true that if one channel is latin-1 and the other is MacCyrillic,
a text-only IRCClient implementation could handle this just by
allowing the user to choose an encoding.  The recoding API I'm talking
about wouldn't give you anything more in that case.  But it would help
with truncation issues and channels' topics using different encodings.

> > But none of this is actually true.  What seems to be true is that
> > non-utf-8 encodings are rarely if ever seen on Freenode, and sometimes
> > to regularly seen on many other IRC servers.  These encodings are
> > certainly used.
>
> I can't really parse you here - are you saying that each network more or less 
> sticks to one encoding?
>

Not quite - I meant that in my survey, I saw no latin-1 on Freenode,
but that may be because they decided I was abusing the network early
on in my attempt to list and join channels.

But on other networks I saw a lot of different encodings, used across
different channels, so that the channel list contained topics encoded
in many different 8-bit encodings.

>
> Sorry, my statement you were responding to here was way too strong.  What I 
> meant to say here is that long term there is no way to get a "right" answer 
> in this ecosystem, so "UTF-8 is the only correct answer" is the only 
> direction we can push in to actually make things work reasonably by default 
> an increasing proportion of the time.  For the forseeable future, adding the 
> ability to cope with other encodings (encoding a fallback to latin-1 so that 
> you can at least do demojibakefication manually after copy/pasting) is 
> something a general-purpose IRC library absolutely needs.  This is why every 
> client has an "encoding" selection menu, too.
>

For what it's worth, I want to make it easy to use UTF-8.  I just
don't want to make it hard to use an encoding that's *not* UTF-8.

> > It makes more sense to have an implementation that parses protocol
> > elements as bytes and provides a bytes API.  It's fine to also provide
> > a decoded text API, but not to the exclusion of bytes.
>
> This is the point where I think we diverge.  I don't think adding a bytes API 
> actually adds any value.  Trying to process the contents of IRC as bytes as 
> any way leads to inevitable failures ("line" truncation midway through a 
> UTF-8 escape sequence for example).
>

This is precisely where we disagree.  As I described above, I can
think of a couple ways to handle mid-codepoint truncation.  A
Twisted-based IRC client should have the option to implement its own.
The end result would still be text (or at least an informative log
message.)

I think the best way to handle this is to have a bytes-only IRC client
that can then be wrapped with something that decodes prefixes and
parameters.  We can provide a UTF-8 recoder that people are encouraged
to use, and an interface that allows implementers to choose their own
encoding strategy.

I don't think it can be a ProtocolWrapper, because it'll need to know
about the particulars of IRCClient.  That means I don't have a clear
idea of the interface yet.  Until I do, I'd prefer we ship something
that implements the RFC and allows people to handle encoding the way
they see fit.  I will say I'm happy to take a stab at a recoder.
But it can't be written with IRCClient as it stands now and would
certainly be done in a separate PR.
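
Purely to illustrate the shape of the idea, and not as a proposal for
the actual interface, a recoder might look something like the sketch
below.  Every name in it (UTF8Recoder, FallbackRecoder, decode,
encode) is hypothetical and nothing here exists in Twisted.

```python
class UTF8Recoder:
    """Hypothetical recoder: converts a bytes prefix and bytes
    parameters to text and back, assuming UTF-8."""

    encoding = "utf-8"

    def decode(self, prefix, params):
        return (prefix.decode(self.encoding),
                [p.decode(self.encoding) for p in params])

    def encode(self, prefix, params):
        return (prefix.encode(self.encoding),
                [p.encode(self.encoding) for p in params])


class FallbackRecoder(UTF8Recoder):
    """Hypothetical variant: try UTF-8 first, then an encoding chosen
    by whoever constructs the recoder."""

    def __init__(self, fallback="latin-1"):
        self.fallback = fallback

    def _decode_one(self, raw):
        try:
            return raw.decode(self.encoding)
        except UnicodeDecodeError:
            return raw.decode(self.fallback)

    def decode(self, prefix, params):
        return (self._decode_one(prefix),
                [self._decode_one(p) for p in params])


# A bytes-only client would hand over raw protocol elements:
strict = UTF8Recoder().decode(b"nick!user@host",
                              [b"#chan", b"h\xc3\xa9llo"])
lenient = FallbackRecoder().decode(b"nick!user@host",
                                   [b"#chan", b"caf\xe9"])
```

The point is only that the policy lives outside the bytes-level
client, so a different implementation can be swapped in.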

Shipping what we have now will mean we're putting bugs out there (see
the line length issues called out in the ticket) and an interface I
think we haven't thought through, but that certainly limits what IRC
protocol messages you can receive.

(Also - I don't think any multibyte UTF-8 sequence can contain a byte
<= 127, so it can't be split by code that only looks for ASCII
delimiters.  This of course isn't true for wider encodings such as
UTF-16: '\n\n' is a totally valid UTF-16 sequence.)
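
A quick way to check both halves of that claim in Python (the sample
characters are arbitrary picks covering 2-, 3- and 4-byte sequences):

```python
# 2-, 3- and 4-byte UTF-8 characters, chosen arbitrarily.
samples = "\u00e9\u20ac\u65e5\U00010348"

# Every byte of a multibyte UTF-8 sequence has the high bit set, so
# none of them can be mistaken for an ASCII delimiter such as b"\n".
all_high = all(byte >= 0x80
               for ch in samples
               for byte in ch.encode("utf-8"))

# In UTF-16 the two bytes 0x0A 0x0A form a single code unit.
utf16_char = b"\n\n".decode("utf-16-le")

# Splitting its encoded form on b"\n" tears that character apart.
pieces = utf16_char.encode("utf-16-le").split(b"\n")
```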

> So, the thing IRC is transmitting is text.  The way it's transmitting it is 
> poorly specified and will need manual configurability hooks to specify 
> encoding information, probably forever, and perhaps even to guess it 
> (although "encoding=chardet" would be nice).  I agree that just saying "UTF-8 
> or GTFO" is not a sustainable approach at all.  "UTF-8 or have a bad time 
> with this fiddly customization API and config file" is fine, because anyone 
> wanting something else is probably already having a bad time.
>
> If you are engaging in a real abuse of the IRC protocol and you're treating 
> it as an 8-bit clean stream to send some escaped binary data through (like a 
> video stream, something like that), well, that's what the 'charmap' alias of 
> 'latin-1' is for :-).
>

I guess charmap could be used to implement the recovery scheme I keep
talking about, but then we'd be telling people to work out the
recoding interaction between IRCClient and their own implementation.
I'd like to provide a defined way of doing so eventually.
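
For reference, the property that makes this workable is that
Python's latin-1 codec maps every byte to the code point with the
same value, so the round trip is lossless.  A small sketch, with
KOI8-R picked only as an example of a "real" encoding to recover:

```python
# Every possible byte value survives a latin-1 round trip.
raw = bytes(range(256))
round_tripped = raw.decode("latin-1").encode("latin-1")

# So text decoded as latin-1 can be turned back into the original
# bytes and decoded again once the real encoding is known.
payload = "\u043f\u0440\u0438\u0432\u0435\u0442".encode("koi8-r")
carried = payload.decode("latin-1")     # mojibake, but nothing lost
recovered = carried.encode("latin-1").decode("koi8-r")
```

The catch is the one described above: each user has to know to undo
the latin-1 step themselves.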

> So... have I sold you?
>

On default UTF-8?  Absolutely!  But I don't know exactly the way to do
it, so I'd rather provide a Python 3 port that actually implements the
protocol, and then work out a nice recoding API.

Thanks for taking the time to talk through this.  I appreciate it!

-Mark
