Re: [Python-Dev] Which direction is UnTransform? / Unicode is different

2013-11-20 Thread Steven D'Aprano
On Tue, Nov 19, 2013 at 05:28:48PM -0800, Jim J. Jewett wrote:
 (Fri Nov 15 16:57:00 CET 2013) Stephen J. Turnbull wrote:
   Serhiy Storchaka wrote:
   If the transform() method will be added, I prefer to have only
   one transformation method and specify a direction by the
   transformation name (bzip2/unbzip2).
 Me too.  Until I consider special cases like compress, or lower,
 and realize that there are enough special cases to become a major wart
 if generic transforms ever became popular.  

I'm not sure I understand this comment. Why are compress and lower 
special cases? If there's a compress codec, presumably there'll be an 
uncompress or expand that reverses it. In the case of lower, it's 
not losslessly reversible, but there's certainly a reverse 
transformation, upper.
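
To make the distinction concrete, here is a minimal sketch using only the standard library. Note that `zlib` is my stand-in for the hypothetical compress codec; nothing in the thread specifies which compressor is meant.

```python
import zlib

data = b"spam and eggs " * 10

# compress has an exact inverse: the round trip restores every byte.
assert zlib.decompress(zlib.compress(data)) == data

# lower has a reverse transformation (upper), but the pair is lossy:
# the original mix of upper and lower case cannot be recovered.
text = "Hello World"
round_trip = text.lower().upper()
assert round_trip == "HELLO WORLD"
assert round_trip != text
```

So both have a "reverse" in some sense, but only one of them gets you back where you started.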

Some transformations are their own reverse, e.g. rot13. In that case, 
there's no need for an unrot13 codec, since applying it twice undoes it.

  People think about these transformations as en- or de-coding, not
  transforming, most of the time.  Even for a transformation that is
  an involution (eg, rot13), people have a very clear idea of what's
  encoded and what's not, and they are going to prefer the names
  encode and decode for these (generic) operations in many cases.
 I think this is one of the major stumbling blocks with unicode.
 I originally disagreed strongly with what Stephen wrote -- but then
 I realized that all my counterexamples involved unicode text.

Counterexamples to what? Again, I'm afraid I can't really understand 
what point you're trying to make here. Perhaps an explicit 
counterexample, and an explicit statement of what you're disagreeing 
with (e.g. I disagree that people have a clear example of what's 
encoded and what's not) will help.

 But an 8-bit (even Latin-1, let alone ASCII) bytestring really doesn't
 seem encoded, and it doesn't make sense to decode a perfectly
 readable (ASCII) string into a sequence of code units.

Of course it is encoded. There's nothing a-like about the byte 0x61, 
byte 0x2E is nothing like a period, and there is nothing about the byte 
0x0A that forces text editors to start a new line -- or should that be 
0x0D, or even possibly 0x85?
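
A small sketch of that point, assuming Python 3: the bytes are just numbers until a decoding step is applied. The use of `latin-1` for the byte 0x85 is my choice for illustration, since 0x85 is not valid ASCII.

```python
raw = bytes([0x61, 0x2E, 0x0A])

# As plain numbers, the bytes carry no text at all.
assert list(raw) == [97, 46, 10]

# Only after decoding with a chosen encoding do they become "a", ".", newline.
text = raw.decode("ascii")
assert text == "a.\n"

# 0x0D and 0x85 are the other candidates mentioned for "new line".
# Once decoded, str.splitlines() treats all three as line boundaries.
candidates = bytes([0x0A, 0x0D, 0x85]).decode("latin-1")
assert candidates == "\n\r\x85"
assert "x\ny\rz\x85w".splitlines() == ["x", "y", "z", "w"]
```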

There's nothing that distinguishes the text spam from the four-byte 
integer 1936744813 (0x7370616d in hex) except the semantics that we 
grant it, and that includes an implicit transformation 0x73 - s, 
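
The same four bytes can be read either way, as this short sketch shows. It assumes Python 3 and a big-endian reading of the integer.

```python
raw = b"spam"

# Read as a big-endian four-byte integer.
as_int = int.from_bytes(raw, byteorder="big")
assert as_int == 1936744813 == 0x7370616D

# Read as ASCII text.
as_text = raw.decode("ascii")
assert as_text == "spam"

# The per-byte mapping is the implicit transformation: 0x73 stands for "s".
assert bytes([0x73]).decode("ascii") == "s"
```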

Reading this may help:

 Nor does it help that
 defines code unit as The minimal bit combination that can represent
 a unit of encoded text for processing or interchange. The Unicode
 Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code
 units in the UTF-16 encoding form, and 32-bit code units in the UTF-32
 encoding form. (See definition D77 in Section 3.9, Unicode Encoding

I agree that the official Unicode glossary is unfortunately confusing. 
It has a huge amount of information, often with confusingly similar 
terminology (code points and code units are, in a sense, opposites), and 
it's quite hard for beginners to Unicode to make sense of it all.

 I have to read that very carefully to avoid mentally translating it
 into Code Units are *en*coded, 

Code units *are* encoded, in the sense that we say a burger is cooked. 
Take a raw meat patty and cook it, and you get a burger. Similarly, code 
units are the product of an encoding process, hence have been encoded.

Code points (think of them as characters, modulo a few technicalities) 
are encoded *into* code units, which are stored as one or more bytes. 
Which code units you get depends on the encoding form you use, i.e. the 
codec.

If you start with the character a, and apply the UTF-8 encoding, 
you get a single 8-bit (one byte) code unit, 0x61. If you apply the 
UTF-16 (big endian) encoding, you get a single 16-bit (two bytes) 
code unit, 0x0061. If you apply the UTF-32be codec, you get a single 
32-bit (four bytes) code unit, 0x00000061.
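
You can check the three encodings at the interpreter. This is a minimal sketch assuming Python 3 and the standard codec names `utf-8`, `utf-16-be` and `utf-32-be`.

```python
utf8 = "a".encode("utf-8")
utf16 = "a".encode("utf-16-be")
utf32 = "a".encode("utf-32-be")

# One code point, three different byte sequences.
assert utf8 == b"\x61"
assert utf16 == b"\x00\x61"
assert utf32 == b"\x00\x00\x00\x61"

# One code unit each, of 1, 2 and 4 bytes respectively.
sizes = [len(encoded) for encoded in (utf8, utf16, utf32)]
assert sizes == [1, 2, 4]

# Decoding with the matching codec recovers the same character every time.
for encoded, codec in ((utf8, "utf-8"), (utf16, "utf-16-be"), (utf32, "utf-32-be")):
    assert encoded.decode(codec) == "a"
```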

 and there are lots of different
 complicated encodings that I wouldn't use unless I were doing special
 processing or interchange.

Very few of those encodings are Unicode. With the exception of a small 
handful of UTF-* codecs, and maybe one or two others, the vast majority 
are legacy encodings from the Bad Old Days when just about every 
computer had its own distinct character set, or sets. If you're a 
Windows user, the non-UTF codecs (all the Latin-whatever codecs, Big5, 
cp-whatever, koi8-whatever, there are dozens of them) are basically old 
Windows code pages and the equivalent from other computer systems.

And yes, it is best to avoid them like the plague except when you need 
them for interoperability with legacy data.
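
A quick illustration of why the legacy codecs are painful: the same byte means a different character in each one. The three codecs below are picked arbitrarily from those Python ships with.

```python
raw = b"\xe9"

# The same single byte decoded under three legacy encodings.
decoded = {codec: raw.decode(codec) for codec in ("latin-1", "cp1251", "koi8-r")}

# Under Latin-1 it is e-acute; the two Cyrillic code pages disagree
# with Latin-1 and with each other.
assert decoded["latin-1"] == "\u00e9"
assert len(set(decoded.values())) == 3

# In UTF-8, by contrast, e-acute has exactly one byte sequence.
assert "\u00e9".encode("utf-8") == b"\xc3\xa9"
```

Without knowing which code page produced the byte, there is no way to tell which character was intended.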

 If I'm not using the network, or if my
 interchange format already looks like readable ASCII, then unicode
 sure sounds like a complication.

It's not, not compared to the Bad Old Days. If 

Re: [Python-Dev] Which direction is UnTransform? / Unicode is different

2013-11-20 Thread Chris Angelico
On Wed, Nov 20, 2013 at 11:03 PM, Steven D'Aprano wrote:
 I *will* get confused over which
 direction is encoding and which is decoding. (Removing .decode()
 from the (unicode) str type in 3 does help a lot, if I have a Python 3
 interpreter running to check against.)

 It took me a long time to learn that text encodes to bytes, and bytes
 decode back to text. Using Python 3 really helped with that.

Rule of thumb: Stuff gets encoded for transmission/storage and decoded
for usage.

That covers encryption (you transmit the coded form and read the
decoded), compression (you store the tighter form and use the
expanded), Unicode (you store bytes, you work with characters), and
quite a few others. I don't know that it's an iron-clad rule (though I
can't off-hand think of a contrary example), but it's certainly an
easy way to remember a lot of the encode/decode pairs.
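
A minimal sketch of the rule of thumb, assuming Python 3. The use of `zlib` for the compression case is my own choice of example.

```python
import zlib

text = "na\u00efve caf\u00e9"

# Encode for storage/transmission: text becomes bytes.
stored = text.encode("utf-8")
assert isinstance(stored, bytes)

# Decode for usage: bytes become text again.
assert stored.decode("utf-8") == text

# Python 3 enforces the direction: str cannot decode, bytes cannot encode.
assert not hasattr(str, "decode")
assert not hasattr(bytes, "encode")

# Compression follows the same pattern: store the tight form, use the expanded.
packed = zlib.compress(stored)
assert zlib.decompress(packed) == stored
```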

Python-Dev mailing list

Re: [Python-Dev] Which direction is UnTransform? / Unicode is different

2013-11-20 Thread Walter Dörwald

On 20.11.13 02:28, Jim J. Jewett wrote:

Instead of relying on introspection of .decodes_to and .encodes_to, it
would be useful to have charsetcodecs and transformcodecs as entirely
different modules, with their own separate registries.  I will even
note that the existing help(codecs) seems more appropriate for
charsetcodecs than it does for the current conjoined module.

I don't understand how a registry of transformation functions would 
simplify code. Without the transform() method I would write:

import binascii

With the transform() method I should be able to write:


However how does the hex transformer get registered in the registry? If 
the hex transformer is not part of the stdlib, there must be some code 
that does the registration, but to get that code to execute, I'd have to 
import a module, so we're back to square one, as I'd have to write:

import hex_transformer

A way around this would be some kind of import magic, but is this really 
necessary to be able to avoid one import statement?
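
To illustrate the registration problem, here is a rough sketch. The registry, `register()` and the free function `transform()` are all hypothetical names invented for this example; they are not an existing or proposed stdlib API. Only `binascii.hexlify` is real.

```python
import binascii

# Hypothetical registry of named transformers (illustration only).
_transformers = {}

def register(name, func):
    """Make func available under name; someone has to call this."""
    _transformers[name] = func

def transform(data, name):
    """Look up a transformer by name and apply it."""
    return _transformers[name](data)

# Until the registering code has run, the name is unknown.
try:
    transform(b"spam", "hex")
    found_before_registration = True
except KeyError:
    found_before_registration = False

# This is what importing a third-party module would have to do at import time.
register("hex", binascii.hexlify)

result = transform(b"spam", "hex")
assert result == binascii.hexlify(b"spam") == b"7370616d"
assert found_before_registration is False
```

Either way, some module has to be imported before the name lookup can succeed.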

Furthermore different transformation functions might have different 
additional options. Supporting those is simple when we have simple 
transformation functions: The function has arguments, and those are 
documented where the function is documented. If we want to support 
custom options for the .transform() method, transform() would have to 
pass along *args, **kwargs to the underlying transformer. However this 
is difficult to document in a way that makes it easy to find which 
options exist for a particular transformer.
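
A sketch of what such pass-through would look like. The `transform()` function and its lookup table are hypothetical and exist only for this illustration; `zlib.compress` and its compression-level argument are real.

```python
import zlib

# Hypothetical name-to-function table (illustration only).
_TRANSFORMERS = {"zlib": zlib.compress}

def transform(data, name, *args, **kwargs):
    """Forward any extra options to the underlying transformer untouched."""
    return _TRANSFORMERS[name](data, *args, **kwargs)

data = b"spam" * 100

# Direct call: the level argument is documented with zlib.compress itself.
direct = zlib.compress(data, 9)

# Generic call: the same option, but nothing in transform()'s own
# signature or docstring tells the reader that "9" is a compression level.
generic = transform(data, "zlib", 9)

assert generic == direct
assert zlib.decompress(generic) == data
```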



Re: [Python-Dev] Which direction is UnTransform? / Unicode is different

2013-11-20 Thread Nick Coghlan
On 20 November 2013 23:38, Walter Dörwald wrote:
 On 20.11.13 02:28, Jim J. Jewett wrote:


 Instead of relying on introspection of .decodes_to and .encodes_to, it
 would be useful to have charsetcodecs and transformcodecs as entirely
 different modules, with their own separate registries.  I will even
 note that the existing help(codecs) seems more appropriate for
 charsetcodecs than it does for the current conjoined module.

 I don't understand how a registry of transformation functions would simplify
 code. Without the transform() method I would write:

 import binascii

 With the transform() method I should be able to write:


 However how does the hex transformer get registered in the registry? If the
 hex transformer is not part of the stdlib, there must be some code that does
 the registration, but to get that code to execute, I'd have to import a
 module, so we're back to square one, as I'd have to write:

 import hex_transformer

 A way around this would be some kind of import magic, but is this really
 necessary to be able to avoid one import statement?

Could we please move discussion of hypothetical future divisions of
the codec namespace into additional APIs to python-ideas? We don't
even have consensus to restore the codecs module to parity with its
Python 2 functionality at this point, let alone agreement on adding
more convenience APIs for functionality that we have core developers
arguing shouldn't be restored at all.


Nick Coghlan   |   Brisbane, Australia

[Python-Dev] Which direction is UnTransform? / Unicode is different

2013-11-19 Thread Jim J. Jewett

(Fri Nov 15 16:57:00 CET 2013) Stephen J. Turnbull wrote:

  Serhiy Storchaka wrote:

   If the transform() method will be added, I prefer to have only
   one transformation method and specify a direction by the
   transformation name (bzip2/unbzip2).

Me too.  Until I consider special cases like compress, or lower,
and realize that there are enough special cases to become a major wart
if generic transforms ever became popular.  

 People think about these transformations as en- or de-coding, not
 transforming, most of the time.  Even for a transformation that is
 an involution (eg, rot13), people have a very clear idea of what's
 encoded and what's not, and they are going to prefer the names
 encode and decode for these (generic) operations in many cases.

I think this is one of the major stumbling blocks with unicode.

I originally disagreed strongly with what Stephen wrote -- but then
I realized that all my counterexamples involved unicode text.

I can tell whether something is tarred or untarred, zipped or unzipped.

But an 8-bit (even Latin-1, let alone ASCII) bytestring really doesn't
seem encoded, and it doesn't make sense to decode a perfectly
readable (ASCII) string into a sequence of code units.

Nor does it help that
defines code unit as The minimal bit combination that can represent
a unit of encoded text for processing or interchange. The Unicode
Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code
units in the UTF-16 encoding form, and 32-bit code units in the UTF-32
encoding form. (See definition D77 in Section 3.9, Unicode Encoding

I have to read that very carefully to avoid mentally translating it
into Code Units are *en*coded, and there are lots of different
complicated encodings that I wouldn't use unless I were doing special
processing or interchange.  If I'm not using the network, or if my
interchange format already looks like readable ASCII, then unicode
sure sounds like a complication.  I *will* get confused over which
direction is encoding and which is decoding. (Removing .decode()
from the (unicode) str type in 3 does help a lot, if I have a Python 3
interpreter running to check against.)

I'm not sure exactly what implications the above has, but it certainly
supports separating the Text Processing from the generic codecs, both
in the documentation and in any potential new methods.

Instead of relying on introspection of .decodes_to and .encodes_to, it
would be useful to have charsetcodecs and transformcodecs as entirely
different modules, with their own separate registries.  I will even
note that the existing help(codecs) seems more appropriate for
charsetcodecs than it does for the current conjoined module.



If there are still threading problems with my replies, please 
email me with details, so that I can try to resolve them.  -jJ


Re: [Python-Dev] Which direction is UnTransform? / Unicode is different

2013-11-19 Thread Greg Ewing

My thought on this for the day, for what it's worth:
Anything that doesn't have directions clearly identifiable
as encoding and decoding maybe shouldn't be called
a codec?
