Re: [Python-Dev] bytes.from_hex()

2006-03-03 Thread Ron Adam
Greg Ewing wrote:
> Ron Adam wrote:
> 
>> This would apply to codecs that 
>> could return either bytes or strings, or strings or unicode, or bytes or 
>> unicode.
> 
> I'd need to see some concrete examples of such codecs
> before being convinced that they exist, or that they
> couldn't just as well return a fixed type that you
> then transform to what you want.

I think some codecs that currently return 'ascii' encoded text 
would be candidates.  If you use u'abc'.encode('rot13') you get an ascii 
string back and not a unicode string.  And if you use decode to get it 
back, you don't get the original unicode back, but an ascii 
representation of the original that you then need to decode to unicode.

> I suspect that said transformation would involve some
> further encoding or decoding, in which case you really
> have more than one codec.

Yes, I can see that.

So the following are probably better reasons to specify the type.

Codecs are very close to types, and they quite often result in a type 
change; having the change visible in the code adds to overall 
readability.  This is probably my main desire for this.

There is another reason for being explicit about types with codecs. If 
you store the codecs with a tuple of attributes as the keys, (name, 
in_type, out_type), then it makes it possible to look up the codec with 
the correct behavior and then just do it.

The alternative is to test the input, try it, then test the output.  The 
lookup doesn't add much overhead, but it does add safety.  Codecs don't 
seem to be the kind of thing you will want to pass a wide variety of 
objects into, so a narrow slot is probably preferable to a wide one here.

In cases where a codec might be useful in more than one combination of 
types, it could have an entry for each valid combination in the lookup 
table.  The codec lookup also validates the desired operation for nearly 
free.  Of course, the data will need to be valid as well. ;-)
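
A minimal sketch of the lookup-table idea, in today's Python.  The 
CODECS registry and convert() helper are hypothetical names, not a real 
API; the point is that keying on (name, in_type, out_type) makes the 
lookup itself validate the requested operation:

```python
# Hypothetical codec registry keyed by (name, in_type, out_type).
# Looking a codec up validates the desired operation for nearly free.
import base64

CODECS = {
    ('base64', bytes, str): lambda b: base64.b64encode(b).decode('ascii'),
    ('base64', str, bytes): base64.b64decode,
}

def convert(data, name, out_type):
    try:
        codec = CODECS[name, type(data), out_type]
    except KeyError:
        raise LookupError('no %r codec from %s to %s' %
                          (name, type(data).__name__, out_type.__name__))
    return codec(data)

text = convert(b'abc', 'base64', str)    # bytes -> base64 text
back = convert(text, 'base64', bytes)    # base64 text -> bytes
```

An unsupported combination, e.g. convert(3.5, 'base64', str), fails at 
lookup time with a LookupError, before any data is touched.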


Cheers,
   Ron



___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] bytes.from_hex()

2006-03-03 Thread Greg Ewing
Ron Adam wrote:

> This would apply to codecs that 
> could return either bytes or strings, or strings or unicode, or bytes or 
> unicode.

I'd need to see some concrete examples of such codecs
before being convinced that they exist, or that they
couldn't just as well return a fixed type that you
then transform to what you want.

I suspect that said transformation would involve some
further encoding or decoding, in which case you really
have more than one codec.

--
Greg


Re: [Python-Dev] bytes.from_hex()

2006-03-03 Thread Greg Ewing
Stephen J. Turnbull wrote:

> Doesn't that make base64 non-text by analogy to other "look but don't
> touch" strings like a .gz or vmlinuz?

No, because I can take a piece of base64 encoded data
and use a text editor to manually paste it in with some
other text (e.g. a plain-text (not MIME) mail message).
Then I can mail it to someone, or send it by text-mode
ftp, or translate it from Unix to MSDOS line endings and
give it to a Windows user, or translate it into EBCDIC
and give it to someone who has an IBM mainframe, etc,
etc. And the person at the other end can use their text
editor to manually extract it and decode it and recover
the original data.

I can't do any of those directly with a .gz file or
vmlinuz.

I'm not just making those uses up, BTW. It's not very
long ago people used to do things like that all the
time with uuencode, binhex, etc -- because mail and
news at the time were strictly text channels. They
still are, really -- otherwise we wouldn't be using
anything as hairy as MIME, we'd just mail our binary
files as-is.
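
The round trip described here can be sketched with the standard base64 
module (using today's Python 3 spellings; cp500 is one of Python's 
EBCDIC codecs): the encoded text survives re-wrapping, MSDOS line 
endings, and an EBCDIC round trip, all of which would destroy a .gz file.

```python
# base64 output survives text-channel mangling that raw binary would not.
import base64

data = bytes(range(256))                          # arbitrary binary data
text = base64.encodebytes(data).decode('ascii')   # wrapped at 76 columns
mangled = text.replace('\n', '\r\n')              # Unix -> MSDOS line endings
ebcdic = mangled.encode('cp500')                  # translate to EBCDIC
restored = ebcdic.decode('cp500')                 # ...and back again
assert base64.b64decode(restored.encode('ascii')) == data
```

The final decode works because b64decode (by default) discards 
characters outside the base64 alphabet, such as the inserted CR/LFs.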

--
Greg


Re: [Python-Dev] bytes.from_hex()

2006-03-03 Thread Ron Adam
Greg Ewing wrote:
> Ron Adam wrote:
> 
>> This uses syntax to determine the direction of encoding.  It would be 
>> easier and clearer to just require two arguments or a tuple.
>>
>>   u = unicode(b, 'encode', 'base64')
>>   b = bytes(u, 'decode', 'base64')
> 
> The point of the exercise was to avoid using the terms
> 'encode' and 'decode' entirely, since some people claim
> to be confused by them.

Yes, that was what I was trying for with the tounicode, tostring 
(tobytes) suggestion, but the direction could become ambiguous as you 
pointed out.

The constructors above have 4 data items implied:
  1: The source object which includes the source type and data
  2: The codec to use
  3: The direction of the operation
  4: The destination type (determined by the constructor used)

There isn't any ambiguity other than when to use encode or decode, but 
in this case that really is a documentation problem, because there are 
no ambiguities in this form.  Everything is explicit.

Another version of the above was pointed out to me off line that might 
be preferable.

   u = unicode(b, encode='base64')
   b = bytes(u, decode='base64')

Which would also work with the tostring and tounicode methods.

   u = b.tounicode(encode='base64')
   b = u.tobytes(decode='base64')
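
A rough emulation of the keyword-argument spelling above; to_unicode and 
to_bytes are hypothetical stand-ins for the constructors and methods 
being discussed, with base64 as the only wired-in codec:

```python
# Hypothetical helpers emulating unicode(b, encode=...) / bytes(u, decode=...).
import base64

def to_unicode(b, *, encode):
    if encode == 'base64':                # encode bytes into a text form
        return base64.b64encode(b).decode('ascii')
    raise LookupError(encode)

def to_bytes(u, *, decode):
    if decode == 'base64':                # decode a text form back to bytes
        return base64.b64decode(u)
    raise LookupError(decode)

u = to_unicode(b'abc', encode='base64')
b = to_bytes(u, decode='base64')
```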


> If we're going to continue to use 'encode' and 'decode',
> why not just make them functions:
> 
>b = encode(u, 'utf-8')
>u = decode(b, 'utf-8')

 >>> import codecs
 >>> codecs.decode('abc', 'ascii')
u'abc'

There's that time machine again. ;-)

> In the case of Unicode encodings, if you get them
> backwards you'll get a type error.
> 
> The advantage of using functions over methods or
> constructor arguments is that they can be applied
> uniformly to any input and output types.

If codecs are to be more general, then there may be times when the 
returned type needs to be specified.  This would apply to codecs that 
could return either bytes or strings, or strings or unicode, or bytes or 
unicode.  Some inputs may equally work with more than one output type. 
Of course, the answer in these cases may be to just 'know' what you will 
get, and then convert it to what you want.

Cheers,
Ron




Re: [Python-Dev] bytes.from_hex()

2006-03-02 Thread Stephen J. Turnbull
> "Greg" == Greg Ewing <[EMAIL PROTECTED]> writes:

Greg> (BTW, doesn't the fact that you *can* load an XML file into
Greg> what we call a "text editor" say something?)

Why not answer that question for yourself, and then turn that answer
into a description of "text semantics"?

For me, it says that, just like a gzipped file or the Linux kernel, I
can load an XML file into a text editor.  But unlike the .gz or
vmlinuz, I can easily find many useful things to do to the XML string
in the text editor.

Doesn't that make base64 non-text by analogy to other "look but don't
touch" strings like a .gz or vmlinuz?

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
   Ask not how you can "do" free software business;
  ask what your business can "do for" free software.


Re: [Python-Dev] bytes.from_hex()

2006-03-02 Thread Greg Ewing
Stephen J. Turnbull wrote:

> What you presumably meant was "what would you consider the proper type
> for (P)CDATA?"

No, I mean the whole thing, including all the <...> tags
etc. Like you see when you load an XML file into a text
editor. (BTW, doesn't the fact that you *can* load an
XML file into what we call a "text editor" say something?)

> nobody but authors of
> wire drivers[2] and introspective code will need to _explicitly_ call
> .encode('base64').

Even a wire driver writer will only need it if he's
trying to turn a text wire into a binary wire, as
far as I can see.

--
Greg


Re: [Python-Dev] bytes.from_hex()

2006-03-02 Thread Greg Ewing
Ron Adam wrote:

> This uses syntax to determine the direction of encoding.  It would be 
> easier and clearer to just require two arguments or a tuple.
> 
>   u = unicode(b, 'encode', 'base64')
>   b = bytes(u, 'decode', 'base64')

The point of the exercise was to avoid using the terms
'encode' and 'decode' entirely, since some people claim
to be confused by them.

While I succeeded in that, I concede that the result
isn't particularly intuitive and is arguably even more
confusing.

If we're going to continue to use 'encode' and 'decode',
why not just make them functions:

   b = encode(u, 'utf-8')
   u = decode(b, 'utf-8')

In the case of Unicode encodings, if you get them
backwards you'll get a type error.

The advantage of using functions over methods or
constructor arguments is that they can be applied
uniformly to any input and output types.

--
Greg


Re: [Python-Dev] bytes.from_hex()

2006-03-02 Thread Delaney, Timothy (Tim)
Delaney, Timothy (Tim) wrote:

> unicode.frombytes(cls, encoding)

unicode.frombytes(encoding) ...

Tim Delaney


Re: [Python-Dev] bytes.from_hex()

2006-03-02 Thread Delaney, Timothy (Tim)
Just van Rossum wrote:

> My preference for bytes -> unicode -> bytes API would be this:
> 
> u = unicode(b, "utf8")  # just like we have now
> b = u.tobytes("utf8")   # like u.encode(), but being explicit
> # about the resulting type

+1 - I was going to write exactly the same thing. The `bytes` type
shouldn't know anything about unicode - conversions between bytes and
unicode are entirely the responsibility of the unicode type.

Alternatively, rather than as part of the constructor (though that seems
the obvious place) some people may prefer a classmethod:

unicode.frombytes(cls, encoding)

It gives a nice symmetry.

Tim Delaney


Re: [Python-Dev] bytes.from_hex()

2006-03-02 Thread Stephen J. Turnbull
> "Greg" == Greg Ewing <[EMAIL PROTECTED]> writes:

Greg> But the base64 string itself *does* have text semantics.

What do you mean by that?  The strings of abstract "characters"
defined by RFC 3548 cannot be concatenated in general, they may only
be split at 4-character intervals, they can't be reliably searched as
text for a given octet or substring of the underlying binary object,
and deletion or insertion of octets can't be done without decoding and
re-encoding the whole string.  And of course humans can make neither
head nor tail of them in most cases.  The only useful semantics that
they have is "you can apply the base64 decoder" to them.

In other words, by far the most important effect of endowing that
string with "text semantics" is to force programmers to remember not
to use them.

Do you really mean to call that "text semantics"?
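
The concatenation point is easy to demonstrate with the standard base64 
module: once padding is involved, encoding a concatenation is not the 
concatenation of the encodings, and the encoded form can't even be 
searched for the underlying octets.

```python
# base64 text does not behave like text with respect to + and substring search.
import base64

a, b = b'spam', b'eggs'            # neither is a multiple of 3 octets
joined = base64.b64encode(a + b)   # b'c3BhbWVnZ3M='
assert joined != base64.b64encode(a) + base64.b64encode(b)
assert base64.b64encode(b) not in joined   # can't search for b's octets
```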

Greg> To me this is no different than using a string of decimal
Greg> digit characters to represent an integer, or a string of
Greg> hexadecimal digit characters to represent a bit
Greg> pattern. Would you say that those are not text, either?

"No different"?  OK, I'll take you at your word.

T2YgY291cnNlIEkgd291bGQgY29uc2lkZXIgdGhvc2UgdGV4dC4gIFRoZXkncmUgaHVtYW4t
cmVhZGFibGUu

Greg> What about XML? What would you consider the proper data type
Greg> for an XML document to be inside a Python program -- bytes
Greg> or text?

Neither.  If I must chose one of those ... well, "I know I have a
choice of programming languages, and I won't be using Python for this
task."  Fortunately, there's ElementTree.

What you presumably meant was "what would you consider the proper type
for (P)CDATA?"  And my answer is "text" for text, and "bytes" for
binary data (eg, image or audio).  Let ElementTree handle the wire
format: if an Element's text attribute has type "bytes", convert to
base64 and then to the appropriate coded character set for the
channel.  I don't wanna know about the content transfer encoding, and
I should have no need to.

Greg> You seem to want to reserve the term "text" for data that
Greg> doesn't ever have to be understood even a little bit by a
Greg> computer program, but that seems far too restrictive to me,
Greg> and a long way from established usage.

What I want to reserve "text" for is data streams that nonprogrammer
humans might want to manipulate with pencil, paper, scissors, and
paste, or programmers with re and text[n:m] = text2.  I have no
objection to computers using it, too, and even asking us humans to
respect some restrictions on the use of [:]= and +.  But to tell us to
give up those operations entirely makes it into non-text IMO.

Greg> [The] assumption [that the channel is ASCII-compatible] could
Greg> be very wrong.  What happens if it turns out they really need
Greg> to be encoded as UTF-16, or as EBCDIC?  All hell breaks
Greg> loose, as far as I can see, unless the programmer has kept
Greg> very firmly in mind that there is an implicit ASCII encoding
Greg> involved.

Greg> It's exactly to avoid the need for those kinds of mental
Greg> gymnastics

Agreed, such bookkeeping would be annoying.  But there's no _need_ for
it any way you look at it: just leave binary objects as-is until
you're ready to put them on the wire.[1]  Attach a binary-to-wire codec
to this end of the wire, and inject your data there. This puts the
responsibility where it belongs: with the author of the wire driver.
That's the point, which you already mentioned: nobody but authors of
wire drivers[2] and introspective code will need to _explicitly_ call
.encode('base64').

Greg> that Py3k will have a unified, encoding-agnostic data type
Greg> for all character strings.

Yeah, but if base64 produces character strings, Unicode becomes a
unified, encoding-agnostic data type for all data.  Just base64
everything, and now we don't need a bytes type, right?

Note that this is precisely what Emacs/MULE does (with a variable
width non-Unicode internal encoding and "base256" instead of base64),
so as demented as it may sound, it's all too historically plausible.
And it can be implemented, by accident, at the application program
level.  Why expose our users to increased risk of such trouble?


Footnotes: 
[1]  Of course you may want to manipulate the binary data, even as
text.  But who's going to use the base64 format for that purpose?

[2]  I mean to include those who are writing the git.object_id(),
PGP_key.fingerprint(), and ElementTree.write() methods.

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
   Ask not how you can "do" free software business;
  ask what your business can "do for" free software.

Re: [Python-Dev] bytes.from_hex()

2006-03-02 Thread Josiah Carlson

Just van Rossum <[EMAIL PROTECTED]> wrote:
> 
> Ron Adam wrote:
> 
> > Josiah Carlson wrote:
> > > Greg Ewing <[EMAIL PROTECTED]> wrote:
> > >>u = unicode(b)
> > >>u = unicode(b, 'utf8')
> > >>b = bytes['utf8'](u)
> > >>u = unicode['base64'](b)   # encoding
> > >>b = bytes(u, 'base64') # decoding
> > >>u2 = unicode['piglatin'](u1)   # encoding
> > >>u1 = unicode(u2, 'piglatin')   # decoding
> > > 
> > > Your provided semantics feel cumbersome and confusing to me, as
> > > compared with str/unicode.encode/decode() .
> > > 
> > >  - Josiah
> > 
> > This uses syntax to determine the direction of encoding.  It would be 
> > easier and clearer to just require two arguments or a tuple.
> > 
> >   u = unicode(b, 'encode', 'base64')
> >   b = bytes(u, 'decode', 'base64')
> > 
> >   b = bytes(u, 'encode', 'utf-8')
> >   u = unicode(b, 'decode', 'utf-8')
> > 
> >   u2 = unicode(u1, 'encode', 'piglatin')
> >   u1 = unicode(u2, 'decode', 'piglatin')
> > 
> > 
> > 
> > It looks somewhat cleaner if you combine them in a path style string.
> > 
> >   b = bytes(u, 'encode/utf-8')
> >   u = unicode(b, 'decode/utf-8')
> 
> It goes from bad to worse :(
> 
> I always liked the asymmetry between
> 
> u = unicode(s, "utf8")
> 
> and
> 
> s = u.encode("utf8")
> 
> which I think was the original design of the unicode API. Kudos to
> whoever came up with that.

I personally have never used that mechanism.  I always used
s.decode('utf8') and u.encode('utf8').  I prefer the symmetry that
.encode() and .decode() offer.


> When I saw
> 
> b = bytes(u, "utf8")
> 
> mentioned for the first time, I thought: why on earth must the bytes
> constructor be coupled to the unicode API?!?! It makes no sense to me
> whatsoever.

It's not a 'unicode API'.  See integers for another example where a
second argument to a type object defines how to interpret the other
argument, or even arrays/structs where the first argument defines the
interpretation.


> Bytes have so much more use besides encoded text.

Agreed.


> I believe (please correct me if I'm wrong) that the encoding argument of
> bytes() was invented to make it easier to write byte literals. Perhaps a
> true bytes literal notation is in order after all?

Maybe, but I think the other earlier use-case was for using:
s2 = bytes(s1, 'base64')
If bytes objects received an .encode() method, or even a .tobytes()
method.  I could be misremembering.


> My preference for bytes -> unicode -> bytes API would be this:
> 
> u = unicode(b, "utf8")  # just like we have now
> b = u.tobytes("utf8")   # like u.encode(), but being explicit
> # about the resulting type
> 
> As to base64, while it works as a codec ("Why a base64 codec? Because we
> can!"), I don't find it a natural API at all, for such conversions.

Depending on whose definition of codec you listen to (is it a
compressor/decompressor, or a coder/decoder?), either very little of
what we have as 'codecs' are actual codecs (only zlib, etc.), or all of
them are.

I would imagine that base64, etc., were made into codecs, or really
encodings, because base64 is an 'encoding' of binary data in base64
format, similar to the way you can think of utf8 as an 'encoding' of
textual data in utf8 format.  I would argue, due to the "one obvious way
to do it", that using encodings/codecs should be preferred to one-shot
encoding/decoding functions in various modules (with some exceptions).

These exceptions are things like pickle, marshal, struct, etc., which
may take a non-basestring object and convert it into a byte string,
which is arguably an encoding of the object in a particular format.
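
As an aside, the codec-based spelling did survive into today's Python 3, 
where the binary transforms live behind names like 'base64_codec' and 
are reached via codecs.encode()/codecs.decode() rather than str.encode(); 
a small sketch:

```python
# bytes-to-bytes transform codecs in Python 3.
import codecs

encoded = codecs.encode(b'binary data', 'base64_codec')  # bytes -> bytes
assert codecs.decode(encoded, 'base64_codec') == b'binary data'
```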


 - Josiah



Re: [Python-Dev] bytes.from_hex()

2006-03-02 Thread Just van Rossum
Ron Adam wrote:

> Josiah Carlson wrote:
> > Greg Ewing <[EMAIL PROTECTED]> wrote:
> >>u = unicode(b)
> >>u = unicode(b, 'utf8')
> >>b = bytes['utf8'](u)
> >>u = unicode['base64'](b)   # encoding
> >>b = bytes(u, 'base64') # decoding
> >>u2 = unicode['piglatin'](u1)   # encoding
> >>u1 = unicode(u2, 'piglatin')   # decoding
> > 
> > Your provided semantics feel cumbersome and confusing to me, as
> > compared with str/unicode.encode/decode() .
> > 
> >  - Josiah
> 
> This uses syntax to determine the direction of encoding.  It would be 
> easier and clearer to just require two arguments or a tuple.
> 
>   u = unicode(b, 'encode', 'base64')
>   b = bytes(u, 'decode', 'base64')
> 
>   b = bytes(u, 'encode', 'utf-8')
>   u = unicode(b, 'decode', 'utf-8')
> 
>   u2 = unicode(u1, 'encode', 'piglatin')
>   u1 = unicode(u2, 'decode', 'piglatin')
> 
> 
> 
> It looks somewhat cleaner if you combine them in a path style string.
> 
>   b = bytes(u, 'encode/utf-8')
>   u = unicode(b, 'decode/utf-8')

It goes from bad to worse :(

I always liked the asymmetry between

u = unicode(s, "utf8")

and

s = u.encode("utf8")

which I think was the original design of the unicode API. Kudos to
whoever came up with that.

When I saw

b = bytes(u, "utf8")

mentioned for the first time, I thought: why on earth must the bytes
constructor be coupled to the unicode API?!?! It makes no sense to me
whatsoever. Bytes have so much more use besides encoded text.

I believe (please correct me if I'm wrong) that the encoding argument of
bytes() was invented to make it easier to write byte literals. Perhaps a
true bytes literal notation is in order after all?

My preference for bytes -> unicode -> bytes API would be this:

u = unicode(b, "utf8")  # just like we have now
b = u.tobytes("utf8")   # like u.encode(), but being explicit
# about the resulting type

As to base64, while it works as a codec ("Why a base64 codec? Because we
can!"), I don't find it a natural API at all, for such conversions.

(I do however agree with Greg Ewing that base64 encoded data is text,
not ascii-encoded bytes ;-)

Just-my-2-cts


Re: [Python-Dev] bytes.from_hex()

2006-03-02 Thread Ron Adam
Josiah Carlson wrote:
> Greg Ewing <[EMAIL PROTECTED]> wrote:
>>u = unicode(b)
>>u = unicode(b, 'utf8')
>>b = bytes['utf8'](u)
>>u = unicode['base64'](b)   # encoding
>>b = bytes(u, 'base64') # decoding
>>u2 = unicode['piglatin'](u1)   # encoding
>>u1 = unicode(u2, 'piglatin')   # decoding
> 
> Your provided semantics feel cumbersome and confusing to me, as compared
> with str/unicode.encode/decode() .
> 
>  - Josiah

This uses syntax to determine the direction of encoding.  It would be 
easier and clearer to just require two arguments or a tuple.

  u = unicode(b, 'encode', 'base64')
  b = bytes(u, 'decode', 'base64')

  b = bytes(u, 'encode', 'utf-8')
  u = unicode(b, 'decode', 'utf-8')

  u2 = unicode(u1, 'encode', 'piglatin')
  u1 = unicode(u2, 'decode', 'piglatin')



It looks somewhat cleaner if you combine them in a path style string.

  b = bytes(u, 'encode/utf-8')
  u = unicode(b, 'decode/utf-8')

Ron



Re: [Python-Dev] bytes.from_hex()

2006-03-01 Thread Josiah Carlson

Greg Ewing <[EMAIL PROTECTED]> wrote:
>u = unicode(b)
>u = unicode(b, 'utf8')
>b = bytes['utf8'](u)
>u = unicode['base64'](b)   # encoding
>b = bytes(u, 'base64') # decoding
>u2 = unicode['piglatin'](u1)   # encoding
>u1 = unicode(u2, 'piglatin')   # decoding

Your provided semantics feel cumbersome and confusing to me, as compared
with str/unicode.encode/decode() .

 - Josiah



Re: [Python-Dev] bytes.from_hex()

2006-03-01 Thread Greg Ewing
Ron Adam wrote:

> 1. We can specify the operation and not be sure of the resulting type.
> 
>*or*
> 
> 2. We can specify the type and not always be sure of the operation.
> 
> maybe there's a way to specify both so it's unambiguous?

Here's another take on the matter. When we're doing
Unicode encoding or decoding, we're performing a
type conversion. The natural way to write a type
conversion in Python is with a constructor. But we
can't just say

   u = unicode(b)

because that doesn't give enough information. We want
to say that b is really of type e.g. "bytes containing
utf8 encoded text":

   u = unicode(b, 'utf8')

Here we're not thinking of the 'utf8' as selecting an
encoder or decoder, but of giving extra information
about the type of b, that isn't carried by b itself.

Now, going in the other direction, we might think to
write

   b = bytes(u, 'utf8')

But that wouldn't be right, because if we interpret this
consistently it would mean we're saying that u contains
utf8-encoded information, which is nonsense. What we
need is a way of saying "construct me something of type
'bytes containing utf8-encoded text'":

   b = bytes['utf8'](u)

Here I've coined the notation t[enc] which
evaluates to a callable object which constructs an
object of type t by encoding its argument according
to enc.

Now let's consider base64. Here, the roles of bytes
and unicode are reversed, because the bytes are just
bytes without any further interpretation, whereas
the unicode is really "unicode containing base64
encoded data". So we write

   u = unicode['base64'](b)   # encoding

   b = bytes(u, 'base64') # decoding

Note that this scheme is reasonably idiot-proof, e.g.

   u = unicode(b, 'base64')

results in a type error, because this specifies
a decoding operation, and the base64 decoder takes
text as input, not bytes.

What happens with transformations where the input and
output types are the same? In this scheme, they're
not really the same any more, because we're providing
extra type information. Suppose we had a code called
'piglatin' which goes from unicode to unicode. The
types involved are really "text" and "piglatin-encoded
text", so we write

   u2 = unicode['piglatin'](u1)   # encoding

   u1 = unicode(u2, 'piglatin')   # decoding

Here you won't get any type error if you get things
backwards, but there's not much that can be done
about that. You just have to keep straight which
of your strings contain piglatin and which don't.

Is this scheme any better than having encode and
decode methods/functions? I'm not sure, but it
shows that a suitably enhanced notion of "data
type" can be used to replace the notions of
encoding and decoding and maybe reduce potential
confusion about which direction is which.
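
For concreteness, the t[enc] notation can be sketched with ordinary 
classes; TextType and BytesType below are illustrative stand-ins for 
unicode and bytes, not proposed names, and base64 is the only codec 
wired in:

```python
# Sketch of the t[enc] scheme: t[enc] encodes *into* t, t(x, enc) decodes.
import base64

class TextType:
    """Stand-in for unicode; TextType[enc] yields an *encoding* constructor."""
    def __class_getitem__(cls, enc):
        if enc == 'base64':
            return lambda b: base64.b64encode(b).decode('ascii')
        raise LookupError(enc)

class BytesType:
    """Stand-in for bytes; BytesType(u, enc) performs a *decoding*."""
    def __new__(cls, u, enc):
        if enc == 'base64':
            return base64.b64decode(u)
        raise LookupError(enc)

u = TextType['base64'](b'abc')   # encoding: bytes -> base64 text
b = BytesType(u, 'base64')       # decoding: base64 text -> bytes
```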

-- 
Greg Ewing, Computer Science Dept, +--+
University of Canterbury,  | Carpe post meridiam! |
Christchurch, New Zealand  | (I'm not a morning person.)  |
[EMAIL PROTECTED]  +--+


Re: [Python-Dev] bytes.from_hex()

2006-03-01 Thread Ron Adam
Greg Ewing wrote:
> Ron Adam wrote:
> 
>> While playing around with the example bytes class I noticed code reads 
>> much better when I use methods called tounicode and tostring.
>>
>> b64ustring = b.tounicode('base64')
>> b = bytes(b64ustring, 'base64')
> 
> I don't like that, because it creates a dependency
> (conceptually, at least) between the bytes type and
> the unicode type. And why unicode in particular?
> Why should it have a tounicode() method, but not
> a toint() or tofloat() or tolist() etc.?

I don't think it creates a dependency between the types, but it does 
create a stronger relationship between them when a method that returns a 
fixed type is used.

No reason not to, other than avoiding methods that really aren't 
needed.  But if it makes sense to have them, sure.  If a codec isn't 
needed, a regular constructor should probably be used instead.


>> I'm not suggesting we start using to-type everywhere, just where it 
>> might make things clearer over decode and encode.
> 
> Another thing is that it only works if the codec
> transforms between two different types. If you
> have a bytes-to-bytes transformation, for example,
> then
> 
>   b2 = b1.tobytes('some-weird-encoding')
> 
> is ambiguous.

Are you asking if it's decoding or encoding?

   bytes to unicode ->  decoding
   unicode to bytes ->  encoding

   bytes to bytes -> ?

Good point, I think this defines part of the difficulty.

1. We can specify the operation and not be sure of the resulting type.

   *or*

2. We can specify the type and not always be sure of the operation.

maybe there's a way to specify both so it's unambiguous?


Ron








Re: [Python-Dev] bytes.from_hex()

2006-03-01 Thread Michael Urman
[My apologies Greg; I meant to send this to the whole list. I really
need a list-reply button in GMail. ]

On 3/1/06, Greg Ewing <[EMAIL PROTECTED]> wrote:
> I don't like that, because it creates a dependency
> (conceptually, at least) between the bytes type and
> the unicode type.

I only find half of this bothersome. The unicode type has a pretty
clear dependency on the bytestring type: all I/O needs to be done in
bytes. Various APIs may mask this by accepting unicode values and
transparently doing the right thing, but from the theoretical
standpoint we pretend there is no simple serialization of unicode
values. But the reverse is not true: the bytestring type has no
dependency on unicode.

As a matter of practicality versus purity, however, I think it's a good
choice to let the bytestring type have a tie to unicode, much like the
str type implicitly does now.  But you're absolutely right that adding a
.tounicode begs the question: why not a .tointeger?

To try to step back and summarize the viewpoints I've seen so far,
there are three main requirements.

  1) We want things that are conceptually text to be stored in memory
as unicode values.
  2) We want there to be some unambiguous conversion via codecs
between bytestrings and unicode values. This should help teaching,
learning, and remembering unicode.
  3) We want a way to apply and reverse compressions, encodings,
encryptions, etc., which are not only between bytestrings and unicode
values; they may be between any two arbitrary types. This allows
writing practical programs.

There seems to be little disagreement over 1, provided sufficiently
efficient implementation, or sufficient string powers in the
bytestring type. To satisfy both 2 and 3, there seem to be a couple of
options. What other requirements do we have?

For (2):
  a) Restrict the existing helpers to be only bytestring.decode and
unicode.encode, possibly enforcing output types of the opposite kind,
and removing bytestring.encode
  b) Add new methods with these semantics, e.g. bytestring.udecode and
unicode.uencode

For (3):
  c) Create new helpers codecs.encode(obj, encoding, errors) and
codecs.decode(obj, encoding, errors)
  d) [Keep existing bytestring and unicode helper methods as is, and]
require use of codecs.getencoder() and codecs.getdecoder() for
arbitrary starting object types

Obviously 2a and 3d do not work together, but 2b and 3c work with
either complementary option. What other options do we have?
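
Option (d) is already expressible with the existing machinery; 
codecs.getdecoder() returns the raw stateless decoder, a callable that 
produces a (result, length_consumed) pair:

```python
# Looking up a decoder directly, rather than going through str/unicode methods.
import codecs

decode_utf8 = codecs.getdecoder('utf-8')
text, consumed = decode_utf8(b'caf\xc3\xa9')   # -> ('café', 5)
```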

Michael
--
Michael Urman  http://www.tortall.net/mu/blog


Re: [Python-Dev] bytes.from_hex()

2006-03-01 Thread Greg Ewing
Ron Adam wrote:

> While playing around with the example bytes class I noticed code reads 
> much better when I use methods called tounicode and tostring.
> 
> b64ustring = b.tounicode('base64')
> b = bytes(b64ustring, 'base64')

I don't like that, because it creates a dependency
(conceptually, at least) between the bytes type and
the unicode type. And why unicode in particular?
Why should it have a tounicode() method, but not
a toint() or tofloat() or tolist() etc.?

> I'm not suggesting we start using to-type everywhere, just where it 
> might make things clearer over decode and encode.

Another thing is that it only works if the codec
transforms between two different types. If you
have a bytes-to-bytes transformation, for example,
then

   b2 = b1.tobytes('some-weird-encoding')

is ambiguous.
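The ambiguity is easy to see with a real bytes-to-bytes transform such as zlib, where both directions take and return bytes (shown here with the codecs module's free functions; the tobytes() method is Greg's strawman, not a real API):

```python
import codecs

data = b"x" * 100
packed = codecs.encode(data, "zlib")      # bytes -> bytes (compress)
unpacked = codecs.decode(packed, "zlib")  # bytes -> bytes (decompress)
assert unpacked == data and len(packed) < len(data)
# A hypothetical b1.tobytes("zlib") could mean either direction.
```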

-- 
Greg Ewing, Computer Science Dept, +--+
University of Canterbury,  | Carpe post meridiam! |
Christchurch, New Zealand  | (I'm not a morning person.)  |
[EMAIL PROTECTED]  +--+


Re: [Python-Dev] bytes.from_hex()

2006-03-01 Thread Greg Ewing
Bill Janssen wrote:

> No, once it's in a particular encoding it's bytes, no longer text.

The point at issue is whether the characters produced
by base64 are in a particular encoding. According to
my reading of the RFC, they're not.

-- 
Greg Ewing, Computer Science Dept, +--+
University of Canterbury,  | Carpe post meridiam! |
Christchurch, New Zealand  | (I'm not a morning person.)  |
[EMAIL PROTECTED]  +--+


Re: [Python-Dev] bytes.from_hex()

2006-03-01 Thread Greg Ewing
Nick Coghlan wrote:

>    ascii_bytes = orig_bytes.decode("base64").encode("ascii")
> 
>    orig_bytes = ascii_bytes.decode("ascii").encode("base64")
> 
> The only slightly odd aspect is that this inverts the conventional meaning of 
> base64 encoding and decoding,

-1. Whatever we do, we shouldn't design things so
that it's necessary to write anything as
unintuitive as that.

We need to make up our minds whether the .encode()
and .decode() methods are only meant for Unicode
encodings, or whether they are for general
transformations between bytes and characters.

If they're only meant for Unicode, then bytes
should only have .decode(), unicode strings
should only have .encode(), and only Unicode
codecs should be available that way. Things
like base64 would need to have a different
interface.

If they're for general transformations, then
both types should have both methods, with the
return type depending on the codec you're
using, and it's the programmer's responsibility
to use codecs that make sense for what he's
doing.

But if they're for general transformations,
why limit them to just bytes and characters?
Following that through leads to giving *every*
object .encode() and .decode() methods. I
don't think we should go that far, but it's
hard to see where to draw the line. Are
bytes and strings special enough to justify
them having their own peculiar methods for
codec access?

-- 
Greg Ewing, Computer Science Dept, +--+
University of Canterbury,  | Carpe post meridiam! |
Christchurch, New Zealand  | (I'm not a morning person.)  |
[EMAIL PROTECTED]  +--+


Re: [Python-Dev] bytes.from_hex()

2006-03-01 Thread Michael Chermside
I wrote:
> ... I will say that if there were no legacy I'd prefer the tounicode()
> and tostring() (but shouldn't it be 'tobytes()' instead?) names for Python 3.0.

Scott Daniels replied:
> Wouldn't 'tobytes' and 'totext' be better for 3.0 where text == unicode?

Um... yes. Sorry, I'm not completely used to 3.0 yet. I'll need to borrow
the time machine for a little longer before my fingers really pick up on
the 3.0 names and idioms.

-- Michael Chermside



Re: [Python-Dev] bytes.from_hex()

2006-03-01 Thread Scott David Daniels
Chermside, Michael wrote:
> ... I will say that if there were no legacy I'd prefer the tounicode()
> and tostring() (but shouldn't it be 'tobytes()' instead?) names for Python 3.0.

Wouldn't 'tobytes' and 'totext' be better for 3.0 where text == unicode?

-- 
-- Scott David Daniels
[EMAIL PROTECTED]



Re: [Python-Dev] bytes.from_hex()

2006-03-01 Thread Bill Janssen
> Huh... just joining here but surely you don't mean a text string that
> doesn't use every character available in a particular encoding is
> "really bytes"... it's still a text string...

No, once it's in a particular encoding it's bytes, no longer text.

As you say,
> Keep these two concepts separate and you should be right :-)

Bill


Re: [Python-Dev] bytes.from_hex()

2006-03-01 Thread Chermside, Michael
Ron Adam writes:
> While playing around with the example bytes class I noticed code reads
> much better when I use methods called tounicode and tostring.
[...]
> I'm not suggesting we start using to-type everywhere, just where it 
> might make things clearer over decode and encode.

+1

I always find myself slightly confused by encode() and decode()
despite the fact that I understand (I think) the reason for the
choice of those names and by rights ought to have no trouble.

I'm not arguing that it's worth the gratuitous code breakage (I
don't have enough code using encode() and decode() for my opinion
to count in that matter.) But I will say that if there were no
legacy I'd prefer the tounicode() and tostring() (but shouldn't it
be 'tobytes()' instead?) names for Python 3.0.

-- Michael Chermside








Re: [Python-Dev] bytes.from_hex()

2006-03-01 Thread Ron Adam
Nick Coghlan wrote:
> All the unicode codecs, on the other hand, use encode to get from characters 
> to bytes and decode to get from bytes to characters.
> 
> So if bytes objects *did* have an encode method, it should still result in a 
> unicode object, just the same as a decode method does (because you are 
> encoding bytes as characters), and unicode objects would acquire a 
> corresponding decode method (that decodes from a character format such as 
> base64 to the original byte sequence).
>
> In the name of TOOWTDI, I'd suggest that we just eat the slight terminology 
> glitch in the rare cases like base64, hex and oct (where the character format 
> is technically the encoded format), and leave it so that there is a single 
> method pair (bytes.decode to go from bytes to characters, and text.encode to 
> go from characters to bytes).

I think you have it pretty straight here.


While playing around with the example bytes class I noticed code reads 
much better when I use methods called tounicode and tostring.

b64ustring = b.tounicode('base64')
b = bytes(b64ustring, 'base64')

The bytes type could then use, rather than ignore, the string decode 
codec for string-to-string decoding.

b64string = b.tostring('base64')
b = bytes(b64string, 'base64')

b = bytes(hexstring, 'hex')
hexstring = b.tostring('hex')

hexstring = b.tounicode('hex')

An exception could be raised if the codec does not support input or 
output type depending on the situation.
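A minimal sketch of what that typed dispatch might look like (the Bytes class, its method names, and the codec table are all hypothetical, patterned on the examples above):

```python
import binascii

# Hypothetical codec table keyed by (name, output type); a fuller design
# might key on (name, in_type, out_type).
_CODECS = {
    ("hex", str):   lambda b: binascii.hexlify(b).decode("ascii"),
    ("hex", bytes): lambda b: binascii.hexlify(b),
}

class Bytes(bytes):
    """Toy sketch only; not a real or proposed API."""

    def _convert(self, name, out_type):
        try:
            codec = _CODECS[(name, out_type)]
        except KeyError:
            # The lookup itself validates the requested operation.
            raise LookupError("codec %r cannot produce %s"
                              % (name, out_type.__name__))
        return codec(self)

    def tounicode(self, name):
        return self._convert(name, str)

    def tostring(self, name):
        return self._convert(name, bytes)

assert Bytes(b"\xff\x01").tounicode("hex") == "ff01"
```

An unsupported combination, such as tounicode with a codec registered only for bytes output, raises LookupError rather than silently producing the wrong type.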

This would allow different types of codecs to live together with less 
confusion, I think.

I'm not suggesting we start using to-type everywhere, just where it 
might make things clearer over decode and encode.


Expecting it not to fly, but just maybe it could?
   Ron




Re: [Python-Dev] bytes.from_hex()

2006-03-01 Thread Nick Coghlan
Bill Janssen wrote:
> Greg Ewing wrote:
>> Bill Janssen wrote:
>>
>>> bytes -> base64 -> text
>>> text -> de-base64 -> bytes
>> It's nice to hear I'm not out of step with
>> the entire world on this. :-)
> 
> Well, I can certainly understand the bytes->base64->bytes side of
> thing too.  The "text" produced is specified as using "a 65-character
> subset of US-ASCII", so that's really bytes.

If the base64 codec was a text<->bytes codec, and bytes did not have an encode 
method, then if you want to convert your original bytes to ascii bytes, you 
would do:

   ascii_bytes = orig_bytes.decode("base64").encode("ascii")

"Use base64 to convert my byte sequence to characters, then give me the 
corresponding ascii byte sequence"

To reverse the process:

   orig_bytes = ascii_bytes.decode("ascii").encode("base64")

"Use ascii to convert my byte sequence to characters, then use base64 to 
convert those characters back to the original byte sequence"

The only slightly odd aspect is that this inverts the conventional meaning of 
base64 encoding and decoding, where you expect to encode from bytes to 
characters and decode from characters to bytes.

As strings currently have both methods, the existing codec is able to use the 
conventional sense for base64: encode goes from "str-as-bytes" to 
"str-as-text" (giving a longer string with characters that fit in the base64 
subset) and decode goes from "str-as-text" to "str-as-bytes" (giving back the 
original string)

All the unicode codecs, on the other hand, use encode to get from characters 
to bytes and decode to get from bytes to characters.

So if bytes objects *did* have an encode method, it should still result in a 
unicode object, just the same as a decode method does (because you are 
encoding bytes as characters), and unicode objects would acquire a 
corresponding decode method (that decodes from a character format such as 
base64 to the original byte sequence).

In the name of TOOWTDI, I'd suggest that we just eat the slight terminology 
glitch in the rare cases like base64, hex and oct (where the character format 
is technically the encoded format), and leave it so that there is a single 
method pair (bytes.decode to go from bytes to characters, and text.encode to 
go from characters to bytes).
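As it happens, Python 3 took roughly this path: bytes.decode and str.encode are reserved for true text encodings, and base64 keeps its conventional direction in its own module. A sketch of the resulting idiom:

```python
import base64

orig = b"\x00\xffbinary payload"
text = base64.b64encode(orig).decode("ascii")   # bytes -> ASCII text
back = base64.b64decode(text.encode("ascii"))   # text -> original bytes
assert back == orig
```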

Cheers,
Nick.

-- 
Nick Coghlan   |   [EMAIL PROTECTED]   |   Brisbane, Australia
---
 http://www.boredomandlaziness.org


Re: [Python-Dev] bytes.from_hex()

2006-03-01 Thread Donovan Baarda
On Tue, 2006-02-28 at 15:23 -0800, Bill Janssen wrote:
> Greg Ewing wrote:
> > Bill Janssen wrote:
> > 
> > > bytes -> base64 -> text
> > > text -> de-base64 -> bytes
> > 
> > It's nice to hear I'm not out of step with
> > the entire world on this. :-)
> 
> Well, I can certainly understand the bytes->base64->bytes side of
> thing too.  The "text" produced is specified as using "a 65-character
> subset of US-ASCII", so that's really bytes.

Huh... just joining here but surely you don't mean a text string that
doesn't use every character available in a particular encoding is
"really bytes"... it's still a text string...

If you base64 encode some bytes, you get a string. If you then want to
access that base64 string as if it was a bunch of bytes, cast it to
bytes.

Be careful not to confuse "(type)cast" with "(type)convert"... 

A "convert" transforms the data from one type/class to another,
modifying it to be a valid equivalent instance of the other type/class;
ie int -> float. 

A "cast" does not modify the data in any way, it just changes its
type/class to be the other type, and assumes that the data is a valid
instance of the other type; eg int32 -> bytes[4]. Minor data munging
under the hood to cleanly switch the type/class is acceptable (ie adding
array length info etc) provided you keep to the spirit of the "cast".

Keep these two concepts separate and you should be right :-)
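The distinction can be illustrated with stdlib tools (struct is used here only as an analogue of a bit-pattern "cast"; Python has no true casts):

```python
import struct

n = 1
converted = float(n)           # convert: an equivalent value of a new type
casted = struct.pack("<i", n)  # cast-like: the same 32-bit pattern as bytes
assert converted == 1.0
assert casted == b"\x01\x00\x00\x00" and len(casted) == 4
```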

-- 
Donovan Baarda <[EMAIL PROTECTED]>
http://minkirri.apana.org.au/~abo/



Re: [Python-Dev] bytes.from_hex()

2006-02-28 Thread Greg Ewing
Bill Janssen wrote:

> Well, I can certainly understand the bytes->base64->bytes side of
> thing too.  The "text" produced is specified as using "a 65-character
> subset of US-ASCII", so that's really bytes.

But it then goes on to say that these same characters
are also a subset of EBCDIC. So it seems to be
talking about characters as abstract entities here,
not as bit patterns.

Greg


Re: [Python-Dev] bytes.from_hex()

2006-02-28 Thread Bill Janssen
Greg Ewing wrote:
> Bill Janssen wrote:
> 
> > bytes -> base64 -> text
> > text -> de-base64 -> bytes
> 
> It's nice to hear I'm not out of step with
> the entire world on this. :-)

Well, I can certainly understand the bytes->base64->bytes side of
thing too.  The "text" produced is specified as using "a 65-character
subset of US-ASCII", so that's really bytes.

Bill


Re: [Python-Dev] bytes.from_hex()

2006-02-28 Thread Guido van Rossum
On 2/28/06, Greg Ewing <[EMAIL PROTECTED]> wrote:
> Bill Janssen wrote:
>
> > bytes -> base64 -> text
> > text -> de-base64 -> bytes
>
> It's nice to hear I'm not out of step with
> the entire world on this. :-)

What Bill proposes makes sense to me.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] bytes.from_hex()

2006-02-28 Thread Greg Ewing
Bill Janssen wrote:

> bytes -> base64 -> text
> text -> de-base64 -> bytes

It's nice to hear I'm not out of step with
the entire world on this. :-)

-- 
Greg Ewing, Computer Science Dept, +--+
University of Canterbury,  | Carpe post meridiam! |
Christchurch, New Zealand  | (I'm not a morning person.)  |
[EMAIL PROTECTED]  +--+


Re: [Python-Dev] bytes.from_hex()

2006-02-27 Thread Greg Ewing
Bill Janssen wrote:

> I use it quite a bit for image processing (converting to and from the
> "data:" URL form), and various checksum applications (converting SHA
> into a string).

Aha! We have a customer!

For those cases, would you find it more convenient
for the result to be text or bytes in Py3k?

Greg


Re: [Python-Dev] bytes.from_hex()

2006-02-27 Thread Bill Janssen
> If implementing a mime packer is really the only use case
> for base64, then it might as well be removed from the
> standard library, since 99.9% of all programmers will
> never touch it. Those that do will need to have boned up

I use it quite a bit for image processing (converting to and from the
"data:" URL form), and various checksum applications (converting SHA
into a string).

Bill


Re: [Python-Dev] bytes.from_hex()

2006-02-27 Thread Greg Ewing
Stephen J. Turnbull wrote:

> Greg> I'd be perfectly happy with ascii characters, but in Py3k,
> Greg> the most natural place to keep ascii characters will be in
> Greg> character strings, not byte arrays.
> 
> Natural != practical.

That seems to be another thing we disagree about --
to me it seems both natural *and* practical.

The whole business of stuffing binary data down a
text channel is a practicality-beats-purity kind of
thing. You wouldn't do it if you had a real binary
channel available, but if you don't, it's better
than nothing.

> The base64 string is a representation of an object
> that doesn't have text semantics.

But the base64 string itself *does* have text
semantics. That's the whole point of base64 --
to represent a non-text object *using* text.

To me this is no different than using a string
of decimal digit characters to represent an
integer, or a string of hexadecimal digit
characters to represent a bit pattern. Would
you say that those are not text, either?

What about XML? What would you consider the
proper data type for an XML document to be
inside a Python program -- bytes or text?
I'm genuinely interested in your answer to
that, because I'm trying to understand where
you draw the line between text and non-text.

You seem to want to reserve the term "text" for
data that doesn't ever have to be understood
even a little bit by a computer program, but that
seems far too restrictive to me, and a long
way from established usage.

> Nor do base64 strings have text semantics: they can't even
> be concatenated as text ...  So if you
> wish to concatenate the underlying objects, the base64 strings must be
> decoded, concatenated, and re-encoded in the general case.

You can't add two integers by concatenating
their base-10 character representation, either,
but I wouldn't take that as an argument against
putting decimal numbers into text files.

Also, even if we follow your suggestion and store
our base64-encoded data in byte arrays, we *still*
wouldn't be able to concatenate the original data
just by concatenating those byte arrays. So this
argument makes no sense either way.

> IMO it's not worth preserving the very superficial
> coincidence of "character representation"

I disagree entirely that it's superficial. On the
contrary, it seems to me to be very essence of
what base64 is all about.

If there's any "coincidence of representation" it's
in the idea of storing the result as ASCII bit patterns
in a byte array, on the assumption that that's probably
how they're going to end up being represented down the
line.

That assumption could be very wrong. What happens if
it turns out they really need to be encoded as UTF-16,
or as EBCDIC? All hell breaks loose, as far as I can
see, unless the programmer has kept very firmly in
mind that there is an implicit ASCII encoding involved.

It's exactly to avoid the need for those kinds of
mental gymnastics that Py3k will have a unified,
encoding-agnostic data type for all character strings.

> I think that fact that favoring the coincidence of representation
> leads you to also deprecate the very natural use of the codec API to
> implement and understand base64 is indicative of a deep problem with
> the idea of implementing base64 as bytes->unicode.

Not sure I'm following you. I don't object to
implementing base64 as a codec, only to exposing
it via the same interface as the "real" unicode
codecs like utf8, etc. I thought we were in
agreement about that.

If you're thinking that the mere fact its input
type is bytes and its output type is characters is
going to lead to its mistakenly appearing via that
interface, that would be a bug or design flaw in
the mechanism that controls which codecs appear
via that interface. It needs to be controlled by
something more than just the input and output
types.

Greg


Re: [Python-Dev] bytes.from_hex()

2006-02-26 Thread Stephen J. Turnbull
> "Greg" == Greg Ewing <[EMAIL PROTECTED]> writes:

Greg> Stephen J. Turnbull wrote:

>> I gave you one, MIME processing in email

Greg> If implementing a mime packer is really the only use case
Greg> for base64, then it might as well be removed from the
Greg> standard library, since 99.9% of all programmers will
Greg> never touch it.  I don't have any real-life use cases for
Greg> base64 that a non-mime-implementer might come across, so all
Greg> I can do is imagine what shape such a use case might have.

I guess we don't have much to talk about, then.

>> Give me a use case where it matters practically that the output
>> of the base64 codec be Python unicode characters rather than
>> 8-bit ASCII characters.

Greg> I'd be perfectly happy with ascii characters, but in Py3k,
Greg> the most natural place to keep ascii characters will be in
Greg> character strings, not byte arrays.

Natural != practical.

Anyway, I disagree, and I've lived with the problems that come with an
environment that mixes objects with various underlying semantics into
a single "text stream" for a decade and a half.

That doesn't make me authoritative, but as we agree to disagree, I
hope you'll keep in mind that someone with real-world experience that
is somewhat relevant[1] to the issue doesn't find that natural at all.

Greg> Since the Unicode character set is a superset of the ASCII
Greg> character set, it doesn't seem unreasonable that they could
Greg> also be thought of as Unicode characters.

I agree.  However, as soon as I go past that intuition to thinking
about what that implies for _operations_ on the base64 string, it
begins to seem unreasonable, unnatural, and downright dangerous.  The
base64 string is a representation of an object that doesn't have text
semantics.  Nor do base64 strings have text semantics: they can't even
be concatenated as text (the pad character '=' is typically a syntax
error in a profile of base64, except as terminal padding).  So if you
wish to concatenate the underlying objects, the base64 strings must be
decoded, concatenated, and re-encoded in the general case.  IMO it's
not worth preserving the very superficial coincidence of "character
representation" in the face of such semantics.

I think that fact that favoring the coincidence of representation
leads you to also deprecate the very natural use of the codec API to
implement and understand base64 is indicative of a deep problem with
the idea of implementing base64 as bytes->unicode.


Footnotes: 
[1]  That "somewhat" is intended literally; my specialty is working
with codecs for humans in Emacs, but I've also worked with more
abstract codecs such as base64 in contexts like email, in both LISP
and Python.

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
   Ask not how you can "do" free software business;
  ask what your business can "do for" free software.


Re: [Python-Dev] bytes.from_hex()

2006-02-26 Thread Greg Ewing
Stephen J. Turnbull wrote:

> I gave you one, MIME processing in email

If implementing a mime packer is really the only use case
for base64, then it might as well be removed from the
standard library, since 99.9% of all programmers will
never touch it. Those that do will need to have boned up
on the subject of encoding until it's coming out their
ears, so they'll know what they're doing in any case. And
they'll be quite competent to write their own base64
encoder that works however they want it to.

I don't have any real-life use cases for base64 that a
non-mime-implementer might come across, so all I can do
is imagine what shape such a use case might have.

When I do that, I come up with what I've already described.
The programmer wants to send arbitrary data over a channel
that only accepts text. He doesn't know, and doesn't want
to have to know, how the channel encodes that text --
it might be ASCII or EBCDIC or morse code, it shouldn't
matter. If his Python base64 encoder produces a Python
character string, and his Python channel interface accepts
a Python character string, he doesn't have to know.

> I think it's your turn.  Give me a use case where it matters
> practically that the output of the base64 codec be Python unicode
> characters rather than 8-bit ASCII characters.

I'd be perfectly happy with ascii characters, but in Py3k,
the most natural place to keep ascii characters will be in
character strings, not byte arrays.

> Everything you have written so far is based on
> defending your maintained assumption that because Python implements
> text processing via the unicode type, everything that is described as
> a "character" must be coerced to that type.

I'm not just blindly assuming that because the RFC happens
to use the word "character". I'm also looking at how it uses
that word in an effort to understand what it means. It
*doesn't* specify what bit patterns are to be used to
represent the characters. It *does* mention two "character
sets", namely ASCII and EBCDIC, with the implication that
the characters it is talking about could be taken as being
members of either of those sets. Since the Unicode character
set is a superset of the ASCII character set, it doesn't
seem unreasonable that they could also be thought of as
Unicode characters.

> I don't really see a downside, except for the occasional double
> conversion ASCII -> unicode -> UTF-16, as is allowed (but not
> mandated) in XML's use of base64.  What downside do you see?

It appears that all your upsides I see as downsides, and
vice versa. We appear to be mutually upside-down. :-)

XML is another example. Inside a Python program, the most
natural way to represent an XML is as a character string.
Your way, embedding base64 in it would require converting
the bytes produced by the base64 encoder into a character
string in some way, taking into account the assumed ascii
encoding of said bytes. My way, you just use the result
directly, with no coding involved at all.
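A sketch of the point: if the encoder yields text, the result drops straight into a character-string document (the element and attribute names below are made up for illustration):

```python
import base64

payload = base64.b64encode(b"\x00\x01raw").decode("ascii")  # text result
xml_doc = '<data encoding="base64">%s</data>' % payload     # plain string work
assert payload in xml_doc
assert base64.b64decode(payload.encode("ascii")) == b"\x00\x01raw"
```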

-- 
Greg Ewing, Computer Science Dept, +--+
University of Canterbury,  | Carpe post meridiam! |
Christchurch, New Zealand  | (I'm not a morning person.)  |
[EMAIL PROTECTED]  +--+


Re: [Python-Dev] bytes.from_hex()

2006-02-26 Thread Stephen J. Turnbull
> "Greg" == Greg Ewing <[EMAIL PROTECTED]> writes:

Greg> I think we need some concrete use cases to talk about if
Greg> we're to get any further with this. Do you have any such use
Greg> cases in mind?

I gave you one, MIME processing in email, and a concrete bug that is
possible with the design you propose, but not in mine.  You said, "the
programmers need to try harder."  If that's an acceptable answer, I
have to concede it beats any use case I can imagine.

I think it's your turn.  Give me a use case where it matters
practically that the output of the base64 codec be Python unicode
characters rather than 8-bit ASCII characters.

I don't think you can.  Everything you have written so far is based on
defending your maintained assumption that because Python implements
text processing via the unicode type, everything that is described as
a "character" must be coerced to that type.

If you give up that assumption, you get

1.  an automatic reduction in base64.upcase() bugs because it's a type
error, ie, binary objects are not text objects, no matter what their
representation

2.  encouragement to programmer teams to carry along binary objects as
opaque blobs until they're just about to put them on the wire,
then let the wire protocol guy implement the conversion at that point

3.  efficiency for a very common case where ASCII octets are the wire
representation

4.  efficient and clear implementation and documentation using the
codec framework and API

I don't really see a downside, except for the occasional double
conversion ASCII -> unicode -> UTF-16, as is allowed (but not
mandated) in XML's use of base64.  What downside do you see?

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
   Ask not how you can "do" free software business;
  ask what your business can "do for" free software.


Re: [Python-Dev] bytes.from_hex()

2006-02-25 Thread Ron Adam
Greg Ewing wrote:
> Stephen J. Turnbull wrote:
>> It's "what is the Python compiler/interpreter going
>> to think?"  AFAICS, it's going to think that base64 is
>> a unicode codec.
> 
> Only if it's designed that way, and I specifically
> think it shouldn't -- i.e. it should be an error
> to attempt the likes of a_unicode_string.encode("base64")
> or unicode(something, "base64"). The interface for
> doing base64 encoding should be something else.

I agree

> I think we need some concrete use cases to talk
> about if we're to get any further with this. Do
> you have any such use cases in mind?
> 
> Greg

Or at least somewhere more concrete than trying to debate abstract 
meanings. ;-)

Running a short test over all the codecs and converting u"Python" to 
string and back to unicode resulted in the following output.  These are 
the only ones of the 92 that couldn't do the round trip successfully.

It seems to me these will need to be moved and/or made to work with 
unicode at some point.


Test 1: UNICODE -> STRING -> UNICODE
1: 'bz2_codec'
Traceback (most recent call last):
   File "codectest.py", line 29, in test1
 u2 = unicode(s, c)   # to unicode
TypeError: decoder did not return an unicode object (type=str)

2: 'hex_codec'
Traceback (most recent call last):
   File "codectest.py", line 29, in test1
 u2 = unicode(s, c)   # to unicode
TypeError: decoder did not return an unicode object (type=str)

3: 'uu_codec'
Traceback (most recent call last):
   File "codectest.py", line 29, in test1
 u2 = unicode(s, c)   # to unicode
TypeError: decoder did not return an unicode object (type=str)

4: 'quopri_codec'
Traceback (most recent call last):
   File "codectest.py", line 29, in test1
 u2 = unicode(s, c)   # to unicode
TypeError: decoder did not return an unicode object (type=str)

5: 'base64_codec'
Traceback (most recent call last):
   File "codectest.py", line 29, in test1
 u2 = unicode(s, c)   # to unicode
TypeError: decoder did not return an unicode object (type=str)

6: 'zlib_codec'
Traceback (most recent call last):
   File "codectest.py", line 29, in test1
 u2 = unicode(s, c)   # to unicode
TypeError: decoder did not return an unicode object (type=str)

7: 'tactis'
Traceback (most recent call last):
   File "codectest.py", line 28, in test1
 s = u1.encode(c) # to string
LookupError: unknown encoding: tactis
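[Editor's note: for reference, the Python 3 that eventually shipped resolved this by marking these as bytes-to-bytes transform codecs, reachable through the codecs module rather than through unicode()/str.encode(). A minimal sketch of the same round trip in modern Python 3:]

```python
import codecs

# hex_codec, base64_codec, zlib_codec, etc. are bytes-to-bytes
# transform codecs in Python 3; codecs.encode()/decode() reach them.
encoded = codecs.encode(b"Python", "hex_codec")
assert encoded == b"507974686f6e"
assert codecs.decode(encoded, "hex_codec") == b"Python"

# bytes.fromhex() -- the method this thread is named after -- covers
# the decoding half for hex directly.
assert bytes.fromhex("507974686f6e") == b"Python"
```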





___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] bytes.from_hex()

2006-02-25 Thread Greg Ewing
Stephen J. Turnbull wrote:

> The reason that Python source code is text is that the primary
> producers/consumers of Python source code are human beings, not
> compilers

I disagree with "primary" -- I think human and computer
use of source code have equal importance. Because of the
fact that Python source code must be acceptable to the
Python compiler, a great many transformations that would
be harmless to English text (upper casing, paragraph
wrapping, etc.) would cause disaster if applied to a
Python program. I don't see how base64 is any different.

> Yes, which implies that you assume he has control of the data all the
> way to the channel that actually requires base64.

Yes. If he doesn't, he can't safely use base64 at all.
That's true regardless of how the base64-encoded data
is represented. It's true of any data of any kind.

> Use case: the Gnus MUA supports the RFC that allows non-ASCII names in
> MIME headers that take file names...

I'm not familiar with all the details you're alluding
to here, but if there's a bug here, I'd say it's due
to somebody not thinking something through properly.
It shouldn't matter if something gets encoded four
times as long as it gets decoded four times at the
other end. If it's not possible to do that, someone
made an assumption about the channel that wasn't
true.

> It's "what is the Python compiler/interpreter going
 > to think?"  AFAICS, it's going to think that base64 is
 > a unicode codec.

Only if it's designed that way, and I specifically
think it shouldn't -- i.e. it should be an error
to attempt the likes of a_unicode_string.encode("base64")
or unicode(something, "base64"). The interface for
doing base64 encoding should be something else.

> I don't believe that "takes a character string as
> input" has any intrinsic meaning.

I'm using that phrase in the context of Python, where
it means "a function that takes a Python character
string as input".

In the particular case of base64, it has the added
restriction that it must preserve the particular
65 characters used.

 > In practice, I think it's a loaded gun
> aimed at my foot.  And yours.

Whereas it seems quite the opposite to me, i.e.
*failing* to clearly distinguish between text and
binary data here is what will lead to confusion and
foot-shooting.

I think we need some concrete use cases to talk
about if we're to get any further with this. Do
you have any such use cases in mind?

Greg




Re: [Python-Dev] bytes.from_hex()

2006-02-25 Thread Stephen J. Turnbull
> "Ron" == Ron Adam <[EMAIL PROTECTED]> writes:

Ron> So, lets consider a "codec" and a "coding" as being two
Ron> different things where a codec is a character subset of
Ron> unicode characters expressed in a native format.  And a
Ron> coding is *not* a subset of the unicode character set, but an
Ron> _operation_ performed on text.

Ron> codec ->  text is always in *one_codec* at any time.

No, a codec is an operation, not a state.

And text qua text has no need of state; the whole point of defining
text (as in the unicode type) is to abstract from such
representational issues.

Ron> Pure codecs such as latin-1 can be invoked over and over and
Ron> you can always get back what you put in in a single step.

Maybe you'd like to define them that way, but it doesn't work in
general.  Given that str and unicode currently don't carry state with
them, it's not possible for "to ASCII" and "to EBCDIC" to be
idempotent at the same time.  And for the languages spoken by 75% of
the world's population, "to latin-1" cannot be successfully invoked
even once, let alone be idempotent.  You really need to think about
how your examples apply to codecs like KOI8-R for Russian and Shift
JIS for Japanese.

In practice, I just don't think you can distinguish "codecs" from
"coding" using the kind of mathematical properties you have described
here.

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
   Ask not how you can "do" free software business;
  ask what your business can "do for" free software.


Re: [Python-Dev] bytes.from_hex()

2006-02-25 Thread Stephen J. Turnbull
> "Greg" == Greg Ewing <[EMAIL PROTECTED]> writes:

Greg> Stephen J. Turnbull wrote:

>> the kind of "text" for which Unicode was designed is normally
>> produced and consumed by people, who wll pt up w/ ll knds f
>> nnsns.  Base64 decoders will not put up with the same kinds of
>> nonsense that people will.

Greg> The Python compiler won't put up with that sort of nonsense
Greg> either. Would you consider that makes Python source code
Greg> binary data rather than text, and that it's inappropriate to
Greg> represent it using a unicode string?

The reason that Python source code is text is that the primary
producers/consumers of Python source code are human beings, not
compilers.

There are no such human producers/consumers of base64.  Unless you
prefer that I expressed that last sentence as "VGhlIHJlYXNvbiB0aG
F0IFB5dGhvbiBzb3VyY2UgY29kZSBpcyB0ZXh0IGlzIGJlY2F1c2UgdGhlIHByaW1
hcnkKcHJvZHVjZXJzL2NvbnN1bWVycyBvZiBQeXRob24gc291cmNlIGNvZGUgYXJl
IGh1bWFuIGJlaW5ncywgbm90CmNvbXBpbGVycy4="?

>> You're basically assuming that the person who implements the
>> code that processes a Unicode string is the same person who
>> implemented the code that converts a binary object into base64
>> and inserts it into a string.

Greg> No, I'm assuming the user of base64 knows the
Greg> characteristics of the channel he's using.

Yes, which implies that you assume he has control of the data all the
way to the channel that actually requires base64.

Use case: the Gnus MUA supports the RFC that allows non-ASCII names in
MIME headers that take file names.  The interface was written for
message-at-a-time use, which makes sense for composition.  Somebody
else added "save and strip part" editing capability, but this only
works one MIME part at a time.  So if you have a message with four
MIME parts and you save and strip all of them, the first one gets
encoded four times.
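[Editor's note: the failure mode is easy to reproduce: each stacked encode must be matched by exactly one decode at the other end. A small sketch, with a hypothetical payload, using Python 3's base64 module:]

```python
import base64

part = b"attachment bytes"  # hypothetical MIME part payload
# Encoded twice by accident, as in the save-and-strip bug described above.
wire = base64.b64encode(base64.b64encode(part))

# Decoding once yields base64 text, not the original data...
assert base64.b64decode(wire) != part
# ...only a matching number of decodes restores it.
assert base64.b64decode(base64.b64decode(wire)) == part
```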

The reason for *this* bug, and scores like it over the years, is that
somebody made it convenient to put wire protocols into a text
document.  Shouldn't Python do better than that?  Shouldn't Python
text be for humans, rather than be whatever had the tag "character"
attached to it for convenience of definition of a protocol for
communication of data humans can't process without mechanical
assistance?

>> I don't think it's a good idea to gratuitously introduce wire
>> protocols as unicode codecs,

Greg> I am *not* saying that base64 is a unicode codec!  If that's
Greg> what you thought I was saying, it's no wonder we're
Greg> confusing each other.

I know you don't think that it's a duck, but it waddles and quacks.
Ie, the question is not what I think you're saying.  It's "what is the
Python compiler/interpreter going to think?"  AFAICS, it's going to
think that base64 is a unicode codec.

Greg> The only time I need to use something like base64 is when I
Greg> have something that will only accept text. In Py3k, "accepts
Greg> text" is going to mean "takes a character string as input",

Characters are inherently abstract, as a class they can't be
instantiated as input or output---only derived (ie, encoded)
characters can.  I don't believe that "takes a character string as
input" has any intrinsic meaning.

Greg> Does that make it clearer what I'm getting at?

No.  I already understood what you're getting at.  As I said, I'm
sympathetic in principle.  In practice, I think it's a loaded gun
aimed at my foot.  And yours.


-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
   Ask not how you can "do" free software business;
  ask what your business can "do for" free software.


Re: [Python-Dev] bytes.from_hex()

2006-02-24 Thread Greg Ewing
Stephen J. Turnbull wrote:

> the kind of "text" for which Unicode was designed is normally produced
> and consumed by people, who wll pt up w/ ll knds f nnsns.  Base64
> decoders will not put up with the same kinds of nonsense that people
> will.

The Python compiler won't put up with that sort of
nonsense either. Would you consider that makes Python
source code binary data rather than text, and that
it's inappropriate to represent it using a unicode
string?

> You're basically assuming that the person who implements the code that
> processes a Unicode string is the same person who implemented the code
> that converts a binary object into base64 and inserts it into a
> string.

No, I'm assuming the user of base64 knows the
characteristics of the channel he's using. You
can only use base64 if you know the channel
promises not to munge the particular characters
that base64 uses. If you don't know that, you
shouldn't be trying to send base64 through that
channel.

> In most environments, it should be possible to hide bytes<->unicode
> codecs almost all the time,

But it *is* hidden in the situation I'm talking
about, because all the Unicode encoding/decoding
takes place inside the implementation of the
text channel, which I'm taking as a given.

> I don't think it's a good idea to gratuitously introduce
 > wire protocols as unicode codecs,

I am *not* saying that base64 is a unicode codec!
If that's what you thought I was saying, it's no
wonder we're confusing each other.

It's just a transformation from bytes to
text. I'm only calling it unicode because all
text will be unicode in Py3k. In py2.x it could
just as well be a str -- but a str interpreted
as text, not binary.

> What do you think the email module does?
> Assuming conforming MIME messages

But I'm not assuming mime in the first place. If I
have a mail interface that will accept chunks of
binary data and encode them as a mime message for
me, then I don't need to use base64 in the first
place.

The only time I need to use something like base64
is when I have something that will only accept
text. In Py3k, "accepts text" is going to mean
"takes a character string as input", where
"character string" is a distinct type from
"binary data". So having base64 produce anything
other than a character string would be awkward
and inconvenient.

I phrased that paragraph carefully to avoid using
the word "unicode" anywhere. Does that make it
clearer what I'm getting at?
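[Editor's note: for the record, the Python 3 that eventually shipped chose the opposite convention: base64.b64encode() takes bytes and returns bytes, so feeding a text-only channel takes one extra explicit step.]

```python
import base64

data = b"\x00\xffbinary payload"
encoded = base64.b64encode(data)  # bytes in, bytes out
assert isinstance(encoded, bytes)

# Getting a character string for a text channel is an explicit step:
text = encoded.decode("ascii")
assert isinstance(text, str)
assert base64.b64decode(text.encode("ascii")) == data
```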

--
Greg


Re: [Python-Dev] bytes.from_hex()

2006-02-24 Thread Ron Adam

* The following reply is a rather longer explanation than I intended of 
why codings like 'rot13' (and how they differ) aren't the same thing as 
pure unicode codecs and probably should be treated differently.
If you already understand that, then I suggest skipping this.  But if 
you like detailed logical analysis, it might be of some interest even if 
it's reviewing the obvious to those who already know.

(And hopefully I didn't make any really obvious errors myself.)


Stephen J. Turnbull wrote:
>> "Ron" == Ron Adam <[EMAIL PROTECTED]> writes:
> 
> Ron> We could call it transform or translate if needed.
> 
> You're still losing the directionality, which is my primary objection
> to "recode".  The absence of directionality is precisely why "recode"
> is used in that sense for i18n work.

I think you're not understanding what I suggested.  It might help if we 
could agree on some points and then go from there.

So, let's consider a "codec" and a "coding" as being two different things 
where a codec is a character subset of unicode characters expressed in 
a native format, and a coding is *not* a subset of the unicode 
character set, but an _operation_ performed on text.  So you would have 
the following properties.

codec ->  text is always in *one_codec* at any time.

coding ->  operation performed on text.

Let's add a special default coding called 'none' to represent a 
do-nothing coding (figuratively, for explanation purposes).

'none' -> return the input as is, or the uncoded text


Given the above relationships we have the following possible 
transformations.

   1. codec to like codec:   'ascii' to 'ascii'
   2. codec to unlike codec:   'ascii' to 'latin1'

And we have coding relationships of:

   a. coding to like coding  # Unchanged, do nothing
   b. coding to unlike coding


Then we can express all the possible combinations as...

[1.a, 1.b, 2.a, 2.b]


1.a -> coding in codec to like coding in like codec:

'none' in 'ascii' to 'none' in 'ascii'

1.b -> coding in codec to diff coding in like codec:

'none' in 'ascii' to 'base64' in 'ascii'

2.a -> coding in codec to same coding in diff codec:

'none' in 'ascii' to 'none' in 'latin1'

2.b -> coding in codec to diff coding in diff codec:

'none' in 'latin1' to 'base64' in 'ascii'

This last one is a problem, as some codecs combine coding with character 
set encoding and return text in a different encoding than they received. 
The line is also blurred between types and encodings.  Is unicode an 
encoding?  Will bytes also be an encoding?


Using the above combinations:

(1.a) is just creating a new copy of a object.

s = str(s)


(1.b) is recoding an object, it returns a copy of the object in the same 
encoding.

s = s.encode('hex-codec')  # ascii str -> ascii str coded in hex
s = s.decode('hex-codec')  # ascii str coded in hex -> ascii str

* these are really two different operations, and encoding repeatedly 
results in nested codings.  Codecs (as a pure subset of unicode) don't 
have that property.

* the hex-codec also fits the 2.b pattern below if the source string is 
of a different type than ascii (or the default string?).


(2.a) creates a copy encoded in a new codec.

s = s.encode('latin1')

* I believe string constructors should have an encoding argument for use 
with unicode strings.

s = str(u, 'latin1')   # This would match the bytes constructor.


(2.b) are combinations of the above.

   s = u.encode('base64')
  # unicode to ascii string as base64 coded characters

   u = unicode(s.decode('base64'))
  # ascii string coded in base64 to unicode characters

or

>>> u = unicode(s, 'base64')
  Traceback (most recent call last):
File "", line 1, in ?
  TypeError: decoder did not return an unicode object (type=str)

Ooops...  ;)
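[Editor's note: Python 3 eventually drew exactly this line: character-set codecs hang off str/bytes methods, while transform codecs such as base64 are bytes-to-bytes and must be reached through the codecs module. A sketch:]

```python
import codecs

# Character-set codec: str <-> bytes.
u = "Pyth\u00f6n"
assert u.encode("utf-8").decode("utf-8") == u

# Transform codec: bytes <-> bytes, via the codecs module.
coded = codecs.encode(u.encode("utf-8"), "base64_codec")
assert codecs.decode(coded, "base64_codec").decode("utf-8") == u

# The mixed call that raised TypeError above is now rejected outright.
try:
    u.encode("base64_codec")
except (LookupError, TypeError):
    pass
else:
    raise AssertionError("expected str.encode('base64_codec') to fail")
```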

So is coding the same as a codec?  I think they have different 
properties and should be treated differently except when the 
practicality over purity rule is needed.  And in those cases maybe the 
names could clearly state the result.

u.decode('base64ascii')  # name indicates coding to codec


> A string. -> QSBzdHJpbmcu -> UVNCemRISnBibWN1

Looks like the underlying sequence is:

  native string -> unicode -> unicode coded base64 -> coded ascii str

And decode operation would be...

  coded ascii str -> unicode coded base64 -> unicode -> ascii str

Except it may combine some of these steps to speed it up.

Since it's a hybrid codec that includes a coding operation, we have to 
treat it as a codec.


> Ron> * Given that the string type gains a __codec__ attribute
> Ron> to handle automatic decoding when needed.  (is there a reason
> Ron> not to?)
> 
> Ron>str(object[,codec][,error]) -> string coded with codec
> 
> Ron>unicode(object[,error]) -> unicode
> 
> Ron>bytes(object) -> bytes
> 
> str == unicode in Py3k, so this is a non-starter.  What do you want
> to say?

Re: [Python-Dev] bytes.from_hex()

2006-02-24 Thread Stephen J. Turnbull
> "Greg" == Greg Ewing <[EMAIL PROTECTED]> writes:

Greg> Stephen J. Turnbull wrote:

>> No, base64 isn't a wire protocol.  It's a family[...].

Greg> Yes, and it's up to the programmer to choose those code
Greg> units (i.e. pick an encoding for the characters) that will,
Greg> in fact, pass through the channel he is using without
Greg> corruption. I don't see how any of this is inconsistent with
Greg> what I've said.

It's not.  It just shows that there are other "correct" ways to think
about the issue.

>> Only if you do no transformations that will harm the
>> base64-encoding.  ...  It doesn't allow any of the usual
>> transformations on characters that might be applied globally to
>> a mail composition buffer, for example.

Greg> I don't understand that. Obviously if you rot13 your mail
Greg> message or turn it into pig latin or something, it's going
Greg> to mess up any base64 it might contain.  But that would be a
Greg> silly thing to do to a message containing base64.

What "message containing base64"?  "Any base64 in there?"  "Nope,
nobody here but us Unicode characters!"  I certainly hope that in Py3k
bytes objects will have neither ROT13 nor case-changing methods, but
str objects certainly will.  Why give up the safety of that
distinction?

Greg> Given any piece of text, there are things it makes sense to
Greg> do with it and things it doesn't, depending entirely on the
Greg> use to which the text will eventually be put.  I don't see
Greg> how base64 is any different in this regard.

If you're going to be binary about it, it's not different.  However
the kind of "text" for which Unicode was designed is normally produced
and consumed by people, who wll pt up w/ ll knds f nnsns.  Base64
decoders will not put up with the same kinds of nonsense that people
will.

You're basically assuming that the person who implements the code that
processes a Unicode string is the same person who implemented the code
that converts a binary object into base64 and inserts it into a
string.  I think that's a dangerous (and certainly invalid) assumption.

I know I've lost time and data to applications that make assumptions
like that.  In fact, that's why "MULE" is a four-letter word in Emacs
channels.

>> So then you bring it right back in with base64.  Now they need
>> to know about bytes<->unicode codecs.

Greg> No, they need to know about the characteristics of the
Greg> channel over which they're sending the data.

I meant it in a trivial sense: "How do you use a bytes<->unicode codec
properly without knowing that it's a bytes<->unicode codec?"

In most environments, it should be possible to hide bytes<->unicode
codecs almost all the time, and I think that's a very good thing.  I
don't think it's a good idea to gratuitously introduce wire protocols
as unicode codecs, even if a class of bit patterns which represent the
integer 65 are denoted "A" in various sources.  Practicality beats
purity (especially when you're talking about the purity of a pregnant
virgin).

Greg> It might be appropriate to use base64 followed by some
Greg> encoding, but the programmer needs to be aware of that and
Greg> choose the encoding wisely. It's not possible to shield him
Greg> from having to know about encodings in that situation, even
Greg> if the encoding is just ascii.

What do you think the email module does?  Assuming conforming MIME
messages and receivers capable of handling UTF-8, the user of the
email module does not need to know anything about any encodings at
all.  With a little more smarts, the email module could even make a
good choice of output encoding based on the _language_ of the text,
removing the restriction to UTF-8 on the output side, too.  With the
aid of file(1), it can make excellent guesses about attachments.

Sure, the email module programmer needs to know, but the email module
programmer needs to know an awful lot about codecs anyway, since mail
at that level is a binary channel, while users will be throwing a
mixed bag of binary and textual objects at it.

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
   Ask not how you can "do" free software business;
  ask what your business can "do for" free software.


Re: [Python-Dev] bytes.from_hex()

2006-02-24 Thread Stephen J. Turnbull
> "Ron" == Ron Adam <[EMAIL PROTECTED]> writes:

Ron> We could call it transform or translate if needed.

You're still losing the directionality, which is my primary objection
to "recode".  The absence of directionality is precisely why "recode"
is used in that sense for i18n work.

There really isn't a good reason that I can see to use anything other
than the pair "encode" and "decode".  In monolingual environments,
once _all_ human-readable text (specifically including Python programs
and console I/O) is automatically mapped to a Python (unicode) string,
most programmers will never need to think about it as long as Python
(the project) very very strongly encourages that all Python programs
be written in UTF-8 if there's any chance the program will be reused
in a locale other than the one where it was written.  (Alternatively
you can depend on PEP 263 coding cookies.)  Then the user (or the
Python interpreter) just changes console and file I/O codecs to the
encoding in use in that locale, and everything just works.

So the remaining uses of "encode" and "decode" are for advanced users
and specialists: people using stuff like base64 or gzip, and those who
need to use unicode codecs explicitly.

I could be wrong about the possibility to get rid of explicit unicode
codec use in monolingual environments, but I hope that we can at least
try to achieve that.

>> Unlikely.  Errors like "A
>> string".encode("base64").encode("base64") are all too easy to
>> commit in practice.

Ron> Yes,... and wouldn't the above just result in a copy so it
Ron> wouldn't be an outright error.

No, you either get the following:

A string. -> QSBzdHJpbmcu -> UVNCemRISnBibWN1

or you might get an error if base64 is defined as bytes->unicode.
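[Editor's note: the chain above checks out with Python 3's base64 module, and it shows why the double encode is not a harmless copy:]

```python
import base64

s = b"A string."
once = base64.b64encode(s)
twice = base64.b64encode(once)
assert once == b"QSBzdHJpbmcu"
assert twice == b"UVNCemRISnBibWN1"

# A second encode is not idempotent; it demands a second decode.
assert base64.b64decode(base64.b64decode(twice)) == s
```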

Ron> * Given that the string type gains a __codec__ attribute
Ron> to handle automatic decoding when needed.  (is there a reason
Ron> not to?)

Ron>str(object[,codec][,error]) -> string coded with codec

Ron>unicode(object[,error]) -> unicode

Ron>bytes(object) -> bytes

str == unicode in Py3k, so this is a non-starter.  What do you want to
say?

Ron>  * a recode() method is used for transformations that
Ron> *do_not* change the current codec.

I'm not sure what you mean by the "current codec".  If it's attached
to an "encoded object", it should be the codec needed to decode the
object.  And it should be allowed to be a "codec stack".  So suppose
you start with a unicode object "obj".  Then

>>> bytes = bytes (obj, 'utf-8')# implicit .encode()
>>> print bytes.codec
['utf-8']
>>> wire = bytes.encode ('base64')  # with apologies to Greg E.
>>> print wire.codec
['base64', 'utf-8']
>>> obj2 = wire.decode ('gzip')
CodecMatchException
>>> obj2 = wire.decode (wire.codec)
>>> print obj == obj2
True
>>> print obj2.codec
[]

or maybe None for the last.  I think this would be very nice as a
basis for improving the email module (for one), but I don't really
think it belongs in Python core.

Ron> That may be why it wasn't done this way to start.  (?)

I suspect the real reason is that Marc-Andre had the generalized codec
in mind from Day 0, and your proposal only works with duck-typing if
codecs always have a well-defined signature with two different types
for the argument and return of the "constructor".

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
   Ask not how you can "do" free software business;
  ask what your business can "do for" free software.


Re: [Python-Dev] bytes.from_hex()

2006-02-23 Thread Greg Ewing
Stephen J. Turnbull wrote:

> Please define "character," and explain how its semantics map to
> Python's unicode objects.

One of the 65 abstract entities referred to in the RFC
and represented in that RFC by certain visual glyphs.
There is a subset of the Unicode code points that
are conventionally associated with very similar glyphs,
so that there is an obvious one-to-one mapping between
these entities and those Unicode code points. These
entities therefore have a natural and obvious
representation using Python unicode strings.
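[Editor's note: a quick sketch makes the count concrete: the 64 alphabet characters plus the "=" padding character give the 65 entities, all of which map to ASCII-range Unicode code points:]

```python
import string

alphabet = (string.ascii_uppercase + string.ascii_lowercase
            + string.digits + "+/")
assert len(alphabet) == 64

entities = set(alphabet) | {"="}  # the 65 abstract entities of the RFC
assert len(entities) == 65
assert all(ord(c) < 128 for c in entities)  # all ASCII-range code points
```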

> No, base64 isn't a wire protocol.  Rather, it's a schema for a family
> of wire protocols, whose alphabets are heuristically chosen on the
> assumption that code units which happen to correspond to alpha-numeric
> code points in a commonly-used coded character set are more likely to
> pass through a communication channel without corruption.

Yes, and it's up to the programmer to choose those code
units (i.e. pick an encoding for the characters) that
will, in fact, pass through the channel he is using
without corruption. I don't see how any of this is
inconsistent with what I've said.

> Only if you do no transformations that will harm the base64-encoding.
> ...  It doesn't allow any of the
> usual transformations on characters that might be applied globally to
> a mail composition buffer, for example.

I don't understand that. Obviously if you rot13 your
mail message or turn it into pig latin or something,
it's going to mess up any base64 it might contain.
But that would be a silly thing to do to a message
containing base64.

Given any piece of text, there are things it makes
sense to do with it and things it doesn't, depending
entirely on the use to which the text will eventually
be put. I don't see how base64 is any different in
this regard.

> So then you bring it right back in with base64.  Now they need to know
> about bytes<->unicode codecs.

No, they need to know about the characteristics of
the channel over which they're sending the data.

Base64 is designed for situations in which you
have a *text* channel that you know is capable of
transmitting at least a certain subset of characters,
where "character" means whatever is used as input
to that channel.

In Py3k, text will be represented by unicode strings.
So a Py3k text channel should take unicode as its
input, not bytes.

I think we've got a bit sidetracked by talking about
mime. I wasn't actually thinking about mime, but
just a plain text message into which some base64
data was being inserted. That's the way we used to
do things in the old days with uuencode etc, before
mime was invented.

Here, the "channel" is NOT the socket or whatever
that the ultimate transmission takes place over --
it's the interface to your mail sending software
that takes a piece of plain text and sends it off
as a mail message somehow.

In Py3k, if a channel doesn't take unicode as input,
then it's not a text channel, and it's not appropriate
to be using base64 with it directly. It might be
appropriate to use base64 followed by some encoding,
but the programmer needs to be aware of that and
choose the encoding wisely. It's not possible to
shield him from having to know about encodings in
that situation, even if the encoding is just ascii.
Trying to do so will just lead to more confusion,
in my opinion.

Greg


Re: [Python-Dev] bytes.from_hex()

2006-02-22 Thread Ron Adam
Stephen J. Turnbull wrote:
>> "Ron" == Ron Adam <[EMAIL PROTECTED]> writes:
> 
> Ron> Terry Reedy wrote:
> 
> >> I prefer the shorter names and using recode, for instance, for
> >> bytes to bytes.
> 
> Ron> While I prefer constructors with an explicit encode argument,
> Ron> and use a recode() method for 'like to like' coding. 
> 
> 'Recode' is a great name for the conceptual process, but the methods
> are directional.  Also, in internationalization work, "recode"
> strongly connotes "encodingA -> original -> encodingB", as in iconv.

We could call it transform or translate if needed.  Words are reused 
constantly in languages, so I don't think it's a sticking point.  As 
long as its meaning is documented well and doesn't change later, I think 
it would be just fine.  If the concept of not having encode and decode 
as methods works (and has support other than me), the name can be decided 
later.

> I do prefer constructors, as it's generally not a good idea to do
> encoding/decoding in-place for human-readable text, since the codecs
> are often lossy.
> 
> Ron> Then the whole encode/decode confusion goes away.
> 
> Unlikely.  Errors like "A string".encode("base64").encode("base64")
> are all too easy to commit in practice.

Yes,... and wouldn't the above just result in a copy, so it wouldn't be 
an outright error.  But I understand that you mean similar cases where 
it would change the bytes with consecutive calls.  In any case, I was 
referring to the confusion with the method names and how they are used.


This is how I was thinking of it.

* Given that the string type gains a __codec__ attribute to handle 
automatic decoding when needed.   (is there a reason not to?)

   str(object[,codec][,error]) -> string coded with codec

   unicode(object[,error]) -> unicode

   bytes(object) -> bytes

 * a recode() method is used for transformations that *do_not* 
change the current codec.

See any problems with it?  (Other than from gross misuse of course and 
your dislike of 'recode' as the name.)

There may still be a __decode__() method on strings to do the actual 
decoding, but it wouldn't be part of the public interface.  Or it could 
call a function from the codec to do it.

 return self.codec.decode(self)

The only catching point I see is whether having an additional attribute 
on strings would increase the memory used by many small strings.  That 
may be why it wasn't done this way to start.  (?)

Cheers,
Ron





Re: [Python-Dev] bytes.from_hex()

2006-02-22 Thread Stephen J. Turnbull
> "Ron" == Ron Adam <[EMAIL PROTECTED]> writes:

Ron> Terry Reedy wrote:

>> I prefer the shorter names and using recode, for instance, for
>> bytes to bytes.

Ron> While I prefer constructors with an explicit encode argument,
Ron> and use a recode() method for 'like to like' coding. 

'Recode' is a great name for the conceptual process, but the methods
are directional.  Also, in internationalization work, "recode"
strongly connotes "encodingA -> original -> encodingB", as in iconv.

I do prefer constructors, as it's generally not a good idea to do
encoding/decoding in-place for human-readable text, since the codecs
are often lossy.

Ron> Then the whole encode/decode confusion goes away.

Unlikely.  Errors like "A string".encode("base64").encode("base64")
are all too easy to commit in practice.
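The slip is easy to reproduce with today's base64 module: the second call happily re-encodes the already-encoded bytes, with no error to catch the mistake.

```python
import base64

data = b"A string"
once = base64.b64encode(data)
twice = base64.b64encode(once)   # nothing stops the double encoding

# One decode of the doubly-encoded value yields base64 text, not the data.
assert base64.b64decode(twice) == once
assert base64.b64decode(once) == data
```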

-- 
School of Systems and Information Engineering  http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba  Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
   Ask not how you can "do" free software business;
  ask what your business can "do for" free software.


Re: [Python-Dev] bytes.from_hex()

2006-02-22 Thread Stephen J. Turnbull
> "Greg" == Greg Ewing <[EMAIL PROTECTED]> writes:

Greg> Stephen J. Turnbull wrote:

>> Base64 is a (family of) wire protocol(s).  It's not clear to me
>> that it makes sense to say that the alphabets used by "baseNN"
>> encodings are composed of characters,

Greg> Take a look at [this that the other]

Those references use "character" in an ambiguous and ill-defined way.
Trying to impose Python unicode object semantics on "vague characters"
is a bad idea IMO.

Greg> Which seems to make it perfectly clear that the result of
Greg> the encoding is to be considered as characters, which are
Greg> not necessarily going to be encoded using ascii.

Please define "character," and explain how its semantics map to
Python's unicode objects.

Greg> So base64 on its own is *not* a wire protocol. Only after
Greg> encoding the characters do you have a wire protocol.

No, base64 isn't a wire protocol.  Rather, it's a schema for a family
of wire protocols, whose alphabets are heuristically chosen on the
assumption that code units which happen to correspond to alpha-numeric
code points in a commonly-used coded character set are more likely to
pass through a communication channel without corruption.

Note that I have _precisely_ defined what I mean.  You still have the
problem that you haven't defined character, and that is a real
problem, see below.

>> I don't see any case for "correctness" here, only for
>> convenience,

Greg> I'm thinking of convenience, too. Keep in mind that in Py3k,
Greg> 'unicode' will be called 'str' (or something equally neutral
Greg> like 'text') and you will rarely have to deal explicitly
Greg> with unicode codings, this being done mostly for you by the
Greg> I/O objects. So most of the time, using base64 will be just
Greg> as convenient as it is today: base64_encode(my_bytes) and
Greg> write the result out somewhere.

Convenient, yes, but incorrect.  Once you mix those bytes with the
Python string type, they become subject to all the usual operations on
characters, and there's no way for Python to tell you that you didn't
want to do that.  Ie,

Greg> Whereas if the result is text, the right thing happens
Greg> automatically whatever the ultimate encoding turns out to
Greg> be. You can take the text from your base64 encoding, combine
Greg> it with other text from any other source to form a complete
Greg> mail message or xml document or whatever, and write it out
Greg> through a file object that's using any unicode encoding at
Greg> all, and the result will be correct.

Only if you do no transformations that will harm the base64-encoding.
This is why I say base64 is _not_ based on characters, at least not in
the way they are used in Python strings.  It doesn't allow any of the
usual transformations on characters that might be applied globally to
a mail composition buffer, for example.

In other words, you don't escape from the programmer having to know
what he's doing.  EIBTI, and the setup I advocate forces the
programmer to explicitly decide where to convert base64 objects to a
textual representation.  This reminds him that he'd better not touch
that text.

Greg> The reason I say it's *correct* is that if you go straight
Greg> from bytes to bytes, you're *assuming* the eventual encoding
Greg> is going to be an ascii superset.  The programmer is going
Greg> to have to know about this assumption and understand all its
Greg> consequences and decide whether it's right, and if not, do
Greg> something to change it.

I'm not assuming any such thing, except in the context of analysis of
implementation efficiency.  And the programmer needs to know about the
semantics of text that is actually a base64-encoded object, and that
they are different from string semantics.

This is something that programmers are used to dealing with in the
case of Python 2.x str and C char[]; the whole point of the unicode
type is to allow the programmer to abstract from that when dealing
with human-readable text.  Why confuse the issue?

>> And in the classroom, you're just going to confuse students by
>> telling them that UTF-8 --[Unicode codec]--> Python string is
>> decoding but UTF-8 --[base64 codec]--> Python string is
>> encoding, when MAL is telling them that --> Python string is
>> always decoding.

Greg> Which is why I think that only *unicode* codings should be
Greg> available through the .encode and .decode interface. Or
Greg> alternatively there should be something more explicit like
Greg> .unicode_encode and .unicode_decode that is thus restricted.

Greg> Also, if most unicode coding is done in the I/O objects,
Greg> there will be far less need for programmers to do explicit
Greg> unicode coding in the first place, so likely it will become
Greg> more of an advanced topic, rather than something you need to
Greg> come to grips with on day one of using unicode, like it is
Greg> now.

Re: [Python-Dev] bytes.from_hex()

2006-02-22 Thread Greg Ewing
James Y Knight wrote:

> Some MIME sections might have a base64 Content-Transfer-Encoding,
> others might be 8bit encoded, others might be 7bit encoded, others
> might be quoted-printable encoded.

I stand corrected -- in that situation you would have to encode
the characters before combining them with other material.

However, this doesn't change my view that the result of base64
encoding by itself is characters, not bytes. To go straight
to bytes would require assuming an encoding, and that would
make it *harder* to use in cases where you wanted a different
encoding, because you'd first have to undo the default
encoding and then re-encode it using the one you wanted.

It may be reasonable to provide an easy way to go straight
from raw bytes to ascii-encoded-base64 bytes, but that should
be a different codec. The plain base64 codec should produce
text.

-- 
Greg Ewing, Computer Science Dept, +-----------------------------+
University of Canterbury,          | Carpe post meridiam!        |
Christchurch, New Zealand          | (I'm not a morning person.) |
[EMAIL PROTECTED]                  +-----------------------------+


Re: [Python-Dev] bytes.from_hex()

2006-02-22 Thread Greg Ewing
Ron Adam wrote:

> While I prefer constructors with an explicit encode argument, and use a 
> recode() method for 'like to like' coding.  Then the whole encode/decode 
> confusion goes away.

I'd be happy with that, too.



Re: [Python-Dev] bytes.from_hex()

2006-02-22 Thread Greg Ewing
Terry Reedy wrote:
> "Greg Ewing" <[EMAIL PROTECTED]> wrote in message 
>
>>Efficiency is an implementation concern.
>
> It is also a user concern, especially if inefficiency overruns memory 
> limits.

Sure, but what I mean is that it's better to find what's
conceptually right and then look for an efficient way
of implementing it, rather than letting the implementation
drive the design.



Re: [Python-Dev] bytes.from_hex()

2006-02-22 Thread Ron Adam
Terry Reedy wrote:

> "Greg Ewing" <[EMAIL PROTECTED]> wrote in message 
>
>> Which is why I think that only *unicode* codings should be
>> available through the .encode and .decode interface. Or
>> alternatively there should be something more explicit like
>> .unicode_encode and .unicode_decode that is thus restricted.
> 
> I prefer the shorter names and using recode, for instance, for bytes to 
> bytes.

While I prefer constructors with an explicit encode argument, and use a 
recode() method for 'like to like' coding.  Then the whole encode/decode 
confusion goes away.




Re: [Python-Dev] bytes.from_hex()

2006-02-22 Thread Terry Reedy

"Greg Ewing" <[EMAIL PROTECTED]> wrote in message 
news:[EMAIL PROTECTED]
> Efficiency is an implementation concern.

It is also a user concern, especially if inefficiency overruns memory 
limits.

> In Py3k, strings
> which contain only ascii or latin-1 might be stored as
> 1 byte per character, in which case this would not be an
> issue.

If 'might' becomes 'will', I and I suspect others will be happier with the 
change.  And I would be happy if the choice of physical storage was pretty 
much handled behind the scenes, as with the direction int/long unification 
is going.

> Which is why I think that only *unicode* codings should be
> available through the .encode and .decode interface. Or
> alternatively there should be something more explicit like
> .unicode_encode and .unicode_decode that is thus restricted.

I prefer the shorter names and using recode, for instance, for bytes to 
bytes.

Terry Jan Reedy





Re: [Python-Dev] bytes.from_hex()

2006-02-22 Thread James Y Knight

On Feb 22, 2006, at 6:35 AM, Greg Ewing wrote:

> I'm thinking of convenience, too. Keep in mind that in Py3k,
> 'unicode' will be called 'str' (or something equally neutral
> like 'text') and you will rarely have to deal explicitly with
> unicode codings, this being done mostly for you by the I/O
> objects. So most of the time, using base64 will be just as
> convenient as it is today: base64_encode(my_bytes) and write
> the result out somewhere.
>
> The reason I say it's *correct* is that if you go straight
> from bytes to bytes, you're *assuming* the eventual encoding
> is going to be an ascii superset. The programmer is going to
> have to know about this assumption and understand all its
> consequences and decide whether it's right, and if not, do
> something to change it.
>
> Whereas if the result is text, the right thing happens
> automatically whatever the ultimate encoding turns out to
> be. You can take the text from your base64 encoding, combine
> it with other text from any other source to form a complete
> mail message or xml document or whatever, and write it out
> through a file object that's using any unicode encoding
> at all, and the result will be correct.

This makes little sense for mail. You combine *bytes*, in various and
possibly different encodings, to form a mail message. Some MIME
sections might have a base64 Content-Transfer-Encoding, others might
be 8bit encoded, others might be 7bit encoded, others might be
quoted-printable encoded. Before the C-T-E encoding, you will have had
to do the Content-Type encoding, converting your text into bytes with
the desired character encoding: utf-8, iso-8859-1, etc. Having the
final mail message be made up of "characters" right before
transmission to the socket would be crazy.

James


Re: [Python-Dev] bytes.from_hex()

2006-02-22 Thread Greg Ewing
Stephen J. Turnbull wrote:

> Base64 is a (family of) wire protocol(s).  It's not clear to me that
> it makes sense to say that the alphabets used by "baseNN" encodings
> are composed of characters,

Take a look at

   http://en.wikipedia.org/wiki/Base64

where it says

   ...base64 is a binary to text encoding scheme whereby an
   arbitrary sequence of bytes is converted to a sequence of
   printable ASCII characters.

Also see RFC 2045 (http://www.ietf.org/rfc/rfc2045.txt) which
defines base64 in terms of an encoding from octets to characters,
and also says

   A 65-character subset of US-ASCII is used ... This subset has
   the important property that it is represented identically in
   all versions of ISO 646 ... and all characters in the subset
   are also represented identically in all versions of EBCDIC.

Which seems to make it perfectly clear that the result
of the encoding is to be considered as characters, which
are not necessarily going to be encoded using ascii.

So base64 on its own is *not* a wire protocol. Only after
encoding the characters do you have a wire protocol.

> I don't see any case for
> "correctness" here, only for convenience,

I'm thinking of convenience, too. Keep in mind that in Py3k,
'unicode' will be called 'str' (or something equally neutral
like 'text') and you will rarely have to deal explicitly with
unicode codings, this being done mostly for you by the I/O
objects. So most of the time, using base64 will be just as
convenient as it is today: base64_encode(my_bytes) and write
the result out somewhere.

The reason I say it's *correct* is that if you go straight
from bytes to bytes, you're *assuming* the eventual encoding
is going to be an ascii superset. The programmer is going to
have to know about this assumption and understand all its
consequences and decide whether it's right, and if not, do
something to change it.

Whereas if the result is text, the right thing happens
automatically whatever the ultimate encoding turns out to
be. You can take the text from your base64 encoding, combine
it with other text from any other source to form a complete
mail message or xml document or whatever, and write it out
through a file object that's using any unicode encoding
at all, and the result will be correct.
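Greg's text-result view can be sketched with today's tools: treat the base64 result as text, splice it into a larger document, and let the eventual encoding be whatever the channel requires.

```python
import base64

# base64 result treated as characters, not bytes
b64_text = base64.b64encode(b"\xde\xad\xbe\xef").decode("ascii")
doc = "<data>" + b64_text + "</data>"   # combine with other text freely

wire = doc.encode("utf-16")             # the ultimate encoding can be anything
assert wire.decode("utf-16") == doc     # document survives the round trip
assert base64.b64decode(b64_text) == b"\xde\xad\xbe\xef"
```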

> it's also efficient to use bytes<->bytes for XML, since
> conversion of base64 bytes to UTF-8 characters is simply a matter of
> "Simon says, be UTF-8!"

Efficiency is an implementation concern. In Py3k, strings
which contain only ascii or latin-1 might be stored as
1 byte per character, in which case this would not be an
issue.

> And in the classroom, you're just going to confuse students by telling
> them that UTF-8 --[Unicode codec]--> Python string is decoding but
> UTF-8 --[base64 codec]--> Python string is encoding, when MAL is
> telling them that --> Python string is always decoding.

Which is why I think that only *unicode* codings should be
available through the .encode and .decode interface. Or
alternatively there should be something more explicit like
.unicode_encode and .unicode_decode that is thus restricted.

Also, if most unicode coding is done in the I/O objects, there
will be far less need for programmers to do explicit unicode
coding in the first place, so likely it will become more of
an advanced topic, rather than something you need to come to
grips with on day one of using unicode, like it is now.

--
Greg


Re: [Python-Dev] bytes.from_hex()

2006-02-22 Thread Stephen J. Turnbull
> "Greg" == Greg Ewing <[EMAIL PROTECTED]> writes:

Greg> Stephen J. Turnbull wrote:

>> What I advocate for Python is to require that the standard
>> base64 codec be defined only on bytes, and always produce
>> bytes.

Greg> I don't understand that. It seems quite clear to me that
Greg> base64 encoding (in the general sense of encoding, not the
Greg> unicode sense) takes binary data (bytes) and produces
Greg> characters.

Base64 is a (family of) wire protocol(s).  It's not clear to me that
it makes sense to say that the alphabets used by "baseNN" encodings
are composed of characters, but suppose we stipulate that.

Greg> So in Py3k the correct usage would be [bytes<->unicode].

IMHO, as a wire protocol, base64 simply doesn't care what Python's
internal representation of characters is.  I don't see any case for
"correctness" here, only for convenience, both for programmers on the
job and students in the classroom.  We can choose the character set
that works best for us.  I think that's 8-bit US ASCII.

My belief is that bytes<->bytes is going to be the dominant use case,
although I don't use binary representation in XML.  However, AFAIK for
on-the-wire use, UTF-8 is strongly recommended for XML, and in that
case it's also efficient to use bytes<->bytes for XML, since
conversion of base64 bytes to UTF-8 characters is simply a matter of
"Simon says, be UTF-8!"
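In code, the "Simon says" step really is a no-op reinterpretation: the base64 alphabet is a subset of ASCII, and ASCII bytes are already valid UTF-8.

```python
import base64

b64 = base64.b64encode(b"\x00\xff\x10 payload")  # bytes over the base64 alphabet

# The "conversion" to UTF-8 text changes nothing: same text either way,
# and encoding that text back to UTF-8 reproduces the original bytes.
assert b64.decode("utf-8") == b64.decode("ascii")
assert b64.decode("utf-8").encode("utf-8") == b64
```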

And in the classroom, you're just going to confuse students by telling
them that UTF-8 --[Unicode codec]--> Python string is decoding but
UTF-8 --[base64 codec]--> Python string is encoding, when MAL is
telling them that --> Python string is always decoding.

Sure, it all makes sense if you already know what's going on.  But I
have trouble remembering, especially in cases like UTF-8 vs UTF-16
where Perl and Python have opposite internal representations, and
glibc has a third which isn't either.  If base64 (and gzip, etc) are
all considered bytes<->bytes, there just isn't an issue any more.  The
simple rule wins: to Python string is always decoding.

Why fight it when we can run away with efficiency gains?

(In the above, "Python string" means the unicode type, not str.)



Re: [Python-Dev] bytes.from_hex()

2006-02-21 Thread Greg Ewing
Josiah Carlson wrote:

> It doesn't seem strange to you to need to encode data twice to be able
> to have a usable sequence of characters which can be embedded in an
> effectively 7-bit email;

I'm talking about a 3.0 world where all strings are unicode
and the unicode <-> external coding is for the most part
done automatically by the I/O objects. So you'd be building
up your whole email as a string (aka unicode) which happens
to only contain code points in the range 0..127, and then
writing it to your socket or whatever. You wouldn't need
to do the second encoding step explicitly very often.



Re: [Python-Dev] bytes.from_hex()

2006-02-21 Thread Josiah Carlson

Greg Ewing <[EMAIL PROTECTED]> wrote:
> 
> Stephen J. Turnbull wrote:
> 
> > What I advocate for Python is to require that the standard base64
> > codec be defined only on bytes, and always produce bytes.
> 
> I don't understand that. It seems quite clear to me that
> base64 encoding (in the general sense of encoding, not the
> unicode sense) takes binary data (bytes) and produces characters.
> That's the whole point of base64 -- so you can send arbitrary
> data over a channel that is only capable of dealing with
> characters.
> 
> So in Py3k the correct usage would be
> 
>                 base64          unicode
>                 encode          encode(x)
>  original bytes ------> unicode ------> bytes for transmission
>                 <------         <------
>                 base64          unicode
>                 decode          decode(x)
> 
> where x is whatever unicode encoding the transmission
> channel uses for characters (probably ascii or an ascii
> superset, but not necessarily).

It doesn't seem strange to you to need to encode data twice to be able
to have a usable sequence of characters which can be embedded in an
effectively 7-bit email; when base64 was, dare I say it, designed to
have 7-bit email as its destination in the first place?  It does to me.

 - Josiah



Re: [Python-Dev] bytes.from_hex()

2006-02-21 Thread Barry Warsaw
On Sun, 2006-02-19 at 23:30 +0900, Stephen J. Turnbull wrote:
> > "M" == "M.-A. Lemburg" <[EMAIL PROTECTED]> writes:

> M> * for Unicode codecs the original form is Unicode, the derived
> M> form is, in most cases, a string
> 
> First of all, that's Martin's point!
> 
> Second, almost all Americans, a large majority of Japanese, and I
> would bet most Western Europeans would say you have that backwards.
> That's the problem, and it's the Unicode advocates' problem (ie,
> ours), not the users'.  Even if we're right: education will require
> lots of effort.  Rather, we should just make it as easy as possible to
> do it right, and hard to do it wrong.

I think you've hit the nail squarely on the head.  Even though I /know/
what the intended semantics are, the originality of the string form is
deeply embedded in my nearly 30 years of programming experience, almost
all of it completely American English-centric.  

I always have to stop and think about which direction .encode()
and .decode() go in because it simply doesn't feel natural.  Or more
simply put, my brain knows what's right, but my heart doesn't and that's
why converting from one to the other is always a hiccup in the smooth
flow of coding.  And while I'm sympathetic to MAL's design decisions,
the overlaying of the generalizations doesn't help.

-Barry





Re: [Python-Dev] bytes.from_hex()

2006-02-21 Thread Greg Ewing
Stephen J. Turnbull wrote:

> What I advocate for Python is to require that the standard base64
> codec be defined only on bytes, and always produce bytes.

I don't understand that. It seems quite clear to me that
base64 encoding (in the general sense of encoding, not the
unicode sense) takes binary data (bytes) and produces characters.
That's the whole point of base64 -- so you can send arbitrary
data over a channel that is only capable of dealing with
characters.

So in Py3k the correct usage would be

                  base64          unicode
                  encode          encode(x)
   original bytes ------> unicode ------> bytes for transmission
                  <------         <------
                  base64          unicode
                  decode          decode(x)

where x is whatever unicode encoding the transmission
channel uses for characters (probably ascii or an ascii
superset, but not necessarily).

So, however it's spelled, the typing is such that

base64_encode(bytes) --> unicode

and

base64_decode(unicode) --> bytes
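Expressed with today's modules (b64encode/b64decode standing in for the hypothetical base64 codec), the two-step pipeline above looks like:

```python
import base64

data = b"\x01\x02\xfe"
text = base64.b64encode(data).decode("ascii")  # base64 encode: bytes -> unicode
wire = text.encode("ascii")                    # unicode encode(x): unicode -> bytes

# reverse both steps: wire -> unicode -> original bytes
back = base64.b64decode(wire.decode("ascii"))
assert back == data
```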

--
Greg





Re: [Python-Dev] bytes.from_hex()

2006-02-20 Thread Bob Ippolito

On Feb 20, 2006, at 7:25 PM, Stephen J. Turnbull wrote:

>> "Martin" == Martin v Löwis <[EMAIL PROTECTED]> writes:
>
> Martin> Please do take a look. It is the only way: If you were to
> Martin> embed base64 *bytes* into character data content of an XML
> Martin> element, the resulting XML file might not be well-formed
> Martin> anymore (if the encoding of the XML file is not an ASCII
> Martin> superencoding).
>
> Excuse me, I've been doing category theory recently.  By "embedding" I
> mean a map from an intermediate object which is a stream of bytes to
> the corresponding stream of characters.  In the case of UTF-16-coded
> characters, this would necessarily imply a representation change, as
> you say.
>
> What I advocate for Python is to require that the standard base64
> codec be defined only on bytes, and always produce bytes.  Any
> representation change should be done explicitly.  This is surely
> conformant with RFC 2045's definition and with RFC 3548.

+1

-bob



Re: [Python-Dev] bytes.from_hex()

2006-02-20 Thread Stephen J. Turnbull
> "Martin" == Martin v Löwis <[EMAIL PROTECTED]> writes:

Martin> Please do take a look. It is the only way: If you were to
Martin> embed base64 *bytes* into character data content of an XML
Martin> element, the resulting XML file might not be well-formed
Martin> anymore (if the encoding of the XML file is not an ASCII
Martin> superencoding).

Excuse me, I've been doing category theory recently.  By "embedding" I
mean a map from an intermediate object which is a stream of bytes to
the corresponding stream of characters.  In the case of UTF-16-coded
characters, this would necessarily imply a representation change, as
you say.

What I advocate for Python is to require that the standard base64
codec be defined only on bytes, and always produce bytes.  Any
representation change should be done explicitly.  This is surely
conformant with RFC 2045's definition and with RFC 3548.



Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]

2006-02-20 Thread Ron Adam
Bengt Richter wrote:
> On Sat, 18 Feb 2006 23:33:15 +0100, Thomas Wouters <[EMAIL PROTECTED]> wrote:

> note what base64 really is for. Its essence is to create a _character_ 
> sequence
> which can succeed in being encoded as ascii. The concept of base64 going 
> str->str
> is really a mental shortcut for s_str.decode('base64').encode('ascii'), where
> 3 octets are decoded as code for 4 characters modulo padding logic.

Wouldn't it be...

obj.encode('base64').encode('ascii')

This would probably also work...

obj.encode('base64').decode('ascii')  ->  ascii alphabet in unicode

Where the underlying sequence might be ...

obj -> bytes -> bytes:base64 -> base64 ascii character set

The point is to have the data in a safe to transmit form that can 
survive being encoded and decoded into different forms along the 
transmission path and still be restored at the final destination.

base64 ascii character set -> bytes:base64 -> original bytes -> obj




* a related note, derived from this and your other post in this thread.

If the str type constructor had an encode argument like the unicode 
type does, along with a str.encoded_with attribute, then it might be 
possible to deprecate the .decode() and .encode() methods and remove 
them from Py3k entirely, or use them as data coders/decoders instead 
of character-type encoders.

It could also create a clear separation between character encodings and 
data coding.  The following should give an exception.

   str(data_str, 'rot13')

Rot13 isn't a character encoding, but a data coding method.

   data_str.encode('rot13')   # could be ok

But this wouldn't...

   new_str = data_str.encode('latin_1')   # could cause an exception

We'd have to use...

   new_str = str(data_str, 'latin_1')  # New string sub type...


Cheers,
Ronald Adam



Re: [Python-Dev] bytes.from_hex()

2006-02-20 Thread Martin v. Löwis
Stephen J. Turnbull wrote:
> Martin> For an example where base64 is *not* necessarily
> Martin> ASCII-encoded, see the "binary" data type in XML
> Martin> Schema. There, base64 is embedded into an XML document,
> Martin> and uses the encoding of the entire XML document. As a
> Martin> result, you may get base64 data in utf16le.
> 
> I'll have to take a look.  It depends on whether base64 is specified
> as an octet-stream to Unicode stream transformation or as an embedding
> of an intermediate representation into Unicode.  Granted, defining the
> base64 alphabet as a subset of Unicode seems like the logical way to
> do it in the context of XML.

Please do take a look. It is the only way: If you were to embed base64
*bytes* into character data content of an XML element, the resulting
XML file might not be well-formed anymore (if the encoding of the XML
file is not an ASCII superencoding).

Regards,
Martin


Re: [Python-Dev] bytes.from_hex()

2006-02-20 Thread Stephen J. Turnbull
> "Josiah" == Josiah Carlson <[EMAIL PROTECTED]> writes:

Josiah> I try to internalize it by not thinking of strings as
Josiah> encoded data, but as binary data, and unicode as text.  I
Josiah> then remind myself that unicode isn't native on-disk or
Josiah> cross-network (which stores and transports bytes, not
Josiah> characters), so one needs to encode it as binary data.
Josiah> It's a subtle difference, but it has worked so far for me.

Seems like a lot of work for something that for monolingual usage
should "Just Work" almost all of the time.

Josiah> I notice that you seem to be in Japan, so teaching unicode
Josiah> is a must.

Yes.  Japan is more complicated than that, but in Python unicode is a
must.

Josiah> If you are using the "unicode is text" and "strings are
Josiah> data", and they aren't getting it; then I don't know.

Well, I can tell you that they don't get it.  One problem is PEP 263.
It makes it very easy to write programs that do line-oriented I/O with
input() and print, and the students come to think it should always be
that easy.  Since Japan has at least 6 common encodings that students
encounter on a daily basis while browsing the web, plus a couple more
that live inside of MSFT Word and Java, they're used to huge amounts
of magic.  The normal response of novice programmers is to mandate
that users of their programs use the encoding of choice and put it in
ordinary strings so that it just works.

Ie, the average student just "eats" the F on the codecs assignment,
and writes the rest of her programs without them.

>> simple, and the exceptions for using a "nonexistent" method
>> mean I don't have to reinforce---the students will be able to
>> teach each other.  The exceptions also directly help reinforce
>> the notion that text == Unicode.

Josiah> Are you sure that they would help?  If .encode() and
Josiah> .decode() drop from strings and unicode (respectively),
Josiah> they get an AttributeError.  That's almost useless.

Well, I'm not _sure_, but this is the kind of thing that you can learn
by rote.  And it will happen on a sufficiently regular basis that a
large fraction of students will experience it.  They'll ask each
other, and usually they'll find a classmate who knows what happened.

I haven't tried this with codecs, but that's been my experience with
statistical packages where some routines understand non-linear
equations but others insist on linear equations.[1] The error messages
("Equation is non-linear!  Aaugh!") are not much more specific than
AttributeError.

Josiah> Raising a better exception (with more information) would
Josiah> be better in that case, but losing the functionality that
Josiah> either would offer seems unnecessary;

Well, the point is that for the "usual suspects" (ie, Unicode codecs)
there is no functionality that would be lost.  As MAL pointed out, for
these codecs the "original" text is always Unicode; that's the role
Unicode is designed for, and by and large it fits the bill very well.
With few exceptions (such as rot13) the "derived" text will be bytes
that peripherals such as keyboards and terminals can generate and
display.

Josiah> "You are trying to encode/decode to/from incompatible
Josiah> types. expected: a->b got: x->y" is better.  Some of those
Josiah> can be done *very soon*, given the capabilities of the
Josiah> encodings module,

That's probably the way to go.

If we can have a derived "Unicode codec" class that does this, that
would pretty much entirely serve the need I perceive.  Beginning
students could learn to write iconv.py, more advanced students could
learn to create codec stacks to generate MIME bodies, which could
include base64 or quoted-printable bytes -> bytes codecs.



Footnotes: 
[1]  If you're not familiar with regression analysis, the problem is
that the equation "z = a*log(x) + b*log(y)" where a and b are to be
estimated is _linear_ in the sense that x, y, and z are data series,
and X = log(x) and Y = log(y) can be precomputed so that the equation
actually computed is "z = a*X + b*Y".  On the other hand "z = a*(x +
b*y)" is _nonlinear_ because of the coefficient on y being a*b.
Students find this hard to grasp in the classroom, but they learn
quickly in the lab.

I believe the parameter/variable inversion that my students have
trouble with in statistics is similar to the "original"/"derived"
inversion that happens with "text you can see" (derived, string) and
"abstract text inside the program" (original, Unicode).

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
   Ask not how you can "do" free software business;
  ask what your business can "do for" free software.
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mail

Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]

2006-02-20 Thread Bengt Richter
On Sat, 18 Feb 2006 23:33:15 +0100, Thomas Wouters <[EMAIL PROTECTED]> wrote:

>On Sat, Feb 18, 2006 at 01:21:18PM +0100, M.-A. Lemburg wrote:
>
[...]
>> >  - The return value for the non-unicode encodings depends on the value of
>> >the encoding argument.
>
>> Not really: you'll always get a basestring instance.
>
But actually basestring is a weird graft of semantic apples and empty bags IMO.
unicode is essentially an abstract character vector type, and str is an
abstract binary octet vector type having nothing to do with characters
except by inferential association with an encoding.

>Which is not a particularly useful distinction, since in any real world
>application, you have to be careful not to mix unicode with (non-ascii)
>bytestrings. The only way to reliably deal with unicode is to have it
>well-contained (when migrating an application from using bytestrings to
>using unicode) or to use unicode everywhere, decoding/encoding at
>entrypoints. Containment is hard to achieve.
>
>> Still, I believe that this is an educational problem. There are
>> a couple of gotchas users will have to be aware of (and this is
>> unrelated to the methods in question):
>> 
>> * "encoding" always refers to transforming original data into
>>   a derived form
ISTM encoding separates type information from the source and sets it aside
as the identity of the encoding, and renders the data in a composite of
more primitive types, octets being the most primitive short of bits.

>> 
>> * "decoding" always refers to transforming a derived form of
>>   data back into its original form
Decoding of a composite of primitives requires additional separate information
(namely identification of the encoding) to create a higher composite type.
>> 
>> * for Unicode codecs the original form is Unicode, the derived
>>   form is, in most cases, a string
You mean a str instance, right? Where the original type as character vector
is gone. That's not a string in the sense of character string.
>> 
>> As a result, if you want to use a Unicode codec such as utf-8,
>> you encode Unicode into a utf-8 string and decode a utf-8 string
>> into Unicode.
s/string/str instance/
>> 
>> Encoding a string is only possible if the string itself is
>> original data, e.g. some data that is supposed to be transformed
>> into a base64 encoded form.
Note what base64 really is for. Its essence is to create a _character_ sequence
which can succeed in being encoded as ascii. The concept of base64 going
str->str is really a mental shortcut for s_str.decode('base64').encode('ascii'),
where 3 octets are decoded as code for 4 characters modulo padding logic.
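In modern Python 3 terms, where base64 is a bytes->bytes function, Bengt's "mental shortcut" can be sketched explicitly: the encoded bytes are drawn from the ASCII subset, so turning them into the character sequence he describes is a trivial extra decode step.

```python
import base64

octets = b'\x00\xff\x10'                 # 3 arbitrary octets
b64_bytes = base64.b64encode(octets)     # 4 bytes from the base64 alphabet
text = b64_bytes.decode('ascii')         # the *character* sequence Bengt means
assert base64.b64decode(text) == octets  # round-trips back to the octets
```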

>> 
>> Decoding Unicode is only possible if the Unicode string itself
>> represents a derived form, e.g. a sequence of hex literals.
Again, it's an abbreviation, e.g. 
print u'4cf6776973'.encode('hex_chars_to_octets').decode('latin-1')
Should print Löwis
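Bengt's 'hex_chars_to_octets' codec is hypothetical; the spelled-out equivalent in Python 3, using the real binascii module, confirms the example:

```python
import binascii

# Ten hex digits name five octets; interpreting those octets as
# latin-1 recovers the original text.
octets = binascii.unhexlify('4cf6776973')
text = octets.decode('latin-1')
print(text)  # Löwis
```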

>
>Most of these gotchas would not have been gotchas had encode/decode only
>been usable for unicode encodings.
>
>> > That is why I disagree with the hypergeneralization of the encode/decode
>> > methods
>[..]
>> That's because you only look at one specific task.
>
>> Codecs also unify the various interfaces to common encodings
>> such as base64, uu or zip which are not Unicode related.
I think the trouble is that these view the transformations as octets->octets
whereas IMO decoding should always result in a container type that knows what
it is semantically, without association with external use-this-codec
information. IOW,

octets.decode('zip') -> archive
archive.encode('bzip') -> octets

You could even subclass octet to make archive that knows it's an octet vector
representing a decoded zip, so it can have an encode method that could
(specifying 'zip' again) encode itself back to the original zip, or an
alternate method to encode itself as something else, which you couldn't do
from plain octets without specifying both transformations at once (hence the
.recode idea, but I don't think that is as pure). The constructor for the
container type could also be used, like Archive(octets, 'zip'), analogous to
unicode('abc', 'ascii').

IOW 
octets + decoding info -> container type instance
container type instance + encoding info -> octets
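Bengt's Archive type is hypothetical, but the idea of a container that remembers its own codec is a few lines in Python 3; this sketch (names invented here) uses zlib in place of the 'zip' codec:

```python
import zlib

class Compressed(bytes):
    """A bytes subclass that knows it holds zlib-compressed data,
    roughly in the spirit of Bengt's hypothetical Archive type."""
    def expand(self):
        # The container carries its decoding info; no external
        # use-this-codec knowledge is needed at the call site.
        return zlib.decompress(self)

arc = Compressed(zlib.compress(b'payload'))
assert arc.expand() == b'payload'
```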
>
>No, I think you misunderstand. I object to the hypergeneralization of the
>*encode/decode methods*, not the codec system. I would have been fine with
>another set of methods for non-unicode transformations. Although I would
>have been even more fine if they got their encoding not as a string, but as,
>say, a module object, or something imported from a module.
>
>Not that I think any of this matters; we have what we have and I'll have to
>live with it ;)
Probably.
BTW, you may notice I'm saying octet instead of bytes. I have another post on
that, arguing that the basic binary information type should be octet, since
binary files are made of octets that have no intrinsic numerical or character
significance. See other post.

Re: [Python-Dev] bytes.from_hex()

2006-02-20 Thread Stephen J. Turnbull
> "Martin" == Martin v Löwis <[EMAIL PROTECTED]> writes:

Martin> Stephen J. Turnbull wrote:

Bengt> The characters in b could be encoded in plain ascii, or
Bengt> utf16le, you have to know.

>> Which base64 are you thinking about?  Both RFC 3548 and RFC
>> 2045 (MIME) specify subsets of US-ASCII explicitly.

Martin> Unfortunately, it is ambiguous as to whether they refer to
Martin> US-ASCII, the character set, or US-ASCII, the encoding.

True for RFC 3548, but the authors of RFC 2045 clearly had the
encoding in mind, since they depend on RFC 822.

Martin> It appears that RFC 3548 talks about the character set
Martin> only:

OK, although RFC 3548 cites RFC 20 (!) as its source for US-ASCII,
which clearly has bytes (though not necessarily octets) in mind, it
doesn't actually restrict base encoding to be a subset of US-ASCII.

On the other hand, RFC 3548 doesn't define "base64" (or any other base
encoding), it simply provides a set of requirements that a conforming
implementation must satisfy.  Python can therefore choose to define
its base64 as a bytes->bytes codec, with the alphabet drawn from
US-ASCII interpreted as encoding.

I would definitely prefer that, as png_image = unicode.encode('base64')
violates MAL's intuitive schema for the method.

Martin> For an example where base64 is *not* necessarily
Martin> ASCII-encoded, see the "binary" data type in XML
Martin> Schema. There, base64 is embedded into an XML document,
Martin> and uses the encoding of the entire XML document. As a
Martin> result, you may get base64 data in utf16le.

I'll have to take a look.  It depends on whether base64 is specified
as an octet-stream to Unicode stream transformation or as an embedding
of an intermediate representation into Unicode.  Granted, defining the
base64 alphabet as a subset of Unicode seems like the logical way to
do it in the context of XML.

P.S.  My apologies for munging your name in the To: header.  I'm
having problems with my MUA.



Re: [Python-Dev] bytes.from_hex()

2006-02-19 Thread Josiah Carlson

"Stephen J. Turnbull" <[EMAIL PROTECTED]> wrote:
> 
> > "Josiah" == Josiah Carlson <[EMAIL PROTECTED]> writes:
> 
> Josiah> The question remains: is str.decode() returning a string
> Josiah> or unicode depending on the argument passed, when the
> Josiah> argument quite literally names the codec involved,
> Josiah> difficult to understand?  I don't believe so; am I the
> Josiah> only one?
> 
> Do you do any of the user education *about codec use* that you
> recommend?  The people I try to teach about coding invariably find it
> difficult to understand.  The problem is that the near-universal
> intuition is that for "human-usable text" pretty much anything *but
> Unicode* will do.  This is a really hard block to get them past.
> There is very good reason why Unicode is plain text ("original" in
> MAL's terms) and everything else is encoded ("derived"), but students
> new to the concept often take a while to "get" it.

I've not been teaching Python; when I was still a TA, it was strictly
algorithms and data structures.  Of those people who I have had the
opportunity to entice into Python, I've not followed up on their
progress to know if they had any issues.

I try to internalize it by not thinking of strings as encoded data, but
as binary data, and unicode as text.  I then remind myself that unicode
isn't native on-disk or cross-network (which stores and transports bytes,
not characters), so one needs to encode it as binary data.  It's a
subtle difference, but it has worked so far for me.
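Josiah's mental model (unicode is the text a program manipulates; bytes are what disk and network carry) reduces to a small Python 3 sketch:

```python
text = 'naïve text'           # unicode: what the program works with
wire = text.encode('utf-8')   # bytes: what disk and network carry
assert isinstance(wire, bytes)
assert wire.decode('utf-8') == text   # decode at the entry point
```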

In my experience, at least for only-English speaking users, most people
don't even get to unicode.  I didn't even touch it until I had been well
versed with the encoding and decoding of all different kinds of binary
data, when a half-dozen international users (China, Japan, Russia, ...)
requested its support in my source editor; so I added it.  Supporting it
properly hasn't been very difficult, and the only real nit I have
experienced is supporting the encoding line just after the #! line for
arbitrary codecs (sometimes saving a file in a particular encoding dies).
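The "encoding line just after the #! line" is the PEP 263 magic comment; a simplified version of its recognition pattern (the real tokenizer additionally requires it on line 1 or 2 and constrains the comment form) looks like:

```python
import re

# Simplified sketch of the PEP 263 coding-declaration pattern.
COOKIE = re.compile(r'coding[:=]\s*([-\w.]+)')

m = COOKIE.search('# -*- coding: shift_jis -*-')
assert m and m.group(1) == 'shift_jis'
```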

I notice that you seem to be in Japan, so teaching unicode is a must. 
If you are using the "unicode is text" and "strings are data", and they
aren't getting it; then I don't know.


> Maybe it's just me, but whether it's the teacher or the students, I am
> *not* excited about the education route.  Martin's simple rule *is*
> simple, and the exceptions for using a "nonexistent" method mean I
> don't have to reinforce---the students will be able to teach each
> other.  The exceptions also directly help reinforce the notion that
> text == Unicode.

Are you sure that they would help?  If .encode() and .decode() drop from
strings and unicode (respectively), they get an AttributeError.  That's
almost useless.  Raising a better exception (with more information)
would be better in that case, but losing the functionality that either
would offer seems unnecessary; which is why I had suggested some of the
other method names.  Perhaps a message like "This method was removed
because it confused users.  Use help(str.encode) (or unicode.decode) to
find out how you can do the equivalent, or do what you *really* wanted
to do." would help.


> I grant the point that .decode('base64') is useful, but I also believe
> that "education" is a lot more easily said than done in this case.

What I meant by "education" is 'better documentation' and 'better
exception messages'.  I didn't learn Python by sitting in a class; I
learned it by going through the tutorial over a weekend as a 2nd year
undergrad and writing software which could do what I wanted/needed.
Compared to the compiler messages I'd been seeing from Codewarrior and
MSVC 6, Python exceptions were like an oracle.  I can understand how
first-time programmers can have issues with *some* Python exception
messages, which is why I think that we could use better ones.  There is
also the other issue that sometimes people fail to actually read the
messages.

Again, I don't believe that an AttributeError is any better than an
"ordinal not in range(128)", but "You are trying to encode/decode
to/from incompatible types. expected: a->b got: x->y" is better.  Some
of those can be done *very soon*, given the capabilities of the
encodings module, and they could likely be easily migrated, regardless
of the decisions with .encode()/.decode() .

 - Josiah



Re: [Python-Dev] bytes.from_hex()

2006-02-19 Thread Bob Ippolito
On Feb 19, 2006, at 10:55 AM, Martin v. Löwis wrote:

> Stephen J. Turnbull wrote:
>> BTW, what use cases do you have in mind for Unicode -> Unicode
>> decoding?
>
> I think "rot13" falls into that category: it is a transformation
> on text, not on bytes.

The current implementation is a transformation on bytes, not text.   
Conceptually, though, it's a text->text transform.
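(As it happens, modern Python 3 settled on exactly this view: rot_13 survives only as a str->str codec, reachable through codecs.encode/decode rather than the string methods.)

```python
import codecs

# rot_13 in Python 3 is a pure text->text transform.
assert codecs.encode('abc', 'rot_13') == 'nop'
assert codecs.decode('nop', 'rot_13') == 'abc'
```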

> For other "odd" cases: "base64" goes Unicode->bytes in the *decode*
> direction, not in the encode direction. Some may argue that base64
> is bytes, not text, but in many applications, you can combine base64
> (or uuencode) with arbitrary other text in a single stream. Of course,
> it could be required that you go u.encode("ascii").decode("base64").

I would say that base64 is bytes->bytes.  Just because those bytes  
happen to be in a subset of ASCII, it's still a serialization meant  
for wire transmission.  Sometimes it ends up in unicode (e.g. in  
XML), but that's the exception not the rule.

-bob



Re: [Python-Dev] bytes.from_hex()

2006-02-19 Thread Martin v. Löwis
Stephen J. Turnbull wrote:
> Bengt> The characters in b could be encoded in plain ascii, or
> Bengt> utf16le, you have to know.
> 
> Which base64 are you thinking about?  Both RFC 3548 and RFC 2045
> (MIME) specify subsets of US-ASCII explicitly.

Unfortunately, it is ambiguous as to whether they refer to US-ASCII,
the character set, or US-ASCII, the encoding. It appears that
RFC 3548 talks about the character set only:

- section 2.4 talks about "choosing an alphabet", and how it should
  be possible for humans to handle such data.
- section 2.3 talks about non-alphabet characters

So it appears that RFC 3548 defines a conversion bytes->text.
To transmit this, you then also need encoding. MIME appears
to also use the US-ASCII *encoding* ("charset", in IETF speak),
for the "base64" Content-Transfer-Encoding.

For an example where base64 is *not* necessarily ASCII-encoded,
see the "binary" data type in XML Schema. There, base64 is embedded
into an XML document, and uses the encoding of the entire XML
document. As a result, you may get base64 data in utf16le.

Regards,
Martin


Re: [Python-Dev] bytes.from_hex()

2006-02-19 Thread Martin v. Löwis
Stephen J. Turnbull wrote:
> Do you do any of the user education *about codec use* that you
> recommend?  The people I try to teach about coding invariably find it
> difficult to understand.  The problem is that the near-universal
> intuition is that for "human-usable text" pretty much anything *but
> Unicode* will do.

It really is a matter of education. For the first time in my career,
I have been teaching the first-semester programming course, and I
was happy to see that the text book already has a section on text
and Unicode (actually, I selected the text book also based on whether
there was good discussion of that aspect). So I spent quite some
time with data representation (integrals, floats, characters), and
I hope that the students now "got it".

If they didn't learn it that way in the first semester (or already
got mis-educated in highschool), it will be very hard for them to
relearn. So I expect that it will take a decade or two until this
all is common knowledge.

Regards,
Martin


Re: [Python-Dev] bytes.from_hex()

2006-02-19 Thread Martin v. Löwis
Stephen J. Turnbull wrote:
> BTW, what use cases do you have in mind for Unicode -> Unicode
> decoding?

I think "rot13" falls into that category: it is a transformation
on text, not on bytes.

For other "odd" cases: "base64" goes Unicode->bytes in the *decode*
direction, not in the encode direction. Some may argue that base64
is bytes, not text, but in many applications, you can combine base64
(or uuencode) with arbitrary other text in a single stream. Of course,
it could be required that you go u.encode("ascii").decode("base64").

> def encode-mime-body (string, codec-list):
>     if codec-list[0] not in charset-codec-list:
>         raise NotCharsetCodecException
>     if len (codec-list) > 1 and codec-list[-1] not in transfer-codec-list:
>         raise NotTransferCodecException
>     for codec in codec-list:
>         string = string.encode (codec)
>     return string
> 
> mime-body = encode-mime-body ("This is a pen.",
>                               [ 'shift_jis', 'zip', 'base64' ])

I think this is an example where you *should* use the codec API,
as designed. As that apparently requires streams for stacking (ie.
no support for codec stacking), you would have to write

def encode_mime_body(string, codec_list):
    stack = output = cStringIO.StringIO()
    for codec in reversed(codec_list):
        stack = codecs.getwriter(codec)(stack)
    stack.write(string)
    stack.reset()
    return output.getvalue()

Notice that you have to start the stacking with the last codec,
and you have to keep a reference to the StringIO object where
the actual bytes end up.
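For comparison, a hedged Python 3 sketch of the same stacking, with the transfer codecs treated as plain bytes->bytes functions (the TRANSFER table and function names are assumptions of this sketch, not a real API):

```python
import base64
import zlib

# bytes->bytes transfer codecs as ordinary functions.
TRANSFER = {'zip': zlib.compress, 'base64': base64.b64encode}

def encode_mime_body(text, charset, transfer_codecs):
    data = text.encode(charset)        # charset codec: text -> bytes
    for name in transfer_codecs:       # transfer codecs: bytes -> bytes
        data = TRANSFER[name](data)
    return data

body = encode_mime_body('This is a pen.', 'shift_jis', ['zip', 'base64'])
```

No stream stacking is needed because each stage is just function composition, applied first-to-last rather than building the writer stack in reverse.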

Regards,
Martin

P.S. there shows some LISP through in your Python code :-)


Re: [Python-Dev] bytes.from_hex()

2006-02-19 Thread Stephen J. Turnbull
> "Bengt" == Bengt Richter <[EMAIL PROTECTED]> writes:

Bengt> The characters in b could be encoded in plain ascii, or
Bengt> utf16le, you have to know.

Which base64 are you thinking about?  Both RFC 3548 and RFC 2045
(MIME) specify subsets of US-ASCII explicitly.



Re: [Python-Dev] bytes.from_hex()

2006-02-19 Thread Stephen J. Turnbull
> "Bob" == Bob Ippolito <[EMAIL PROTECTED]> writes:

Bob> On Feb 17, 2006, at 8:33 PM, Josiah Carlson wrote:

>> But you aren't always getting *unicode* text from the decoding
>> of bytes, and you may be encoding bytes *to* bytes:

Please note that I presumed that you can indeed assume that decoding
of bytes always results in unicode, and encoding of unicode always
results in bytes.  I believe Guido made the proposal relying on that
assumption too.  The constructor notation makes no sense for making an
object of the same type as the original unless it's a copy constructor.

You could argue that the base64 language is indeed a different
language from the bytes language, and I'd agree.  But since there's no
way in Python to determine whether a string that conforms to base64 is
supposed to be base64 or bytes, it would be a very bad idea to
interpret the distinction as one of type.

>> b2 = bytes(b, "base64")

>> b3 = bytes(b2, "base64")

>> Which direction are we going again?

Bob> This is *exactly* why the current set of codecs are INSANE.
Bob> unicode.encode and str.decode should be used *only* for
Bob> unicode codecs.  Byte transforms are entirely different
Bob> semantically and should be some other method pair.

General filters are semantically different, I agree.  But "encode" and
"decode" in English are certainly far more general than character
coding conversion.  The use of those methods for any stream conversion
that is invertible (eg, compression or encryption) is not insane.
It's just pedagogically inconvenient given the existing confusion
(outside of python-dev, of course) about character coding
issues.

I'd like to rephrase your statement as "*only* unicode.encode and
str.decode should be used for unicode codecs".  Ie, str.encode(codec)
and unicode.decode(codec) should raise errors if codec is a "unicode
codec".  The question in my mind is whether we should allow other
kinds of codecs or not.

I could live with "not", but if we're going to have other kinds
of codecs, I think they should have concrete signatures.  Ie,
basestring -> basestring shouldn't be allowed.  Content transfer
encodings like BASE64 and quoted-printable, compression, encryption,
etc IMO should be bytes -> bytes.  Overloading to unicode -> unicode
is sorta plausible for BASE64 or QP, but YAGNI.  OTOH, the Unicode
standard does define a number of unicode -> unicode transformations,
and it might make sense to generalize to case conversions etc.  (Note
that these conversions are pseudo-invertible, so you can think of them
as generalized .encode/.decode pairs.  The inverse is usually the
identity, which seems weird, but from the pedagogical standpoint you
could handle that weirdness by raising an error if the .encode method
were invoked.)

To be concrete, I could imagine writing

s2 = s1.decode('upcase')
if s2 == s1:
    print "Why are you shouting at me?"
else:
    print "I like calm, well-spoken snakes."

s3 = s2.encode('upcase')
if s3 == s2:
    print "Never fails!"
else:
    print "See a vet; your Python is *very* sick."

I chose the decode method to do the non-trivial transformation because
.decode()'s value is supposed to be "original" text in MAL's terms.
And that's true of uppercase-only text; you're still supposed to be
able to read it, so I guess it's not "encoded".  That's pretty
pedantic; I think it's better to raise on .encode('upcase').
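Stephen's 'upcase' codec is hypothetical, but registering such a str->str codec is only a few lines in Python 3 (the search-function and codec names here are inventions of this sketch):

```python
import codecs

def _search(name):
    # Custom search function: recognize only our hypothetical codec.
    if name == 'upcase':
        return codecs.CodecInfo(
            name='upcase',
            encode=lambda s, errors='strict': (s.upper(), len(s)),
            decode=lambda s, errors='strict': (s.lower(), len(s)))

codecs.register(_search)

assert codecs.encode('hello', 'upcase') == 'HELLO'
assert codecs.decode('SHOUT', 'upcase') == 'shout'
```

Note that 'hello'.encode('upcase') would raise in Python 3, because str.encode insists on a bytes result; codecs.encode() is the general entry point for such transforms, which is essentially the restriction this thread is debating.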




Re: [Python-Dev] bytes.from_hex()

2006-02-19 Thread Stephen J. Turnbull
> "Josiah" == Josiah Carlson <[EMAIL PROTECTED]> writes:

Josiah> The question remains: is str.decode() returning a string
Josiah> or unicode depending on the argument passed, when the
Josiah> argument quite literally names the codec involved,
Josiah> difficult to understand?  I don't believe so; am I the
Josiah> only one?

Do you do any of the user education *about codec use* that you
recommend?  The people I try to teach about coding invariably find it
difficult to understand.  The problem is that the near-universal
intuition is that for "human-usable text" pretty much anything *but
Unicode* will do.  This is a really hard block to get them past.
There is very good reason why Unicode is plain text ("original" in
MAL's terms) and everything else is encoded ("derived"), but students
new to the concept often take a while to "get" it.

Maybe it's just me, but whether it's the teacher or the students, I am
*not* excited about the education route.  Martin's simple rule *is*
simple, and the exceptions for using a "nonexistent" method mean I
don't have to reinforce---the students will be able to teach each
other.  The exceptions also directly help reinforce the notion that
text == Unicode.

I grant the point that .decode('base64') is useful, but I also believe
that "education" is a lot more easily said than done in this case.




Re: [Python-Dev] bytes.from_hex()

2006-02-19 Thread Stephen J. Turnbull
> "M" == "M.-A. Lemburg" <[EMAIL PROTECTED]> writes:

M> The main reason is symmetry and the fact that strings and
M> Unicode should be as similar as possible in order to simplify
M> the task of moving from one to the other.

Those are perfectly compatible with Martin's suggestion.

M> Still, I believe that this is an educational problem. There are
M> a couple of gotchas users will have to be aware of (and this is
M> unrelated to the methods in question):

But IMO that's wrong, both in attitude and in fact.  As for attitude,
users should not have to be aware of these gotchas.  Codec writers, on
the other hand, should be required to avoid presenting users with
those gotchas.  Martin's draconian restriction is in the right
direction, but you can argue it goes way too far.

In fact, of course it's related to the methods in question.
"Original" vs "derived" data can only be defined in terms of some
notion of the "usual semantics" of the streams, and that is going to
be strongly reflected in the semantics of the methods.

M> * "encoding" always refers to transforming original data into a
M> derived form

M> * "decoding" always refers to transforming a derived form of
M> data back into its original form

Users *already* know that; it's a very strong connotation of the
English words.  The problem is that users typically have their own
concept of what's original and what's derived.  For example:

M> * for Unicode codecs the original form is Unicode, the derived
M> form is, in most cases, a string

First of all, that's Martin's point!

Second, almost all Americans, a large majority of Japanese, and I
would bet most Western Europeans would say you have that backwards.
That's the problem, and it's the Unicode advocates' problem (ie,
ours), not the users'.  Even if we're right: education will require
lots of effort.  Rather, we should just make it as easy as possible to
do it right, and hard to do it wrong.

BTW, what use cases do you have in mind for Unicode -> Unicode
decoding?  Maximally decomposed forms and/or eliminating compatibility
characters etc?  Very specialized.

M> Codecs also unify the various interfaces to common encodings
M> such as base64, uu or zip which are not Unicode related.

Now this is useful and has use cases I've run into, for example in
email, where you would like to use the same interface for base64 as
for shift_jis and you'd like to be able to write

def encode-mime-body (string, codec-list):
    if codec-list[0] not in charset-codec-list:
        raise NotCharsetCodecException
    if len (codec-list) > 1 and codec-list[-1] not in transfer-codec-list:
        raise NotTransferCodecException
    for codec in codec-list:
        string = string.encode (codec)
    return string

mime-body = encode-mime-body ("This is a pen.",
                              [ 'shift_jis', 'zip', 'base64' ])


I guess I have to admit I'm backtracking from my earlier hardline
support for Martin's position, but I'm still sympathetic: (a) that's
the direct way to "make it easy to do it right", and (b) I still think
the use cases for non-Unicode codecs are YAGNI very often.



Re: [Python-Dev] bytes.from_hex()

2006-02-19 Thread Stephen J. Turnbull
> "M" == "M.-A. Lemburg" <[EMAIL PROTECTED]> writes:

M> Martin v. Löwis wrote:

>> No. The reason to ban string.decode and bytes.encode is that it
>> confuses users.

M> Instead of starting to ban everything that can potentially
M> confuse a few users, we should educate those users and tell
M> them what these methods mean and how they should be used.

ISTM it's neither "potential" nor "a few".

As Aahz pointed out, for the common use of text I/O it requires only a
single clue ("Unicode is The One True Plain Text, everything else must
be decoded to Unicode before use.") and you don't need any "education"
about "how to use" codecs under Martin's restrictions; you just need
to know which ones to use.

This is not a benefit to be given up lightly.

Would it be reasonable to put those restrictions in the codecs?  Ie,
so that bytes().encode('gzip') is allowed for the "generic" codec
'gzip', but bytes().encode('utf-8') is an error for the "charset"
codec 'utf-8'?
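What putting the restriction "in the codecs" might mean can be sketched in modern Python 3 terms (the wrapper, its name, and the charset list are assumptions of this sketch): generic bytes->bytes codecs pass through, while charset codecs refuse bytes input.

```python
import codecs

# Hypothetical set of "charset" codec names; a real implementation
# would derive this from codec metadata rather than a hand-kept list.
CHARSET_CODECS = {'utf-8', 'ascii', 'latin-1', 'utf-16'}

def encode(obj, codec):
    if isinstance(obj, bytes) and codec in CHARSET_CODECS:
        raise TypeError('charset codec %r expects text, not bytes' % codec)
    return codecs.encode(obj, codec)

payload = encode(b'data', 'zlib_codec')   # generic codec: allowed
```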



Re: [Python-Dev] bytes.from_hex()

2006-02-19 Thread Stephen J. Turnbull
> "Ian" == Ian Bicking <[EMAIL PROTECTED]> writes:

Ian> Encodings cover up eclectic interfaces, where those
Ian> interfaces fit a basic pattern -- data in, data out.

Isn't "filter" the word you're looking for?

I think you've just made a very strong case that this is a slippery
slope that we should avoid.



Re: [Python-Dev] bytes.from_hex()

2006-02-19 Thread Michael Hudson
"M.-A. Lemburg" <[EMAIL PROTECTED]> writes:

> Martin v. Löwis wrote:
>> M.-A. Lemburg wrote:
> True. However, note that the .encode()/.decode() methods on
> strings and Unicode narrow down the possible return types.
> The corresponding .bytes methods should only allow bytes and
> Unicode.
 I forgot that: what is the rationale for that restriction?
>>>
>>> To assure that only those types can be returned from those
>>> methods, ie. instances of basestring, which in return permits
>>> type inference for those methods.
>> 
>> Hmm. So it is for type inference?
>> Where is that documented?
>
> Somewhere in the python-dev mailing list archives ;-)
>
> Seriously, we should probably add this to the documentation.

Err.. I don't think this is a good argument, for quite
a few reasons.  There certainly aren't many other features in Python
designed to aid type inference and the knowledge that something
returns "unicode or str" isn't especially useful...

Cheers,
mwh

-- 
  ROOSTA:  Ever since you arrived on this planet last night you've
   been going round telling people that you're Zaphod
   Beeblebrox, but that they're not to tell anyone else.
-- The Hitch-Hikers Guide to the Galaxy, Episode 7


Re: [Python-Dev] bytes.from_hex()

2006-02-18 Thread Ron Adam
Josiah Carlson wrote:
> Ron Adam <[EMAIL PROTECTED]> wrote:


> Except that makes it even more ambiguous.
>
> Is encodings.tounicode() encoding, or decoding?  According to everything
> you have said so far, it would be decoding.  But if I am decoding binary
> data, why should it be spending any time as a unicode string?  What do I
> mean?

Encoding and decoding are relative concepts.  It's all encoding from one
thing to another.  Whether it's "decoding" or "encoding" depends on the
relationship of the current encoding to a standard encoding.

The confusion introduced by "decode" is when the 'default_encoding'
changes, will change, or is unknown.


> x = f.read() #x contains base-64 encoded binary data
> y = encodings.to_unicode(x, 'base64')
> 
> y now contains BINARY DATA, except that it is a unicode string

No, that wasn't what I was describing.  You get a Unicode string object
as the result, not a bytes object with binary data.  See the toy example
at the bottom.


> z = encodings.to_str(y, 'latin-1')
> 
> Later you define a str_to_str function, which I (or someone else) would
> use like:
> 
> z = str_to_str(x, 'base64', 'latin-1')
> 
> But the trick is that I don't want some unicode string encoded in
> latin-1, I want my binary data unencoded.  They may happen to be the
> same in this particular example, but that doesn't mean that it makes any
> sense to the user.

If you want bytes then you would use the bytes() type to get bytes
directly.  Not encode or decode.

 binary_unicode = bytes(unicode_string)

The exact byte order and representation would need to be decided by the
Python developers in this case.  The internal representation,
'unicode-internal', is UCS-2, I believe.



>> It's no more ambiguous than any math 
>> operation where you can do it one way with one operation and get your 
>> original value back with the same operation by using an inverse value.
>>
>> n2=n+1; n3=n+(-1); n==n3
>> n2=n*2; n3=n*(.5); n==n3
> 
> Ahh, so you are saying 'to_base64' and 'from_base64'.  There is one
> major reason why I don't like that kind of a system: I can't just say
> encoding='base64' and use str.encode(encoding) and str.decode(encoding);
> I necessarily have to use str.recode('to_'+encoding) and
> str.recode('from_'+encoding).  Seems a bit awkward.

Yes, but the encodings API could abstract out the 'to_base64' and
'from_base64' so you can just say 'base64' and have it work either way.

Maybe a toy "incomplete" example might help.



# in module bytes.py or someplace else.
class bytes(list):
   """
   bytes methods defined here
   """



# in module encodings.py

# using a dict of tuples, but other solutions would
# work just as well.
unicode_codecs = {
   'base64': ('from_base64', 'to_base64'),
   }

def tounicode(obj, from_codec):
    b = bytes(obj)
    b = b.recode(unicode_codecs[from_codec][0])
    return unicode(b)

def tostr(obj, to_codec):
    b = bytes(obj)
    b = b.recode(unicode_codecs[to_codec][1])
    return str(b)



# in your application

import encodings

... a bunch of code ...

u = encodings.tounicode(s, 'base64')

# or if going the other way

s = encodings.tostr(u, 'base64')



Does this help?  Is the relationship between the bytes object and the
encodings API clearer here?  If not, maybe we should discuss it further
offline.
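For readers following this today: a runnable Python 3 approximation of the toy API above, with the hypothetical bytes.recode() step replaced by the stdlib base64 module (tounicode/tostr are the proposal's names, not real stdlib functions):

```python
import base64

def tostr(text, to_codec):
    """Encode a unicode string to an ASCII-only encoded str."""
    if to_codec == 'base64':
        return base64.b64encode(text.encode('utf-8')).decode('ascii')
    raise LookupError('unknown codec: %s' % to_codec)

def tounicode(data, from_codec):
    """Decode an encoded str back to a unicode string."""
    if from_codec == 'base64':
        return base64.b64decode(data).decode('utf-8')
    raise LookupError('unknown codec: %s' % from_codec)

s = tostr(u"This is a pen.", 'base64')   # 'VGhpcyBpcyBhIHBlbi4='
u = tounicode(s, 'base64')
assert u == u"This is a pen."
```

The lookup-table indirection of the original sketch is collapsed into an if/else here; the point is only that each direction has an unambiguous entry point.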


Cheers,
Ronald Adam



Re: [Python-Dev] bytes.from_hex()

2006-02-18 Thread Josiah Carlson

Ron Adam <[EMAIL PROTECTED]> wrote:
> Josiah Carlson wrote:
> > Ron Adam <[EMAIL PROTECTED]> wrote:
> >> Josiah Carlson wrote:
> > [snip]
> >>> Again, the problem is ambiguity; what does bytes.recode(something) mean?
> >>> Are we encoding _to_ something, or are we decoding _from_ something? 
> >> This was just an example of one way that might work, but here are my 
> >> thoughts on why I think it might be good.
> >>
> >> In this case, the ambiguity is reduced as far as the encoding and 
> >> decoding operations are concerned.
> >>
> >>   somestring = encodings.tostr( someunicodestr, 'latin-1')
> >>
> >> It's pretty clear what is happening to me.
> >>
> >>  It will encode an object named someunicodestr to a string, using 
> >> the 'latin-1' encoder.
> > 
> > But now how do you get it back?  encodings.tounicode(..., 'latin-1')?,
> > unicode(..., 'latin-1')?
> 
> Yes, just do:
> 
>   someunicodestr = encodings.tounicode(somestring, 'latin-1')
> 
> > What about string transformations:
> > somestring = encodings.tostr(somestr, 'base64')
>  >
> > How do we get that back?  encodings.tostr() again is completely
> > ambiguous, str(somestring, 'base64') seems a bit awkward (switching
> > namespaces)?
> 
> In the case where a string is converted to another string, it would 
> probably be best to have a requirement that they all get converted to 
> unicode as an intermediate step.  By doing that it becomes an explicit 
> two-step operation.
> 
>  # string to string encoding
>  u_string = encodings.tounicode(s_string, 'base64')
>  s2_string = encodings.tostr(u_string, 'base64')

Except that makes it even more ambiguous.

Is encodings.tounicode() encoding, or decoding?  According to everything
you have said so far, it would be decoding.  But if I am decoding binary
data, why should it be spending any time as a unicode string?  What do I
mean?

x = f.read() #x contains base-64 encoded binary data
y = encodings.to_unicode(x, 'base64')

y now contains BINARY DATA, except that it is a unicode string

z = encodings.to_str(y, 'latin-1')

Later you define a str_to_str function, which I (or someone else) would
use like:

z = str_to_str(x, 'base64', 'latin-1')

But the trick is that I don't want some unicode string encoded in
latin-1, I want my binary data unencoded.  They may happen to be the
same in this particular example, but that doesn't mean that it makes any
sense to the user.

[...]

> >>> What about .reencode and .redecode?  It seems as
> >>> though the 're' added as a prefix to .encode and .decode makes it
> >>> clearer that you get the same type back as you put in, and it is also
> >>> unambiguous to direction.
> 
> ...
> 
>  > I must not be expressing myself very well.
>  >
> > Right now:
> > s.encode() -> s
> > s.decode() -> s, u
> > u.encode() -> s, u
> > u.decode() -> u
> > 
> > Martin et al's desired change to encode/decode:
> > s.decode() -> u
> > u.encode() -> s
>  >
>  > No others.
> 
> Which would be similar to the functions I suggested.  The main 
> difference is only whether it would be better to have them as methods or 
> separate factory functions, and the spelling of the names.  Both have 
> their advantages, I think.

While others would disagree, I personally am not a fan of to* or from*
style namings, for either function names (especially in the encodings
module) or methods.  Just a personal preference.

Of course, I don't find the current situation regarding
str/unicode.encode/decode to be confusing either, but maybe it's because
my unicode experience is strictly within the realm of GUI widgets, where
compartmentalization can be easier.


> >> The method bytes.recode() always does a byte transformation, which can 
> >> be almost anything.  It's the context bytes.recode() is used in that 
> >> determines what's happening.  In the above cases, it's using an encoding 
> >> transformation, so what it's doing is precisely what you would expect by 
> >> its context.

[THIS IS THE AMBIGUITY]
> > Indeed, there is a translation going on, but it is not clear as to
> > whether you are encoding _to_ something or _from_ something.  What does
> > s.recode('base64') mean?  Are you encoding _to_ base64 or _from_ base64? 
> > That's where the ambiguity lies.
> 
> Bengt didn't propose adding .recode() to the string types, but only the 
> bytes type.  The byte type would "recode" the bytes using a specific 
> transformation.  I believe his view is it's a lower level API than 
> strings that can be used to implement the higher level encoding API 
> with, not replace the encoding API.  Or that is they way I interpreted 
> the suggestion.

But again, what would the transformation be?  To something?  From
something?  'to_base64', 'from_base64', 'to_rot13' (which happens to be
identical to) 'from_rot13', ...  Saying it would "recode ... using a
specific transformation" is a cop-out, what would the translation be? 
How would it work?  How would it be sp

Re: [Python-Dev] bytes.from_hex()

2006-02-18 Thread Ron Adam

Josiah Carlson wrote:
> Ron Adam <[EMAIL PROTECTED]> wrote:
>> Josiah Carlson wrote:
> [snip]
>>> Again, the problem is ambiguity; what does bytes.recode(something) mean?
>>> Are we encoding _to_ something, or are we decoding _from_ something? 
>> This was just an example of one way that might work, but here are my 
>> thoughts on why I think it might be good.
>>
>> In this case, the ambiguity is reduced as far as the encoding and 
>> decoding operations are concerned.
>>
>>   somestring = encodings.tostr( someunicodestr, 'latin-1')
>>
>> It's pretty clear what is happening to me.
>>
>>  It will encode an object named someunicodestr to a string, using 
>> the 'latin-1' encoder.
> 
> But now how do you get it back?  encodings.tounicode(..., 'latin-1')?,
> unicode(..., 'latin-1')?

Yes, just do:

  someunicodestr = encodings.tounicode(somestring, 'latin-1')



> What about string transformations:
> somestring = encodings.tostr(somestr, 'base64')
 >
> How do we get that back?  encodings.tostr() again is completely
> ambiguous, str(somestring, 'base64') seems a bit awkward (switching
> namespaces)?

In the case where a string is converted to another string, it would 
probably be best to have a requirement that they all get converted to 
unicode as an intermediate step.  By doing that it becomes an explicit 
two-step operation.

 # string to string encoding
 u_string = encodings.tounicode(s_string, 'base64')
 s2_string = encodings.tostr(u_string, 'base64')

Or you could have a convenience function to do it in the encodings 
module also.

def strtostr(s, sourcecodec, destcodec):
    u = tounicode(s, sourcecodec)
    return tostr(u, destcodec)

Then...

s2 = encodings.strtostr(s, 'base64', 'base64')

Which would be kind of pointless in this example, but it would be a good 
way to test a codec.

assert s == s2
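In modern Python 3 the same round-trip check can be written directly against the codec machinery, where base64 survives as the bytes-to-bytes codec 'base64_codec':

```python
import codecs

data = b"any bytes at all \x00\xff"
encoded = codecs.encode(data, 'base64_codec')     # bytes -> base64-encoded bytes
decoded = codecs.decode(encoded, 'base64_codec')  # base64 bytes -> original bytes
assert decoded == data
```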


>>> Are we going to need to embed the direction in the encoding/decoding
>>> name (to_base64, from_base64, etc.)?  That doesn't seem any better than
>>> binascii.b2a_base64.
>> No, that's why I suggested two separate lists (or dictionaries might be 
>> better).  They can contain the same names, but the lists they are in 
>> determine the context and point to the needed codec.  And that step is 
>> abstracted out by putting it inside the encodings.tostr() and 
>> encodings.tounicode() functions.
>>
>> So either function would call 'base64' from the correct codec list and 
>> get the correct encoding or decoding codec it needs.
> 
> Either the API you have described is incomplete, you haven't noticed the
> directional ambiguity you are describing, or I have completely lost it.

Most likely I gave an incomplete description of the API in this case 
because there are probably several ways to implement it.



>>> What about .reencode and .redecode?  It seems as
>>> though the 're' added as a prefix to .encode and .decode makes it
>>> clearer that you get the same type back as you put in, and it is also
>>> unambiguous to direction.

...

 > I must not be expressing myself very well.
 >
> Right now:
> s.encode() -> s
> s.decode() -> s, u
> u.encode() -> s, u
> u.decode() -> u
> 
> Martin et al's desired change to encode/decode:
> s.decode() -> u
> u.encode() -> s
 >
 > No others.

Which would be similar to the functions I suggested.  The main 
difference is only whether it would be better to have them as methods or 
separate factory functions, and the spelling of the names.  Both have 
their advantages, I think.


>> The method bytes.recode() always does a byte transformation, which can 
>> be almost anything.  It's the context bytes.recode() is used in that 
>> determines what's happening.  In the above cases, it's using an encoding 
>> transformation, so what it's doing is precisely what you would expect by 
>> its context.
> 
> Indeed, there is a translation going on, but it is not clear as to
> whether you are encoding _to_ something or _from_ something.  What does
> s.recode('base64') mean?  Are you encoding _to_ base64 or _from_ base64? 
> That's where the ambiguity lies.

Bengt didn't propose adding .recode() to the string types, but only the 
bytes type.  The byte type would "recode" the bytes using a specific 
transformation.  I believe his view is it's a lower level API than 
strings that can be used to implement the higher level encoding API 
with, not replace the encoding API.  Or that is they way I interpreted 
the suggestion.


>> There isn't a bytes.decode(), since that's just another transformation. 
>> So only the one method is needed.  Which makes it easer to learn.
> 
> But ambiguous.

What's ambiguous about it?  It's no more ambiguous than any math 
operation where you can do it one way with one operation and get your 
original value back with the same operation by using an inverse value.

n2=n+1; n3=n+(-1); n==n3
n2=n*2; n3=n*(.5); n==n3


>> Learning how the current system works comes awfully close to reverse 
>

Re: [Python-Dev] bytes.from_hex()

2006-02-18 Thread Terry Reedy

"Josiah Carlson" <[EMAIL PROTECTED]> wrote in message 
news:[EMAIL PROTECTED]

> Again, the problem is ambiguity; what does bytes.recode(something) mean?
> Are we encoding _to_ something, or are we decoding _from_ something?
> Are we going to need to embed the direction in the encoding/decoding
> name (to_base64, from_base64, etc.)?

To me, that seems simple and clear.  b.recode('from_base64') obviously 
requires that b meet the restrictions of base64.  Similarly for 'from_hex'.

> That doesn't seem any better than binascii.b2a_base64

I think 'from_base64' is *much* better.  I think there are now 4 
string-to-string transform modules that do similar things.  Not optimal to 
me.

> What about .reencode and .redecode?  It seems as
> though the 're' added as a prefix to .encode and .decode makes it
> clearer that you get the same type back as you put in, and it is also
> unambiguous to direction.

To me, the 're' prefix is awkward, confusing, and misleading.

Terry J. Reedy





Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]

2006-02-18 Thread Thomas Wouters
On Sat, Feb 18, 2006 at 01:21:18PM +0100, M.-A. Lemburg wrote:

> It's by no means a Perl attitude.

In your eyes, perhaps. It certainly feels that way to me (or I wouldn't have
said it :). Perl happens to be full of general constructs that were added
because they were easy to add, or they were useful in edgecases. The
encode/decode methods remind me of that, even though I fully understand the
reasoning behind it, and the elegance of the implementation.

> The main reason is symmetry and the fact that strings and Unicode
> should be as similar as possible in order to simplify the task of
> moving from one to the other.

Yes, and this is a design choice I don't agree with. They're different
types. They do everything similarly, except when they are mixed together
(unicode takes precedence, in general, encoding the bytestring from the
default encoding.) Going from one to the other isn't symmetric, though. I
understand that you disagree; the disagreement is on the fundamental choice
of allowing 'encode' and 'decode' to do *more* than going from and to
unicode. I regret that decision, not the decision to make encode and decode
symmetric (which makes sense, after the decision to overgeneralize
encode/decode is made.)

> >  - The return value for the non-unicode encodings depends on the value of
> >the encoding argument.

> Not really: you'll always get a basestring instance.

Which is not a particularly useful distinction, since in any real world
application, you have to be careful not to mix unicode with (non-ascii)
bytestrings. The only way to reliably deal with unicode is to have it
well-contained (when migrating an application from using bytestrings to
using unicode) or to use unicode everywhere, decoding/encoding at
entrypoints. Containment is hard to achieve.

> Still, I believe that this is an educational problem. There are
> a couple of gotchas users will have to be aware of (and this is
> unrelated to the methods in question):
> 
> * "encoding" always refers to transforming original data into
>   a derived form
> 
> * "decoding" always refers to transforming a derived form of
>   data back into its original form
> 
> * for Unicode codecs the original form is Unicode, the derived
>   form is, in most cases, a string
> 
> As a result, if you want to use a Unicode codec such as utf-8,
> you encode Unicode into a utf-8 string and decode a utf-8 string
> into Unicode.
> 
> Encoding a string is only possible if the string itself is
> original data, e.g. some data that is supposed to be transformed
> into a base64 encoded form.
> 
> Decoding Unicode is only possible if the Unicode string itself
> represents a derived form, e.g. a sequence of hex literals.
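These two string-side cases can be illustrated with the bytes-to-bytes codecs that survive in Python 3 (shown with the codecs module; the thread itself is discussing the Python 2 str/unicode methods):

```python
import codecs

# "Encoding a string": original bytes -> a derived (base64) form.
assert codecs.encode(b'abc', 'base64_codec') == b'YWJj\n'

# "Decoding" a derived form (hex literals) -> the original bytes.
assert codecs.decode(b'616263', 'hex_codec') == b'abc'
```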

Most of these gotchas would not have been gotchas had encode/decode only
been usable for unicode encodings.

> > That is why I disagree with the hypergeneralization of the encode/decode
> > methods
[..]
> That's because you only look at one specific task.

> Codecs also unify the various interfaces to common encodings
> such as base64, uu or zip which are not Unicode related.

No, I think you misunderstand. I object to the hypergeneralization of the
*encode/decode methods*, not the codec system. I would have been fine with
another set of methods for non-unicode transformations. Although I would
have been even more fine if they got their encoding not as a string, but as,
say, a module object, or something imported from a module.

Not that I think any of this matters; we have what we have and I'll have to
live with it ;)

-- 
Thomas Wouters <[EMAIL PROTECTED]>

Hi! I'm a .signature virus! copy me into your .signature file to help me spread!


Re: [Python-Dev] bytes.from_hex()

2006-02-18 Thread Ron Adam
Aahz wrote:
> On Sat, Feb 18, 2006, Ron Adam wrote:
>> I like the bytes.recode() idea a lot. +1
>>
>> It seems to me it's a far more useful idea than encoding and decoding by 
>> overloading and could do both and more.  It has a lot of potential to be 
>> an intermediate step for encoding as well as being used for many other 
>> translations to byte data.
>>
>> I think I would prefer that encode and decode be just functions with 
>> well defined names and arguments instead of being methods or arguments 
>> to string and Unicode types.
>>
>> I'm not sure on exactly how this would work. Maybe it would need two 
>> sets of encodings, ie.. decoders, and encoders.  An exception would be
>> given if it wasn't found for the direction one was going in.
> 
> Here's an idea I don't think I've seen before:
> 
> bytes.recode(b, src_encoding, dest_encoding)
> 
> This requires the user to state up-front what the source encoding is.
> One of the big problems that I see with the whole encoding mess is that
> so much of it contains implicit assumptions about the source encoding;
> this gets away from that.

Yes, but it's not just the encodings that are implicit, it is also the 
types.

s.encode(codec)  # explicit source type, ? dest type
s.decode(codec)  # explicit source type, ? dest type

encodings.tostr(obj, codec) # implicit *known* source type
# explicit dest type

encodings.tounicode(obj, codec) # implicit *known* source type
# explicit dest type

In this case the source is implicit, but there can be a well defined 
check to validate the source type against the codec being used.  It's my 
feeling the user *knows* what he already has, and so it's more important 
that the resulting object type is explicit.

In your suggestion...

bytes.recode(b, src_encoding, dest_encoding)

Here the encodings are both explicit, but the source and destination 
types of the bytes are not.  Since it works on bytes, they could have 
come from anywhere, and after the translation they would be cast to the 
type the user *thinks* it should result in.  A source of errors that 
would likely pass silently.
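Aahz's suggested signature can be sketched with today's codecs machinery; for charset codecs it amounts to transcoding (recode here is a hypothetical free function, not a real bytes method):

```python
import codecs

def recode(b, src_encoding, dest_encoding):
    """Decode bytes from src_encoding, then re-encode to dest_encoding."""
    return codecs.encode(codecs.decode(b, src_encoding), dest_encoding)

# The latin-1 byte for e-acute becomes the two-byte utf-8 sequence.
assert recode(b'caf\xe9', 'latin-1', 'utf-8') == b'caf\xc3\xa9'
```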

The way I see it, the bytes type should be a lower-level object that 
doesn't care what byte transformation it does.  I.e., they are all one-way 
byte-to-byte transformations determined by context.  And it should have 
the capability to read from and write to types without translating in 
the same step.  Keep it simple.

Then it could be used as a lower level byte translator to implement 
encodings and other translations in encoding methods or functions 
instead of trying to make it replace the higher level functionality.

Cheers,
Ron



Re: [Python-Dev] bytes.from_hex()

2006-02-18 Thread Josiah Carlson

Ron Adam <[EMAIL PROTECTED]> wrote:
> Josiah Carlson wrote:
[snip]
> > Again, the problem is ambiguity; what does bytes.recode(something) mean?
> > Are we encoding _to_ something, or are we decoding _from_ something? 
> 
> This was just an example of one way that might work, but here are my 
> thoughts on why I think it might be good.
> 
> In this case, the ambiguity is reduced as far as the encoding and 
> decoding operations are concerned.
> 
>   somestring = encodings.tostr( someunicodestr, 'latin-1')
> 
> It's pretty clear what is happening to me.
> 
>  It will encode an object named someunicodestr to a string, using 
> the 'latin-1' encoder.

But now how do you get it back?  encodings.tounicode(..., 'latin-1')?,
unicode(..., 'latin-1')?

What about string transformations:
somestring = encodings.tostr(somestr, 'base64')

How do we get that back?  encodings.tostr() again is completely
ambiguous, str(somestring, 'base64') seems a bit awkward (switching
namespaces)?


> And also result in clear errors if the specified encoding is 
> unavailable, and if it is, if it's not compatible with the given 
> *someunicodestr* obj type.
> 
> Further hints could be gained by.
> 
>  help(encodings.tostr)
> 
> Which could result in... something like...
>  """
>  encodings.tostr(<obj>, <codec>) -> string
> 
>  Encode a unicode string using a encoder codec to a
>  non-unicode string or transform a non-unicode string
>  to another non-unicode string using an encoder codec.
>  """
> 
> And if that's not enough, then help(encodings) could give more clues. 
> These steps would be what I would do. And then the next thing would be 
> to find the python docs entry on encodings.
> 
> Placing them in encodings seems like a fairly good place to look for 
> these functions if you are working with encodings.  So I find that just 
> as convenient as having them be string methods.
> 
> There is no intermediate default encoding involved above, (the bytes 
> object is used instead), so you wouldn't get some of the messages the 
> present system results in when ascii is the default.
> 
> (Yes, I know it won't when P3K is here also)
> 
> > Are we going to need to embed the direction in the encoding/decoding
> > name (to_base64, from_base64, etc.)?  That doesn't seem any better than
> > binascii.b2a_base64.
> 
> No, that's why I suggested two separate lists (or dictionaries might be 
> better).  They can contain the same names, but the lists they are in 
> determine the context and point to the needed codec.  And that step is 
> abstracted out by putting it inside the encodings.tostr() and 
> encodings.tounicode() functions.
> 
> So either function would call 'base64' from the correct codec list and 
> get the correct encoding or decoding codec it needs.

Either the API you have described is incomplete, you haven't noticed the
directional ambiguity you are describing, or I have completely lost it.


> > What about .reencode and .redecode?  It seems as
> > though the 're' added as a prefix to .encode and .decode makes it
> > clearer that you get the same type back as you put in, and it is also
> > unambiguous to direction.
> 
> But then wouldn't we end up with a multitude of ways to do things?
> 
>  s.encode(codec) == s.redecode(codec)
>  s.decode(codec) == s.reencode(codec)
>  unicode(s, codec) == s.decode(codec)
>  str(u, codec) == u.encode(codec)
>  str(s, codec) == s.encode(codec)
>  unicode(s, codec) == s.reencode(codec)
>  str(u, codec) == s.redecode(codec)
>  str(s, codec) == s.redecode(codec)
> 
> Umm .. did I miss any?  Which ones would you remove?
> 
> Which ones of those will succeed with which codecs?

I must not be expressing myself very well.

Right now:
s.encode() -> s
s.decode() -> s, u
u.encode() -> s, u
u.decode() -> u

Martin et al's desired change to encode/decode:
s.decode() -> u
u.encode() -> s

No others.

What my thoughts on .reencode() and .redecode() would get you given
Martin et al's desired change:
s.reencode() -> s (you get encoded strings as strings)
s.redecode() -> s (you get decoded strings as strings)
u.reencode() -> u (you get encoded unicode as unicode)
u.redecode() -> u (you get decoded unicode as unicode)

If one wants to go from unicode to string, one uses .encode(). If one
wants to go from string to unicode, one uses .decode().  If one wants to
keep their type unchanged, but encode or decode the data/text, one would
use .reencode() and .redecode(), depending on whether their source is an
encoded block of data, or the original data they want to encode.

The other bonus is that if given .reencode() and .redecode(), one can
quite easily verify that the source is possible as a source, and that
you would get back the proper type.  How this would occur behind the
scenes is beyond the scope of this discussion, but it seems to me to be
easy, given what I've read about the current mechanism.
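The type-preserving semantics described here can be sketched with Python 3's bytes-to-bytes codecs (reencode/redecode are the proposed names; nothing like them was ever added to the language):

```python
import codecs

def reencode(data, encoding):
    """Same type out as in: bytes in, encoded bytes out."""
    return codecs.encode(data, encoding)

def redecode(data, encoding):
    """Same type out as in: bytes in, decoded bytes out."""
    return codecs.decode(data, encoding)

original = b"some binary \x00 payload"
wrapped = reencode(original, 'base64_codec')
assert redecode(wrapped, 'base64_codec') == original
```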

Whether the constru

Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]

2006-02-18 Thread M.-A. Lemburg
Martin v. Löwis wrote:
> M.-A. Lemburg wrote:
 True. However, note that the .encode()/.decode() methods on
 strings and Unicode narrow down the possible return types.
 The corresponding .bytes methods should only allow bytes and
 Unicode.
>>> I forgot that: what is the rationale for that restriction?
>>
>> To assure that only those types can be returned from those
>> methods, ie. instances of basestring, which in return permits
>> type inference for those methods.
> 
> Hmm. So it is for type inference?
> Where is that documented?

Somewhere in the python-dev mailing list archives ;-)

Seriously, we should probably add this to the documentation.

> This looks pretty inconsistent. Either codecs can give arbitrary
> return types, then .encode/.decode should also be allowed to
> give arbitrary return types, or codecs should be restricted.

No.

As I've said before: the .encode() and .decode() methods
are convenience methods to interface to codecs which take
string/Unicode on input and create string/Unicode output.

> What's the point of first allowing a wide interface, and then
> narrowing it?

The codec interface is an abstract interface.  It is flexible
enough to allow codecs to define possible input and output
types while being strict about the method names and signatures.

Much like the file interface in Python, the copy protocol
or the pickle interface.

> Also, if type inference is the goal, what is the point in allowing
> two result types?

I'm not sure I understand the question: type inference is about
being able to infer the types of (among other things) function
return objects. This is what the restriction guarantees - much
like int() guarantees that you get either an integer or a long.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Feb 18 2006)
>>> Python/Zope Consulting and Support ...http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! 


Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]

2006-02-18 Thread Martin v. Löwis
M.-A. Lemburg wrote:
>>>True. However, note that the .encode()/.decode() methods on
>>>strings and Unicode narrow down the possible return types.
>>>The corresponding .bytes methods should only allow bytes and
>>>Unicode.
>>
>>I forgot that: what is the rationale for that restriction?
> 
> 
> To assure that only those types can be returned from those
> methods, i.e. instances of basestring, which in turn permits
> type inference for those methods.

Hmm. So it is for type inference.
Where is that documented?

This looks pretty inconsistent. Either codecs can give arbitrary
return types, in which case .encode/.decode should also be allowed to
give arbitrary return types, or codecs should be restricted.
What's the point of first allowing a wide interface, and then
narrowing it?

Also, if type inference is the goal, what is the point in allowing
two result types?

Regards,
Martin


Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]

2006-02-18 Thread M.-A. Lemburg
Martin v. Löwis wrote:
> M.-A. Lemburg wrote:
>> I've already explained why we have .encode() and .decode()
>> methods on strings and Unicode many times. I've also
>> explained the misunderstanding that codecs can only do
>> Unicode-string conversions. And I've explained that
>> the .encode() and .decode() method *do* check the return
>> types of the codecs and only allow strings or Unicode
>> on return (no lists, instances, tuples or anything else).
>>
>> You seem to ignore this fact.
> 
> I'm not ignoring the fact that you have explained this
> many times. I just fail to understand your explanations.

Feel free to ask questions.

> For example, you said at some point that codecs are not
> restricted to Unicode. However, I don't recall any
> explanation what the restriction *is*, if any restriction
> exists. No such restriction seems to be documented.

The codecs are not restricted with respect to the data types
they work on. It's up to each codec to define which data
types are valid, which it accepts on input, and which it
returns.

>> True. However, note that the .encode()/.decode() methods on
>> strings and Unicode narrow down the possible return types.
>> The corresponding .bytes methods should only allow bytes and
>> Unicode.
> 
> I forgot that: what is the rationale for that restriction?

To assure that only those types can be returned from those
methods, i.e. instances of basestring, which in turn permits
type inference for those methods.

The codecs functions encode() and decode() don't have these
restrictions, and thus provide a generic interface to the
codec's encode and decode functions. It's up to the caller
to restrict the allowed encodings and, as a result, the
possible input/output types.
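As a rough sketch, this generic interface can be illustrated with the codecs functions (shown here in modern Python 3 terms, where the same unrestricted interface survives; the sample values are illustrative only):

```python
import codecs

# The generic codec functions place no basestring-style restriction
# on the result type: the codec itself decides what comes back.
rot = codecs.encode('hello', 'rot13')    # text -> text
b64 = codecs.encode(b'data', 'base64')   # bytes -> bytes
raw = codecs.decode(b64, 'base64')       # bytes -> bytes, round-trips
utf = codecs.encode('héllo', 'utf-8')    # text -> bytes
```

It is then the caller's job to know which of these input/output combinations the chosen codec supports.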

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] bytes.from_hex()

2006-02-18 Thread Martin v. Löwis
Michael Hudson wrote:
> There's one extremely significant example where the *value* of
> something impacts on the type of something else: functions.  The types
> of everything involved in str([1]) and len([1]) are the same but the
> results are different.  This shows up in PyPy's type annotation; most
> of the time we just track types indeed, but when something is called
> we need to have a pretty good idea of the potential values, too.
> 
> Relevant to the point at hand?  No.  Apologies for wasting your time
> :)

Actually, I think it is relevant. I never thought about it this way,
but now that you mention it, you are right.

This demonstrates that the string argument to .encode is actually
a function name, at least the way it is implemented now. So
.encode("uu") and .encode("rot13") are *two* different methods,
instead of being a single method.

This brings me back to my original point: "rot13" should be a function,
not a parameter to some function. In essence, .encode reimplements
apply(), with the added feature of not having to pass the function
itself, but just its name.
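This observation can be sketched with a hypothetical helper (the name apply_codec is mine, not Python's) that does essentially what .encode("rot13") does under the hood: dispatch on a string to find the actual function.

```python
import codecs

def apply_codec(name, data):
    # Hypothetical helper: look the "function" up by name in the
    # codec registry and apply it -- essentially apply() with the
    # function identified by its name rather than passed directly.
    encode_func = codecs.lookup(name).encode
    result, length_consumed = encode_func(data)
    return result
```

Seen this way, each codec name selects a different function, so .encode("uu") and .encode("rot13") really are two different methods sharing one entry point.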

Maybe this design results from a really deep understanding of

Namespaces are one honking great idea -- let's do more of those!

Regards,
Martin


Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]

2006-02-18 Thread Martin v. Löwis
M.-A. Lemburg wrote:
> I've already explained why we have .encode() and .decode()
> methods on strings and Unicode many times. I've also
> explained the misunderstanding that codecs can only do
> Unicode-string conversions. And I've explained that
> the .encode() and .decode() method *do* check the return
> types of the codecs and only allow strings or Unicode
> on return (no lists, instances, tuples or anything else).
> 
> You seem to ignore this fact.

I'm not ignoring the fact that you have explained this
many times. I just fail to understand your explanations.

For example, you said at some point that codecs are not
restricted to Unicode. However, I don't recall any
explanation what the restriction *is*, if any restriction
exists. No such restriction seems to be documented.

> True. However, note that the .encode()/.decode() methods on
> strings and Unicode narrow down the possible return types.
> The corresponding .bytes methods should only allow bytes and
> Unicode.

I forgot that: what is the rationale for that restriction?

Regards,
Martin


Re: [Python-Dev] bytes.from_hex()

2006-02-18 Thread M.-A. Lemburg
Aahz wrote:
> On Sat, Feb 18, 2006, Ron Adam wrote:
>> I like the bytes.recode() idea a lot. +1
>>
>> It seems to me it's a far more useful idea than encoding and decoding by 
>> overloading and could do both and more.  It has a lot of potential to be 
>> an intermediate step for encoding as well as being used for many other 
>> translations to byte data.
>>
>> I think I would prefer that encode and decode be just functions with 
>> well defined names and arguments instead of being methods or arguments 
>> to string and Unicode types.
>>
>> I'm not sure on exactly how this would work. Maybe it would need two 
> sets of encodings, i.e. decoders and encoders.  An exception would be
>> given if it wasn't found for the direction one was going in.
> 
> Here's an idea I don't think I've seen before:
> 
> bytes.recode(b, src_encoding, dest_encoding)
> 
> This requires the user to state up-front what the source encoding is.
> One of the big problems that I see with the whole encoding mess is that
> so much of it contains implicit assumptions about the source encoding;
> this gets away from that.

You might want to look at the codecs.py module: it has all these
things and a lot more.

http://docs.python.org/lib/module-codecs.html
http://svn.python.org/view/python/trunk/Lib/codecs.py?view=markup

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] bytes.from_hex()

2006-02-18 Thread Aahz
On Sat, Feb 18, 2006, Ron Adam wrote:
>
> I like the bytes.recode() idea a lot. +1
> 
> It seems to me it's a far more useful idea than encoding and decoding by 
> overloading and could do both and more.  It has a lot of potential to be 
> an intermediate step for encoding as well as being used for many other 
> translations to byte data.
> 
> I think I would prefer that encode and decode be just functions with 
> well defined names and arguments instead of being methods or arguments 
> to string and Unicode types.
> 
> I'm not sure on exactly how this would work. Maybe it would need two 
> sets of encodings, i.e. decoders and encoders.  An exception would be
> given if it wasn't found for the direction one was going in.

Here's an idea I don't think I've seen before:

bytes.recode(b, src_encoding, dest_encoding)

This requires the user to state up-front what the source encoding is.
One of the big problems that I see with the whole encoding mess is that
so much of it contains implicit assumptions about the source encoding;
this gets away from that.
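A minimal sketch of what such a recode could do for character encodings (a hypothetical free function standing in for the proposed bytes method; the explicit Unicode round-trip is my assumption about one possible implementation):

```python
def recode(b, src_encoding, dest_encoding):
    """Hypothetical bytes.recode(): transcode bytes from one
    character encoding to another.  Both encodings are stated
    up-front, so nothing is implicit about the source."""
    return b.decode(src_encoding).encode(dest_encoding)
```

Because the source encoding is an explicit argument, there is no hidden default involved at any point.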
-- 
Aahz ([EMAIL PROTECTED])   <*> http://www.pythoncraft.com/

"19. A language that doesn't affect the way you think about programming,
is not worth knowing."  --Alan Perlis


Re: [Python-Dev] bytes.from_hex()

2006-02-18 Thread Adam Olsen
On 2/18/06, Josiah Carlson <[EMAIL PROTECTED]> wrote:
> Look at what we've currently got going for data transformations in the
> standard library to see what these removals will do: base64 module,
> binascii module, binhex module, uu module, ...  Do we want or need to
> add another top-level module for every future encoding/codec that comes
> out (or does everyone think that we're done seeing codecs)?  Do we want
> to keep monkey-patching binascii with names like 'a2b_hqx'?  While there
> is currently one text->text transform (rot13), do we add another module
> for text->text transforms? Would it start having names like t2e_rot13()
> and e2t_rot13()?

If top-level modules are the problem then why not make codecs into a package?

from codecs import utf8, base64

utf8.encode(u) -> b
utf8.decode(b) -> u
base64.encode(b) -> b
base64.decode(b) -> b
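The proposed layout can be approximated as a sketch (the per-codec modules are hypothetical; here they are emulated with thin wrapper objects over the codec registry):

```python
import codecs

class _Codec:
    # Hypothetical stand-in for a per-codec module such as
    # codecs.utf8 or codecs.base64 in the proposed package,
    # exposing just encode() and decode().
    def __init__(self, name):
        self._name = name
    def encode(self, data):
        return codecs.encode(data, self._name)
    def decode(self, data):
        return codecs.decode(data, self._name)

utf8 = _Codec('utf-8')
base64 = _Codec('base64')
```

Each "module" then has an unambiguous direction and a fixed input/output type, without adding a new top-level module per codec.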

--
Adam Olsen, aka Rhamphoryncus


Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]

2006-02-18 Thread M.-A. Lemburg
Thomas Wouters wrote:
> On Sat, Feb 18, 2006 at 12:06:37PM +0100, M.-A. Lemburg wrote:
> 
>> I've already explained why we have .encode() and .decode()
>> methods on strings and Unicode many times. I've also
>> explained the misunderstanding that codecs can only do
>> Unicode-string conversions. And I've explained that
>> the .encode() and .decode() method *do* check the return
>> types of the codecs and only allow strings or Unicode
>> on return (no lists, instances, tuples or anything else).
>>
>> You seem to ignore this fact.
> 
> Actually, I think the problem is that while we all agree the
> bytestring/unicode methods are a useful way to convert from bytestring to
> unicode and back again, we disagree on their *general* usefulness. Sure, the
> codecs mechanism is powerful, and even more so because they can determine
their own return type. But it still smells and feels like a Perl attitude,
> for the reasons already explained numerous times, as well:

It's by no means a Perl attitude.

The main reason is symmetry and the fact that strings and Unicode
should be as similar as possible in order to simplify the task of
moving from one to the other.

>  - The return value for the non-unicode encodings depends on the value of
>the encoding argument.

Not really: you'll always get a basestring instance.

>  - The general case, by and large, especially for non-power users, is to
>encode unicode to bytestrings and to decode bytestrings to unicode. And
>that is a hard enough task for many of the non-powerusers. Being able to
>use the encode/decode methods for other tasks isn't helping them.

Agreed.

Still, I believe that this is an educational problem. There are
a couple of gotchas users will have to be aware of (and this is
unrelated to the methods in question):

* "encoding" always refers to transforming original data into
  a derived form

* "decoding" always refers to transforming a derived form of
  data back into its original form

* for Unicode codecs the original form is Unicode, the derived
  form is, in most cases, a string

As a result, if you want to use a Unicode codec such as utf-8,
you encode Unicode into a utf-8 string and decode a utf-8 string
into Unicode.

Encoding a string is only possible if the string itself is
original data, e.g. some data that is supposed to be transformed
into a base64 encoded form.

Decoding Unicode is only possible if the Unicode string itself
represents a derived form, e.g. a sequence of hex literals.
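These gotchas can be made concrete (a sketch using the codecs functions; the sample values are illustrative only):

```python
import codecs

# Unicode codec: Unicode is the original form, bytes the derived form.
derived = 'Grüße'.encode('utf-8')
original = derived.decode('utf-8')

# Encoding a byte string: the bytes are themselves original data,
# transformed here into a base64-encoded derived form.
b64 = codecs.encode(b'raw data', 'base64')

# Decoding text: only sensible when the text is itself a derived
# form, e.g. a sequence of hex literals.
data = codecs.decode('64617461', 'hex')
```

In each case the codec fixes which side counts as "original" and which as "derived"; the direction of encode/decode follows from that, not from the Python types involved.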

> That is why I disagree with the hypergeneralization of the encode/decode
> methods, regardless of the fact that it is a natural expansion of the
> implementation of codecs. Sure, it looks 'right' and 'natural' when you look
> at the implementation. It sure doesn't look natural, to me and to many
> others, when you look at the task of encoding and decoding
> bytestrings/unicode.

That's because you only look at one specific task.

Codecs also unify the various interfaces to common encodings
such as base64, uu or zip which are not Unicode related.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] bytes.from_hex()

2006-02-18 Thread Ron Adam
Josiah Carlson wrote:
> Ron Adam <[EMAIL PROTECTED]> wrote:
>> Josiah Carlson wrote:
>>> Bengt Richter had a good idea with bytes.recode() for strictly bytes
>>> transformations (and the equivalent for text), though it is ambiguous as
>>> to the direction; are we encoding or decoding with bytes.recode()?  In
>>> my opinion, this is why .encode() and .decode() makes sense to keep on
>>> both bytes and text, the direction is unambiguous, and if one has even a
>>> remote idea of what the heck the codec is, they know their result.
>>>
>>>  - Josiah
>> I like the bytes.recode() idea a lot. +1
>>
>> It seems to me it's a far more useful idea than encoding and decoding by 
>> overloading and could do both and more.  It has a lot of potential to be 
>> an intermediate step for encoding as well as being used for many other 
>> translations to byte data.
> 
> Indeed it does.
> 
>> I think I would prefer that encode and decode be just functions with 
>> well defined names and arguments instead of being methods or arguments 
>> to string and Unicode types.
> 
> Attaching it to string and unicode objects is a useful convenience. 
> Just like x.replace(y, z) is a convenience for string.replace(x, y, z) . 
> Tossing the encode/decode somewhere else, like encodings, or even string,
> I see as a backwards step.
> 
>> I'm not sure on exactly how this would work. Maybe it would need two 
>> sets of encodings, i.e. decoders and encoders.  An exception would be
>> given if it wasn't found for the direction one was going in.
>>
>> Roughly... something or other like:
>>
>>  import encodings
>>
>>  encodings.tostr(obj, encoding):
>> if encoding not in encoders:
>> raise LookupError('encoding not found in encoders')
>> # check if obj works with encoding to string
>> # ...
>> b = bytes(obj).recode(encoding)
>> return str(b)
>>
>>  encodings.tounicode(obj, decoding):
>> if decoding not in decoders:
>> raise LookupError('decoding not found in decoders')
>> # check if obj works with decoding to unicode
>> # ...
>> b = bytes(obj).recode(decoding)
>> return unicode(b)
>>
>> Anyway... food for thought.
> 
> Again, the problem is ambiguity; what does bytes.recode(something) mean?
> Are we encoding _to_ something, or are we decoding _from_ something? 

This was just an example of one way that might work, but here are my 
thoughts on why I think it might be good.

In this case, the ambiguity is reduced as far as the encoding and 
decoding operations are concerned:

  somestring = encodings.tostr(someunicodestr, 'latin-1')

It's pretty clear to me what is happening: it encodes the object 
named someunicodestr to a string with the 'latin-1' encoder.

It would also result in clear errors if the specified encoding is 
unavailable, or, if it is available, if it's not compatible with the 
type of the given *someunicodestr* object.

Further hints could be gained by:

 help(encodings.tostr)

Which could result in... something like...
 """
 encoding.tostr( ,  ) -> string

 Encode a unicode string using a encoder codec to a
 non-unicode string or transform a non-unicode string
 to another non-unicode string using an encoder codec.
 """

And if that's not enough, then help(encodings) could give more clues. 
These steps would be what I would do. And then the next thing would be 
to find the python docs entry on encodings.

Placing them in encodings seems like a fairly good place to look for 
these functions if you are working with encodings.  So I find that just 
as convenient as having them be string methods.

There is no intermediate default encoding involved above (the bytes 
object is used instead), so you wouldn't get some of the error messages 
the present system produces when ascii is the default.

(Yes, I know that won't be an issue once Py3K is here, either.)

> Are we going to need to embed the direction in the encoding/decoding
> name (to_base64, from_base64, etc.)?  That doesn't seem any better than
> binascii.b2a_base64 .  

No, that's why I suggested two separate lists (or dictionaries might be 
better).  They can contain the same names, but the lists they are in 
determine the context and point to the needed codec.  And that step is 
abstracted out by putting it inside the encodings.tostr() and 
encodings.tounicode() functions.

So either function would call 'base64' from the correct codec list and 
get the correct encoding or decoding codec it needs.


> What about .reencode and .redecode?  It seems as
> though the 're' added as a prefix to .encode and .decode makes it
> clearer that you get the same type back as you put in, and it is also
> unambiguous to direction.

But then wouldn't we end up with multitude of ways to do things?

 s.encode(codec) == s.redecode(codec)
 s.decode(codec) == s.reencode(codec)
 unicode(s, codec) == s.decode(codec)
 str(u, codec) == u.encode(codec)
 str(s, codec) == s.encode(codec)

Re: [Python-Dev] bytes.from_hex()

2006-02-18 Thread Michael Hudson
This posting is entirely tangential.  Be warned.

"Martin v. Löwis" <[EMAIL PROTECTED]> writes:

> It's worse than that. The return *type* depends on the *value* of
> the argument. I think there is little precedent for that:

There's one extremely significant example where the *value* of
something impacts on the type of something else: functions.  The types
of everything involved in str([1]) and len([1]) are the same but the
results are different.  This shows up in PyPy's type annotation; most
of the time we just track types indeed, but when something is called
we need to have a pretty good idea of the potential values, too.

Relevant to the point at hand?  No.  Apologies for wasting your time
:)

Cheers,
mwh

-- 
  The ultimate laziness is not using Perl.  That saves you so much
  work you wouldn't believe it if you had never tried it.
-- Erik Naggum, comp.lang.lisp


Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]

2006-02-18 Thread M.-A. Lemburg
Martin v. Löwis wrote:
> M.-A. Lemburg wrote:
>> Just because some codecs don't fit into the string.decode()
>> or bytes.encode() scenario doesn't mean that these codecs are
>> useless or that the methods should be banned.
> 
> No. The reason to ban string.decode and bytes.encode is that
> it confuses users.

Instead of starting to ban everything that can potentially
confuse a few users, we should educate those users and tell
them what these methods mean and how they should be used.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] bytes.from_hex() [Was: PEP 332 revival in coordination with pep 349?]

2006-02-18 Thread Thomas Wouters
On Sat, Feb 18, 2006 at 12:06:37PM +0100, M.-A. Lemburg wrote:

> I've already explained why we have .encode() and .decode()
> methods on strings and Unicode many times. I've also
> explained the misunderstanding that codecs can only do
> Unicode-string conversions. And I've explained that
> the .encode() and .decode() method *do* check the return
> types of the codecs and only allow strings or Unicode
> on return (no lists, instances, tuples or anything else).
> 
> You seem to ignore this fact.

Actually, I think the problem is that while we all agree the
bytestring/unicode methods are a useful way to convert from bytestring to
unicode and back again, we disagree on their *general* usefulness. Sure, the
codecs mechanism is powerful, and even more so because they can determine
their own return type. But it still smells and feels like a Perl attitude,
for the reasons already explained numerous times, as well:

 - The return value for the non-unicode encodings depends on the value of
   the encoding argument.

 - The general case, by and large, especially for non-power users, is to
   encode unicode to bytestrings and to decode bytestrings to unicode. And
   that is a hard enough task for many of the non-powerusers. Being able to
   use the encode/decode methods for other tasks isn't helping them.

That is why I disagree with the hypergeneralization of the encode/decode
methods, regardless of the fact that it is a natural expansion of the
implementation of codecs. Sure, it looks 'right' and 'natural' when you look
at the implementation. It sure doesn't look natural, to me and to many
others, when you look at the task of encoding and decoding
bytestrings/unicode.

-- 
Thomas Wouters <[EMAIL PROTECTED]>

Hi! I'm a .signature virus! copy me into your .signature file to help me spread!

