In message <[EMAIL PROTECTED]>
          James Mastros <[EMAIL PROTECTED]> wrote:

> > That leaves the third, which is what I have implemented. When looking to
> > transcode from A to B it will first ask A if it can transcode to B and
> > if that fails then it will ask B if it can transcode from A.
> I propose another variant on this:
> If that fails, it asks A to transcode to Unicode, and B to transcode from
> Unicode.  (Not Unicode to transcode to B; Unicode implements no transcodings.)

My code does that, though at a slightly higher level. If you look
at string_transcode() you will see that if it can't find a direct
mapping it will go via Unicode. If C had closures then I'd have
buried that down in the chartype_lookup_transcoder() layer, but it
doesn't so I couldn't ;-)
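To make the lookup order concrete, here is a minimal sketch, not the
actual Parrot code. The chartype struct, lookup_transcoder(),
transcode_char() and the two toy character types are all hypothetical
names made up for illustration. It asks the source first, then the
destination, and only one level up falls back to going via Unicode.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types for illustration; not Parrot's real definitions. */
typedef unsigned int codepoint;
typedef codepoint (*transcoder)(codepoint c);

typedef struct chartype {
    const char *name;
    /* Return a transcoder from this type to dest, or NULL if none. */
    transcoder (*to)(const struct chartype *dest);
    /* Return a transcoder from src to this type, or NULL if none. */
    transcoder (*from)(const struct chartype *src);
} chartype;

/* Unicode implements no transcodings itself. */
const chartype unicode_type = { "unicode", NULL, NULL };

/* Toy character types: codes in toy-a are Unicode + 1, in toy-b Unicode + 2. */
static codepoint a_to_unicode(codepoint c) { return c - 1; }
static codepoint unicode_to_b(codepoint c) { return c + 2; }

static transcoder a_to(const chartype *dest)
{
    return dest == &unicode_type ? a_to_unicode : NULL;
}

static transcoder b_from(const chartype *src)
{
    return src == &unicode_type ? unicode_to_b : NULL;
}

const chartype toy_a = { "toy-a", a_to, NULL };
const chartype toy_b = { "toy-b", NULL, b_from };

/* Ask the source if it can transcode to dst, then ask dst if it can
 * transcode from the source. Returns NULL if neither can. */
transcoder lookup_transcoder(const chartype *src, const chartype *dst)
{
    transcoder t = NULL;
    assert(src != NULL && dst != NULL);
    if (src->to != NULL)
        t = src->to(dst);
    if (t == NULL && dst->from != NULL)
        t = dst->from(src);
    return t;
}

/* One level up: use a direct mapping if there is one, otherwise go via
 * Unicode. Returns 1 on success and 0 if no route exists. */
int transcode_char(const chartype *src, const chartype *dst,
                   codepoint in, codepoint *out)
{
    transcoder direct = lookup_transcoder(src, dst);
    transcoder to_u, from_u;

    if (direct != NULL) {
        *out = direct(in);
        return 1;
    }
    to_u = lookup_transcoder(src, &unicode_type);
    from_u = lookup_transcoder(&unicode_type, dst);
    if (to_u == NULL || from_u == NULL)
        return 0;
    *out = from_u(to_u(in));
    return 1;
}
```

The two-step route needs both function pointers at call time, so it
cannot be handed back from the lookup as a single plain function
pointer. That is the part a closure would have let me push down a layer.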

> > The problem it raises is, who is responsible for transcoding from ASCII to
> > Latin-1, and back again? If we're not careful both ends will implement
> > both translations and we will have effective duplication.
> 1) Neither.  Each must support transcoding to and from Unicode.

Absolutely.

> 2) But either can support converting directly if it wants.

The danger is that everybody tries to be clever and support direct
conversion to and from as many other character sets as possible, which
leads to lots of duplication.

> I also think that, for efficiency, we might want a "7-bit chars match ASCII"
> flag, since most character sets do, and that means that we don't have to deal
> with the overhead for strings that fit in 7 bits.  This smells of premature
> optimization, though, so somebody just file this away in their heads for
> future reference.

I have already been thinking about this, although it gets more
complicated because you have to consider the encoding as well. If you
have a single-byte encoded ASCII string then transcoding to a single-byte
encoded Latin-1 string is a no-op, but that may not be true for
other encodings, if such a thing makes sense for those character types.
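As a rough sketch of what such a check might look like (the flag names,
the structs and transcode_is_noop() are all hypothetical and exist
nowhere in the tree), the test has to look at both the character type
and the encoding on each side before it can skip the work:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flags; neither exists in the real code. */
enum { CHARTYPE_ASCII_7BIT = 1 };   /* codes 0..127 match ASCII */
enum { ENCODING_SINGLE_BYTE = 1 };  /* one byte per character   */

typedef struct { const char *name; unsigned flags; } chartype_info;
typedef struct { const char *name; unsigned flags; } encoding_info;

typedef struct {
    const unsigned char *buf;
    size_t len;
    const chartype_info *type;
    const encoding_info *enc;
} simple_string;

/* Returns 1 if transcoding s to (dst_type, dst_enc) would leave the
 * bytes unchanged, under the assumptions encoded in the flags. */
int transcode_is_noop(const simple_string *s,
                      const chartype_info *dst_type,
                      const encoding_info *dst_enc)
{
    size_t i;

    assert(s != NULL && dst_type != NULL && dst_enc != NULL);

    /* Both character types must agree with ASCII in the low 128 codes. */
    if (!(s->type->flags & CHARTYPE_ASCII_7BIT) ||
        !(dst_type->flags & CHARTYPE_ASCII_7BIT))
        return 0;

    /* Both encodings must store one character per byte. */
    if (!(s->enc->flags & ENCODING_SINGLE_BYTE) ||
        !(dst_enc->flags & ENCODING_SINGLE_BYTE))
        return 0;

    /* Every character must actually fit in 7 bits. */
    for (i = 0; i < s->len; i++)
        if (s->buf[i] > 0x7F)
            return 0;

    return 1;
}
```

The scan over the buffer is the cost of being conservative; caching a
"known 7-bit" bit on the string would avoid it, but that is a further
assumption on top of this sketch.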

> (BTW, for those paying attention, I'm waiting on this discussion for my
> chr/ord patch, since I want them in terms of charsets, not encodings.)

I suspect that the encode and decode methods in the encoding vtable
are enough for doing chr/ord, aren't they?

Surely chr() is just encoding the argument in the chosen encoding (which
can be the default encoding for the char type if you want) and then setting
the type and encoding of the resulting string appropriately.

Equally ord() is decoding the first character of the string to get a
number.
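Something along these lines is what I have in mind. This is only a
sketch: the encoding struct, the small_string type and the toy_chr() and
toy_ord() names are made up for illustration, and the real vtable
entries may well have different signatures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cut-down encoding vtable. */
typedef struct encoding {
    const char *name;
    /* Write c into buf; return bytes written, or 0 if it cannot be encoded. */
    size_t (*encode)(unsigned char *buf, size_t cap, unsigned int c);
    /* Read one character from buf into *c; return bytes consumed, or 0. */
    size_t (*decode)(const unsigned char *buf, size_t len, unsigned int *c);
} encoding;

typedef struct {
    unsigned char buf[8];
    size_t len;
    const char *type;     /* character type name, e.g. "latin-1" */
    const encoding *enc;
} small_string;

static size_t single_byte_encode(unsigned char *buf, size_t cap, unsigned int c)
{
    if (cap < 1 || c > 0xFF)
        return 0;
    buf[0] = (unsigned char)c;
    return 1;
}

static size_t single_byte_decode(const unsigned char *buf, size_t len,
                                 unsigned int *c)
{
    if (len < 1)
        return 0;
    *c = buf[0];
    return 1;
}

const encoding single_byte = {
    "singlebyte", single_byte_encode, single_byte_decode
};

/* chr: encode the number, then tag the result with type and encoding. */
int toy_chr(unsigned int c, const char *type, const encoding *enc,
            small_string *out)
{
    assert(enc != NULL && out != NULL);
    out->len = enc->encode(out->buf, sizeof out->buf, c);
    if (out->len == 0)
        return 0;
    out->type = type;
    out->enc = enc;
    return 1;
}

/* ord: decode the first character of the string to get a number. */
int toy_ord(const small_string *s, unsigned int *c)
{
    assert(s != NULL && c != NULL);
    if (s->len == 0)
        return 0;
    return s->enc->decode(s->buf, s->len, c) != 0;
}
```

Note that neither function needs to know anything about the character
type beyond carrying it along with the string.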

Tom

-- 
Tom Hughes ([EMAIL PROTECTED])
http://www.compton.nu/
