Re: [Python-Dev] Unicode byte order mark decoding

Walter Dörwald Wed, 06 Apr 2005 04:48:52 -0700

Stephen J. Turnbull wrote:

"Martin" == Martin v. Löwis <[EMAIL PROTECTED]> writes:


    Martin> I can't put these two paragraphs together. If you think
    Martin> that explicit is better than implicit, why do you not want
    Martin> to make different calls for the first chunk of a stream,
    Martin> and the subsequent chunks?

Because the signature/BOM is not a chunk, it's a header.  Handling the
signature/BOM is part of stream initialization, not translation, to my
mind.

The point is that explicitly using a stream shows that initialization
(and finalization) matter.  The default can be BOM or not, as a
pragmatic matter.  But then the stream data itself can be treated
homogeneously, as implied by the notion of stream.

I think it probably also would solve Walter's conundrum about
buffering the signature/BOM if responsibility for that were moved out
of the codecs and into the objects where signatures make sense.

Not really. In every encoding where a sequence of more than one byte maps to one Unicode character, you will always need some kind of buffering. If we remove the handling of initial BOMs from the codecs (except for UTF-16 where it is required), this wouldn't change any buffering requirements.
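
To illustrate the buffering point, here is a minimal sketch. It assumes a current Python 3 and the incremental decoder interface of the `codecs` module, neither of which is taken from this thread. It feeds the two UTF-8 bytes of one character to a decoder separately, the way a stream might deliver them:

```python
import codecs

# U+00F6 (ö) is encoded as two bytes in UTF-8: 0xC3 0xB6.
data = "ö".encode("utf-8")

decoder = codecs.getincrementaldecoder("utf-8")()

# Feed the bytes one at a time, as a stream might deliver them.
first = decoder.decode(data[:1])   # incomplete sequence: the byte is buffered
second = decoder.decode(data[1:])  # sequence completed: the character comes out

print(repr(first), repr(second))
```

The first call has to hold on to the lead byte and return an empty string. No BOM is involved, so this buffering is needed whether or not the codec also handles a signature.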

I don't know whether that's really feasible in the short run---I
suspect there may be a lot of stream-like modules that would need to
be updated---but it would be saner in the long run.

I'm not exactly sure what you're proposing here. That all codecs (even UTF-16) pass the BOM through and some other infrastructure is responsible for dropping it?
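
For concreteness, this sketch shows what "passing the BOM through" versus "dropping it" looks like at the codec level. It assumes a current Python 3, whose codec names and behaviour may differ from the codecs under discussion in this thread, and the sample text is made up for illustration:

```python
import codecs

text = "abc"  # hypothetical sample payload
utf16 = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
utf8 = codecs.BOM_UTF8 + text.encode("utf-8")

a = utf16.decode("utf-16")     # BOM consumed to determine the byte order
b = utf16.decode("utf-16-le")  # byte order fixed: BOM passed through as U+FEFF
c = utf8.decode("utf-8")       # BOM passed through as U+FEFF
d = utf8.decode("utf-8-sig")   # signature dropped by the codec

print(repr(a), repr(b), repr(c), repr(d))
```

In the pass-through cases the caller, or some other layer, has to remove the leading U+FEFF itself.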

[...]


Bye,
   Walter Dörwald
