On Wed, Sep 17, 2014 at 3:46 AM, R. David Murray <rdmur...@bitdance.com> wrote:
>> You can't treat them as characters, so while you have them in your
>> string, you can't treat it as a pure Unicode string - it's a Unicode
>> string with smuggled bytes.
>
> Well, except that I do. The email header parsing algorithms all work
> fine if I treat the surrogate escaped bytes as 'unknown junk' and just
> parse based on the valid unicode. (Unless the header is so garbled that
> it can't be parsed, of course, at which point it becomes an invalid
> header).

Do what, exactly? As I understand you, you treat the unknown bytes as
completely opaque, not representing any characters at all. Which is
what I'm saying: those are not characters. If you, instead, represented
the header as a list with some str elements and some bytes, it would be
just as valid (though much harder to work with); all your manipulations
are done on the str parts, and the bytes just tag along for the ride.

> You are right about the wrapping, though. If a header with invalid
> bytes (and in this scenario we *are* talking about errors) needs to
> be wrapped, we have to first decode the smuggled bytes and turn it
> into an 'unknown-8bit' encoded word before we can wrap the header.

Yeah, and that's going to be a bit messy. If you get 60 characters
followed by 30 unknown bytes, where do you wrap it? Dare you wrap in
the middle of the smuggled section?

ChrisA
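
To make the "list with some str elements and some bytes" idea concrete, here is a minimal sketch. It assumes the header was decoded as ASCII with the standard surrogateescape error handler; the names split_smuggled and _SMUGGLED are made up for illustration, and this is not how the email package does it internally.

```python
import re

# The 'surrogateescape' handler maps each undecodable byte 0x80..0xFF
# to a lone surrogate U+DC80..U+DCFF, so a run of those code points
# is a run of smuggled bytes.
_SMUGGLED = re.compile(r'[\udc80-\udcff]+')


def split_smuggled(text):
    """Split text into str pieces and bytes pieces (hypothetical helper).

    Real characters stay as str; each run of surrogate-escaped code
    points is turned back into the original bytes.
    """
    parts = []
    pos = 0
    for match in _SMUGGLED.finditer(text):
        if match.start() > pos:
            parts.append(text[pos:match.start()])
        # Encoding with the same handler recovers the raw bytes.
        parts.append(match.group().encode('ascii', 'surrogateescape'))
        pos = match.end()
    if pos < len(text):
        parts.append(text[pos:])
    return parts


raw = b'Subject: caf\xe9 menu'            # one byte that is not ASCII
header = raw.decode('ascii', 'surrogateescape')
parts = split_smuggled(header)
roundtrip = header.encode('ascii', 'surrogateescape')

print(parts)
print(roundtrip == raw)
```

The str pieces can be parsed as ordinary text while the bytes pieces tag along untouched, and encoding with the same handler gives back the original bytes. The sketch says nothing about where to wrap; that question is still open.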