On Dec 27, 7:37 pm, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> Certainly. ISO-2022 is famous for having ambiguous encodings. Try
> these:
>
> unicode("Hallo","iso-2022-jp")
> unicode("\x1b(BHallo","iso-2022-jp")
> unicode("\x1b(JHallo","iso-2022-jp")
> unicode("\x1b(BHal\x1b(Jlo","iso-2022-jp")
>
> or likewise
>
> unicode("[EMAIL PROTECTED]","iso-2022-jp")
> unicode("\x1b$BBB","iso-2022-jp")
>
> In iso-2022-jp-3, there are even more ways to encode the same string.

Wow, it's not easy to see why anyone would ever want that. Is there
any logic behind this?

In your samples, both unicode("\x1b(BHallo","iso-2022-jp") and
unicode("\x1b(JHallo","iso-2022-jp") give u"Hallo". Does this mean
that the bytes that disappear from the original strings are not
illegal, but simply *represent nothing* in this encoding?

I.e., in practice (in a context limited to the encoding in question),
should this be considered data loss, or should these strings be
considered "equivalent"?
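
For reference, here is a minimal sketch of how I am comparing them. It
assumes Python 3 spelling (bytes.decode) rather than the Python 2
unicode() calls quoted above, and the names "variants", "decoded" and
"roundtrip" are just mine. As far as I understand, ESC ( B and ESC ( J
are designations that select ASCII and JIS X 0201 Roman respectively,
which would make them state switches rather than characters.

```python
# Python 3 spelling of the quoted unicode(..., "iso-2022-jp") calls.
# Each byte string differs only in the escape sequences it carries.
variants = [
    b"Hallo",                  # no escape sequence at all
    b"\x1b(BHallo",            # ESC ( B before the text
    b"\x1b(JHallo",            # ESC ( J before the text
    b"\x1b(BHal\x1b(Jlo",      # switch in the middle of the text
]

# Decode every variant with the same codec.
decoded = [raw.decode("iso-2022-jp") for raw in variants]

for raw, text in zip(variants, decoded):
    print(repr(raw), "->", repr(text))

# Encode the text again to see which byte form the codec itself picks.
roundtrip = "Hallo".encode("iso-2022-jp")
print(repr(roundtrip))
```

If all four decode to the same text, then comparing the decoded text
rather than the raw bytes looks like the safer equality test.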

Thanks!

mario
-- 
http://mail.python.org/mailman/listinfo/python-list
