Bruno Haible wrote in
 <4298913.vrqWZg68TM@omega>:
 |Steffen Nurpmeso wrote:
 |>  ...
 |>| [.] "UTF-7"."
 |> 
 |> That is overshoot.
 |
 |No. UTF-7 is invalid here because it produces output that is not NUL
 |terminated. See:
 |
 |$ printf 'ab\0' | iconv -t UTF-7 | od -t c
 |0000000   a   b   +   A   A   A   -
 |0000007
 |
 |strlen() on such a return value makes invalid memory accesses.
 |You can convince yourself by running
 |$ OUTPUT_CHARSET=UTF-7 valgrind ls --help

This is then surely bogus?  UTF-7 is a normal single byte
character set and is to be terminated like anything else.  Nothing
in RFC 2152, nor in RFC 3501 if you want, makes me think
otherwise.  (RFC 5092 "IMAP URL Scheme", which invents the
sane-enough-to-think-yourself "UTF-7 -> UTF-16 -> UCS-4 ->
UTF-8 -> HEX" conversion scheme, and its reverse, even implies the
opposite: the example functions both NUL terminate the string.)
Except that Mark Davis said something like "UTF-7 was a failure"
once on the Unicode ML, if i recall correctly, and i surely added
"sadly", given the Punycode mess with domain names.
But that is one more ship that has sailed.  A pity it is.
Why should NUL be treated differently??  No.  No, i think it is
a bug in GNU iconv that no one stumbled upon because no one is using
UTF-7.  Heck, how about this, for example:

  LC_ALL=C printf 'ab\0' |  iconv -f iso-8859-1 -t utf-16 | od -t c
  0000000  \0  \0   a  \0   b  \0  \0  \0

Two leading NULs?

  LC_ALL=C printf 'ab\0' |  iconv -f iso-8859-1 -t ucs-2 | od -t c
  0000000   a  \0   b  \0  \0  \0

That yes.

  LC_ALL=C printf 'ab\0' |  iconv -f iso-8859-1 -t utf-8 | od -t c
  0000000   a   b  \0

Yes.

  LC_ALL=C printf 'ab\0' |  iconv -f iso-8859-1 -t utf-7 | od -t c
  0000000   a   b   +   A   A   A   -

No.  Somehow they are all bogus; take SunOS 5.10:

  LC_ALL=C printf 'ab\0' |  iconv -f iso-8859-1 -t utf-16 | od -t c
  0000000 376 377  \0   a  \0   b  \0  \0

Ooh, now it gets scary!!  Interestingly, OpenBSD 7.1 behaves the
same as SunOS; likely it is thus an older instance of GNU iconv:
there it says "GNU libiconv 1.16", here it says
"iconv (GNU libc) 2.35".
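
For reference, 376 377 is the octal spelling of the bytes 0xFE 0xFF,
which is the UTF-16 big-endian byte order mark.  A minimal sketch in
Python follows.  It uses Python's own codecs and not iconv, so it
only illustrates the byte layout; that the SunOS and OpenBSD builds
emit exactly a BOM followed by big-endian code units is an
assumption read off the dump above.

```python
import codecs

text = "ab\0"

# Big-endian UTF-16 with an explicit byte order mark in front
# (assumption: this mirrors what the iconv builds above emit).
encoded = codecs.BOM_UTF16_BE + text.encode("utf-16-be")

# Spell every byte as three octal digits, the way od shows
# non-printable bytes.
octal = " ".join(format(b, "03o") for b in encoded)
print(octal)
```

Here 141 and 142 are "a" and "b"; the rest matches the dump byte for
byte.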

So unless someone convinces me otherwise, you are arguing based on
buggy software.  UTF-7 is just another 7-bit single byte character
set, and is thus to be terminated like any other.
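
The UTF-7 conversion in question can also be tried outside of iconv.
The following is only a sketch with Python's "utf-7" codec, which is
a different implementation and stands in for iconv here purely as
an assumption; it says nothing about which behaviour is the correct
one.

```python
text = "ab\0"

# Convert including the trailing NUL, as printf 'ab\0' | iconv does.
encoded = text.encode("utf-7")
print(encoded)

# Is there any zero byte in the converted output that a C strlen()
# could stop at?
has_zero_byte = 0 in encoded
print(has_zero_byte)

# Converting back recovers the NUL character.
decoded = encoded.decode("utf-7")
print(decoded == text)
```

So at least this codec writes U+0000 in the "+...-" form, too, and
restores it when decoding.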

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)
