On Nov 14, 2009, at 15:44, MacRuby wrote:
> #339: YAML error with UTF-16 string
> ---------------------------+------------------------------------------------
> Reporter: d...@… | Owner: lsansone...@…
> Type: defect | Status: closed
> Priority: critical | Milestone: MacRuby 0.5
> Component: MacRuby | Resolution: fixed
> Keywords: YAML encoding |
> ---------------------------+------------------------------------------------
>
> Comment(by jazz...@…):
>
> {{{
> $ macruby -e 'require "yaml"; puts "Rübe".to_yaml'
> --- "R\xFCbe"
> $ ruby1.9 -e 'require "yaml"; puts "Rübe".to_yaml'
> --- "R\xC3\xBCbe"
> }}}
>
> seems to work now! MacRuby escapes to UTF-16 and Ruby1.9 escapes to
> UTF-8.
Actually, it seems to me (though I'm willing to be corrected on this) that the
ruby1.9 encoding is simply wrong: it translates the accented character into
UTF-8 and then escapes the two UTF-8 bytes separately. What this ends up
encoding is "RÃ¼be", which is not what you want.
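To make the difference concrete, here is a minimal Ruby sketch. The helper
`decode_x_escapes` is hypothetical (it is not part of any YAML library); it
assumes that each \xNN escape denotes the code point U+00NN, which is how I
read the spec.

```ruby
# Hypothetical helper: reads every \xNN escape as the code point U+00NN.
def decode_x_escapes(text)
  text.gsub(/\\x(\h\h)/) { [$1.hex].pack("U") }
end

# Single-quoted literals, so the backslashes stay literal characters.
macruby_output = 'R\xFCbe'
ruby19_output  = 'R\xC3\xBCbe'

puts decode_x_escapes(macruby_output)  # Rübe
puts decode_x_escapes(ruby19_output)   # RÃ¼be (U+00C3 followed by U+00BC)
```

Under that reading, only the MacRuby output round-trips to the original
string.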
> I didn't find anything in YAML docs that describes that behaviour, both
> methods seem to be correct.
They can't possibly be BOTH correct, as interpreting the output of one
according to the theory of the other would give a different result. If you look
at the section in the YAML spec:
<http://www.yaml.org/spec/1.2/spec.html#id2776092>, you will see
[57] "Escaped 8-bit Unicode character."
This is NOT a UTF-8 character: \xNN stands for the single code point U+00NN,
not for one byte of a UTF-8 sequence.
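As an illustration of what a writer following that section might emit, here is
a rough sketch. `yaml_escape` is a hypothetical helper, not the MacRuby
implementation; it assumes the three numeric escape forms of the spec (\x with
2 hex digits, \u with 4, \U with 8) and simply picks the shortest one that
fits the code point.

```ruby
# Hypothetical helper: shortest YAML numeric escape for one code point.
# The format strings are single-quoted, so the backslash is literal.
def yaml_escape(codepoint)
  if codepoint <= 0xFF
    format('\x%02X', codepoint)   # 8-bit form
  elsif codepoint <= 0xFFFF
    format('\u%04X', codepoint)   # 16-bit form
  else
    format('\U%08X', codepoint)   # 32-bit form
  end
end

puts yaml_escape(0xFC)     # \xFC (the "ü" in "Rübe")
puts yaml_escape(0x20AC)   # \u20AC
```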
> But ruby 1.8 fails to load the UTF-16 YAML. That is not astonishing because
> IMHO there is no way to guess what is the correct escaping mode.
It's not astonishing because (a) 1.8 has very poor Unicode support anyway and
(b) this would hardly be the only bug in syck.
> I think escaping is not necessary here because the encoding of input and
> output is the same. This can easily be tested by
>
> {{{
> $ macruby -e 'require "yaml"; puts YAML::load "--- Rübe"'
> Rübe
> }}}
That's an interesting point. I think you're right that the YAML spec does not
require escaping of printable characters >\u007F. However, non-printable
characters DO have to be escaped, and for the printable ones, it could be
argued that erring on the side of escaping helps readability if the OS does not
have font coverage for some printable characters. In any case, the current
implementation tries to be conservative in what it generates and liberal in
what it accepts. I'm open to persuasion that we should avoid escaping
characters, provided there is a low-cost test for printability of general
Unicode characters (I have not yet checked whether one of the built-in
CFCharacterSets can give that; the descriptions were inconclusive).
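One candidate for such a low-cost test, sketched here in Ruby rather than
against CFCharacterSet, is a plain range check against the c-printable
production of the YAML 1.2 spec. The names `YAML_PRINTABLE`, `yaml_printable?`
and `needs_escape?` are hypothetical, and this only answers "does YAML allow
this character unescaped", not "does the OS have a glyph for it".

```ruby
# Code point ranges taken from the YAML 1.2 c-printable production.
YAML_PRINTABLE = [
  0x09..0x09, 0x0A..0x0A, 0x0D..0x0D, 0x20..0x7E,
  0x85..0x85, 0xA0..0xD7FF, 0xE000..0xFFFD, 0x10000..0x10FFFF
].freeze

# True if the code point may appear unescaped in a YAML stream.
def yaml_printable?(codepoint)
  YAML_PRINTABLE.any? { |range| range.cover?(codepoint) }
end

# True if at least one character of the string has to be escaped.
def needs_escape?(string)
  string.each_codepoint.any? { |cp| !yaml_printable?(cp) }
end

puts needs_escape?("R\u00FCbe")  # false
puts needs_escape?("bell\a")     # true
```

The cost is a handful of integer comparisons per character, so it should be
cheap enough to run on every string that gets dumped.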
Matthias
_______________________________________________
MacRuby-devel mailing list
http://lists.macosforge.org/mailman/listinfo.cgi/macruby-devel