[RE: Win32::OLE->Option('CP') - encoding bug?]
Jan Dubois wrote on 13.10.2010 at 14:50 (-0700):
> >   my $ret = $dom->xml;
> >   # The return value is unreliable. Looks like you'll get Unicode
> >   # characters or legacy bytes depending on content. You can fix this
> >   # starting from Perl 5.8. But should you have to?
> 
> You do get a string of characters. The characters may be UTF-8 encoded
> internally when they cannot be represented by the ANSI codepage.
> 
> In general this should not really matter, as Perl can upgrade/downgrade
> encodings internally as it sees fit. The one problem of course is that
> Win32::OLE uses CP_ACP for the regular encoding whereas Perl internals
> use Latin1, so any code points where CP_ACP is different from Latin1
> will get mangled.

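This concerns 27 characters, more than I thought. Here's a listing of
the bytes where CP1252 (which is what CP_ACP means on a Western European
Windows system) differs from Latin1 aka ISO-8859-1, given as decimal
byte, hex byte, Unicode code point and character name: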

128 80 = U+20AC : EURO SIGN
130 82 = U+201A : SINGLE LOW-9 QUOTATION MARK
131 83 = U+0192 : LATIN SMALL LETTER F WITH HOOK
132 84 = U+201E : DOUBLE LOW-9 QUOTATION MARK
133 85 = U+2026 : HORIZONTAL ELLIPSIS
134 86 = U+2020 : DAGGER
135 87 = U+2021 : DOUBLE DAGGER
136 88 = U+02C6 : MODIFIER LETTER CIRCUMFLEX ACCENT
137 89 = U+2030 : PER MILLE SIGN
138 8A = U+0160 : LATIN CAPITAL LETTER S WITH CARON
139 8B = U+2039 : SINGLE LEFT-POINTING ANGLE QUOTATION MARK
140 8C = U+0152 : LATIN CAPITAL LIGATURE OE
142 8E = U+017D : LATIN CAPITAL LETTER Z WITH CARON
145 91 = U+2018 : LEFT SINGLE QUOTATION MARK
146 92 = U+2019 : RIGHT SINGLE QUOTATION MARK
147 93 = U+201C : LEFT DOUBLE QUOTATION MARK
148 94 = U+201D : RIGHT DOUBLE QUOTATION MARK
149 95 = U+2022 : BULLET
150 96 = U+2013 : EN DASH
151 97 = U+2014 : EM DASH
152 98 = U+02DC : SMALL TILDE
153 99 = U+2122 : TRADE MARK SIGN
154 9A = U+0161 : LATIN SMALL LETTER S WITH CARON
155 9B = U+203A : SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
156 9C = U+0153 : LATIN SMALL LIGATURE OE
158 9E = U+017E : LATIN SMALL LETTER Z WITH CARON
159 9F = U+0178 : LATIN CAPITAL LETTER Y WITH DIAERESIS

http://msdn.microsoft.com/de-de/goglobal/cc305145%28en-us%29.aspx
http://www.microsoft.com/globaldev/reference/wincp.mspx
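
For anyone who wants to reproduce the listing above, here is a minimal
sketch, assuming Perl 5.8 or later with the core Encode and charnames
modules (so it will not run on 5.6.1). The sub name cp1252_vs_latin1 is
my own invention. It treats a byte as "different" when Encode's cp1252
table maps it to a code point other than the byte value itself, and it
skips bytes that CP1252 leaves unassigned.

```perl
use strict;
use warnings;
use Encode ();
use charnames ();

# Return [byte, code point] pairs for every byte in 0x80..0xFF whose
# CP1252 meaning differs from its Latin1 meaning. In Latin1 the code
# point always equals the byte value.
sub cp1252_vs_latin1 {
    my @diffs;
    for my $byte (0x80 .. 0xFF) {
        my $octet = chr $byte;
        my $char  = Encode::decode('cp1252', $octet);
        next unless defined $char && length($char) == 1;
        my $cp = ord $char;
        next if $cp == $byte;     # same meaning as in Latin1
        next if $cp == 0xFFFD;    # byte not assigned in CP1252
        push @diffs, [ $byte, $cp ];
    }
    return @diffs;
}

my @diffs = cp1252_vs_latin1();
for my $pair (@diffs) {
    my ($byte, $cp) = @$pair;
    my $name = charnames::viacode($cp);
    $name = '(unnamed)' unless defined $name;
    printf "%d %02X = U+%04X : %s\n", $byte, $byte, $cp, $name;
}
```

The character names come from whatever Unicode data ships with the Perl
in use, so the wording of individual names may vary between versions.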

> The downgrading of results to CP_ACP is probably a mistake; I can't
> see how this would ever be useful.  It helps scripts that don't know
> how to deal with Unicode strings, but those shouldn't ask for CP_UTF8
> results in the first place.

Thanks. I also think it is a mistake.

> The internal confusion between Latin1 and CP_ACP is harder to deal
> with: the core text functions all assume Latin1, and the filesystem
> APIs all assume CP_ACP.  So if we were to fix this to always assume
> Latin1 internally, then all scripts that read filenames from backticks/
> qx(), or receive them from GUI dialogs, or read them from ANSI encoded
> text files will break unless they convert them to Latin1 explicitly.
> 
> Maybe that breakage is necessary eventually, but it won't happen for
> Perl 5.14, so any change there is a long way off.

I was blissfully unaware of this issue.

> > A data-dependent return value encoding is difficult to work with.
> 
> Ignoring the CP_ACP/Latin1 issue, why does it matter which internal
> encoding is used for your strings?

It matters because I am using Perl 5.6.1 (yes) to store that string in
a database that accepts XML documents. When the string includes Greek
or Russian characters, it comes back UTF-8 encoded; when all the
characters are < 256, it is not UTF-8 encoded, and storing it results
in a parse error.

The (crappy) solution I have now is to try to store the string; if that
fails, I encode it into UTF-8 and try again.
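
On Perl 5.8 or later the retry should not be needed. A minimal sketch,
assuming the core Encode module is available (it is not on 5.6.1) and
that the value is a proper character string: encode unconditionally
right before storing. The helper name xml_as_utf8_octets and the sample
strings are made up for illustration. This does nothing about the
CP_ACP/Latin1 mismatch discussed above.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: turn a Perl character string into UTF-8 octets,
# whatever its internal representation happens to be.
sub xml_as_utf8_octets {
    my ($xml) = @_;
    return Encode::encode('UTF-8', $xml);
}

# One string with only characters < 256, one with Greek letters.
my $latin_only = "<p>caf\xE9</p>";
my $with_greek = "<p>\x{3B1}\x{3B2}</p>";

my $octets_latin = xml_as_utf8_octets($latin_only);
my $octets_greek = xml_as_utf8_octets($with_greek);

# Both results are plain byte strings holding valid UTF-8.
printf "%d octets, %d octets\n", length $octets_latin, length $octets_greek;
```

The point of the design is that the caller never inspects the content:
the same call handles both kinds of string.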

> You commented out the line that put STDOUT into Unicode mode:
> 
> # binmode STDOUT, ':utf8' unless $P56;
> 
> But if you re-activate the line, then you will see that the characters
> are written out the same way, regardless of the way they have been
> encoded internally.

Yes. It is easier from 5.8 onwards.
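
To convince myself, here is a small sketch for Perl 5.8 or later. It
writes the same string twice through a ':utf8' layer, once stored
internally as Latin1 bytes and once upgraded to UTF-8, and compares what
comes out. It uses an in-memory filehandle instead of STDOUT so the
octets can be inspected; the sub name octets_written is my own.

```perl
use strict;
use warnings;

# Write a string through a ':utf8' output layer into an in-memory
# buffer and return the octets that were written.
sub octets_written {
    my ($string) = @_;
    my $buffer = '';
    open my $fh, '>:utf8', \$buffer
        or die "cannot open in-memory handle: $!";
    print {$fh} $string;
    close $fh or die "cannot close in-memory handle: $!";
    return $buffer;
}

# Same characters, two internal representations.
my $downgraded = "caf\xE9";      # stored as Latin1 bytes
my $upgraded   = $downgraded;
utf8::upgrade($upgraded);        # stored as UTF-8 internally

my $out_downgraded = octets_written($downgraded);
my $out_upgraded   = octets_written($upgraded);

print $out_downgraded eq $out_upgraded
    ? "same octets\n"
    : "different octets\n";
```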

Thanks for your help!
-- 
Michael Ludwig
_______________________________________________
ActivePerl mailing list
[email protected]
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs
