RE: cp1252 decoder implementation

Shawn Steele Wed, 21 Nov 2012 09:12:54 -0800

I’ll be more definitive than Murray ☺  Our legacy code pages aren’t going to 
change.  We won’t add more characters to 1252.  We won’t add new code pages.  
We aren’t going to change names (since that’ll break anyone already using them), 
and we probably won’t recognize new names (since a new name wouldn’t work on 
millions of existing computers, so no one would adopt it).


The churn is too painful for customers.  If there’s a new character that 
everyone “must” use, we’ll point them at UTF-8 or UTF-16.  Any request to 
change codepage behavior would have to meet a very high bar.

The status of these 5 characters is already in the best fit mappings document 
pointed to by the IANA registry entry for windows-1252, which is as strong as 
I’m willing to go for them.
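
For anyone implementing a decoder, here is a minimal sketch (not part of the original message) of how those 5 positions might be handled. It assumes the 5 characters are bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D, which Python's built-in cp1252 codec leaves undefined, and it assumes a lenient decoder maps each of them to the C1 control code point with the same value, which is my reading of the best fit mappings document. The name decode_cp1252_lenient is hypothetical.

```python
# Bytes that Python's strict cp1252 codec treats as undefined (assumed here
# to be the "5 characters" under discussion).
UNDEFINED_CP1252 = frozenset({0x81, 0x8D, 0x8F, 0x90, 0x9D})


def decode_cp1252_lenient(data: bytes) -> str:
    """Decode cp1252, passing the 5 undefined bytes through as C1 controls.

    Assumption: each undefined byte maps to the code point with the same
    numeric value (for example 0x81 -> U+0081). All other bytes use the
    standard cp1252 mapping.
    """
    out = []
    for b in data:
        if b in UNDEFINED_CP1252:
            out.append(chr(b))  # same-valued C1 control code point
        else:
            out.append(bytes([b]).decode("cp1252"))
    return "".join(out)


def is_strictly_decodable(data: bytes) -> bool:
    """Return True if Python's strict cp1252 codec accepts the input."""
    try:
        data.decode("cp1252")
    except UnicodeDecodeError:
        return False
    return True
```

With this sketch, is_strictly_decodable(b"\x81") is False while decode_cp1252_lenient(b"\x81") returns a one-character string, so the two behaviors can be compared on real data before choosing one.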

The last thing I did WRT code page standards was to ask for the best fit 
mappings to be posted so that the IANA charset registry would have something to 
reference to clarify the existing names.  It’s possible (if I find the time) 
that a few of the IANA charset entries could be updated to emphasize that some 
common names have differing implementations by different vendors/OS’s such as 
was done for shift_jis http://www.iana.org/assignments/charset-reg/shift_jis or 
the updates to point out the best fit mapping for 1252 at 
http://www.iana.org/assignments/charset-reg/windows-1252  In other words, the 
trend is to clarify that there are variations in behavior, and to please use 
Unicode.

Also see:
http://blogs.msdn.com/b/shawnste/archive/2007/09/24/are-we-going-to-update-or-maintain-the-best-fit-or-code-page-mappings.aspx
http://blogs.msdn.com/b/shawnste/archive/2008/01/17/code-pages-and-security-issues.aspx
http://blogs.msdn.com/b/shawnste/archive/2007/03/20/some-reasons-to-make-your-application-unicode.aspx

(and 
http://blogs.msdn.com/b/shawnste/archive/2012/06/16/building-the-lego-disney-wonder.aspx
 just because I think it’s cool)

I can see why HTML5 might think windows-1252 support is a good idea, but 
personally I’d’ve been happier if it wasn’t a requirement.  Too much code page 
corruption happens on the web, and most of the badly-tagged content probably 
misdeclares itself as 1252.  UTF-8 is a WAY better choice, particularly for the 
characters in the set supported by windows-1252.

-Shawn
( )

SSDE,
Microsoft

From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf 
Of Murray Sargent
Sent: Tuesday, November 20, 2012 8:55 PM
To: verd...@wanadoo.fr; Doug Ewell
Cc: Unicode Mailing List; Buck Golemon
Subject: RE: cp1252 decoder implementation

Phillipe commented: “(even if later Microsoft decides to map some other 
characters in its own "windows-1252" charset, like it did several times and 
notably when the Euro symbol was mapped)”.

Personal opinion, but I’d be very surprised if Microsoft ever changed the 1252 
charset. The euro was added back in 1999 when code pages were still used a lot. 
Code pages in general are pretty much irrelevant today except for reading 
legacy documents. They are virtually never used internally in modern software. 
UTF-8, UTF-16, and UTF-32 are what are used these days.

(But code pages do have the advantage that they are associated with specific 
character repertoires, which amounts to a great hint for font binding…)

Murray
