Re: Fun with UDCs in Shift-JIS

Markus Scherer Thu, 17 Jan 2002 09:56:46 -0800

Lars Marius Garshol wrote:

> I've just discovered that it seems that Shift-JIS encodes a number of
> User-Defined Characters in the 0xF040 to 0xFCFC range, and that these

Yes, and every implementor may assign characters to them as they see fit.

> characters are used in web pages. Does anyone know of a source of

The problem is that they are most likely all tagged as charset="Shift_JIS", with nothing
to distinguish which variant of the Shift-JIS encoding is actually in use. Unreliable
tagging is very common. That's one good reason why we all advocate Unicode...

> mappings for these characters, or even have information about what
> kinds of characters are found in this area?

Given how many Windows machines there are, and given that Shift-JIS seems to be more
popular on Windows than on Unixes, let's look at the Shift-JIS<->Unicode mapping table
for windows-932:
http://oss.software.ibm.com/cvs/icu/charset/data/xml/windows-932-2000.xml?rev=1.1&content-type=text/x-cvsweb-markup
(From our collection of mapping tables at http://oss.software.ibm.com/icu/charset/)

Shift-JIS F040..F9FC appears to be contiguously and linearly mapped to U+E000..U+E757.
Some further Shift-JIS UDCs map to Unicode CJK compatibility characters U+FAxx.
Note that Windows uses some of the Unicode BMP PUA space for CJK characters in Unicode
mode, for fonts and actual text processing.
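
To make the linear mapping concrete, here is a minimal Python sketch. It assumes the
windows-932 layout described above (lead bytes F0..F9, trail bytes 40..7E and 80..FC,
so 188 code points per lead byte) and is not taken from the mapping table itself; the
function name sjis_udc_to_pua is made up for illustration.

```python
def sjis_udc_to_pua(lead: int, trail: int) -> str:
    """Map a windows-932 UDC byte pair (F040..F9FC) to a BMP PUA character.

    Assumption: the mapping is contiguous and linear, starting at U+E000.
    """
    if not 0xF0 <= lead <= 0xF9:
        raise ValueError(f"lead byte 0x{lead:02X} is outside the UDC range F0..F9")
    if not (0x40 <= trail <= 0x7E or 0x80 <= trail <= 0xFC):
        raise ValueError(f"0x{trail:02X} is not a valid Shift-JIS trail byte")

    # 0x7F is never a trail byte, so each lead byte covers
    # 63 (0x40..0x7E) + 125 (0x80..0xFC) = 188 code points.
    offset = trail - 0x40 if trail < 0x80 else trail - 0x41
    return chr(0xE000 + 188 * (lead - 0xF0) + offset)


if __name__ == "__main__":
    for lead, trail in [(0xF0, 0x40), (0xF0, 0x80), (0xF9, 0xFC)]:
        ch = sjis_udc_to_pua(lead, trail)
        print(f"{lead:02X}{trail:02X} -> U+{ord(ch):04X}")
```

The two end points reproduce the range given above: F040 lands on U+E000 and F9FC on
U+E757. This only covers the windows-932 variant; other platforms need their own table.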

Other Shift-JIS variants from different platforms will use a different assignment, but
I would try the Windows variant first for whatever web page you are looking at. As a
receiver, maybe you can figure out which platform generated the file, from a <meta>
tag or an http server identification.

As a recommendation, if you _have_ to _generate_ Shift-JIS web pages, you should avoid
UDCs and instead use NCRs (with Unicode non-PUA[!] code points).
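
A rough sketch of that recommendation in Python, assuming the page text is already
available as a Unicode string. It relies on the standard "shift_jis" codec and the
"xmlcharrefreplace" error handler to turn anything unencodable into a decimal NCR, and
it refuses private-use characters outright since an NCR for a PUA code point would be
no more portable than a UDC. The function name encode_sjis_with_ncrs is hypothetical.

```python
import unicodedata


def encode_sjis_with_ncrs(text: str) -> bytes:
    """Encode text as Shift-JIS, writing unencodable characters as NCRs.

    Private-use characters (general category "Co") are rejected, because
    their meaning depends on a private agreement the reader may not share.
    """
    for ch in text:
        if unicodedata.category(ch) == "Co":
            raise ValueError(f"private-use code point U+{ord(ch):04X} in text")

    # Characters without a Shift-JIS encoding become "&#NNNN;" references.
    return text.encode("shift_jis", errors="xmlcharrefreplace")


if __name__ == "__main__":
    # U+20AC EURO SIGN has no Shift-JIS encoding, so it becomes an NCR.
    print(encode_sjis_with_ncrs("A\u20ac"))
```

Note that the output is only meaningful in contexts where NCRs are interpreted, i.e.
HTML/XML element content and attribute values, not inside scripts or plain text.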

The W3C has a page about the problems with Japanese charset identifiers and mapping
tables.

markus
