Lars Marius Garshol wrote:

> I've just discovered that it seems that Shift-JIS encodes a number of
> User-Defined Characters in the 0xF040 to 0xFCFC range, and that these


Yes, and every implementor may assign characters to them as they see fit.


> characters are used in web pages. Does anyone know of a source of


The problem is that such pages are most likely all tagged as charset="Shift_JIS", with 
nothing to tell you which variant of the Shift-JIS encoding they actually use. Unreliable 
tagging is very common. That's one good reason why we all advocate Unicode...


> mappings for these characters, or even have information about what
> kinds of characters are found in this area?


Given how many Windows machines there are, and given that Shift-JIS seems to be more 
popular on Windows than on Unixes, let's look at the Shift-JIS<->Unicode mapping table 
for windows-932: 
http://oss.software.ibm.com/cvs/icu/charset/data/xml/windows-932-2000.xml?rev=1.1&content-type=text/x-cvsweb-markup
(From our collection of mapping tables at http://oss.software.ibm.com/icu/charset/)

Shift-JIS F040..F9FC appears to be contiguously and linearly mapped to U+E000..U+E757
(10 lead bytes times 188 valid trail bytes each, i.e. 1880 code points).
Some further Shift-JIS UDCs map to Unicode CJK compatibility characters U+FAxx.
Note that Windows uses some of the Unicode BMP PUA space for CJK characters in Unicode 
mode, for fonts and actual text processing.
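
To make the linear mapping concrete, here is a minimal sketch in Python. It assumes the
windows-932 layout described above (lead bytes F0..F9, trail bytes 40..7E and 80..FC,
with 7F skipped); the function name sjis_udc_to_pua is made up for illustration, and
other Shift-JIS variants may well assign these bytes differently.

```python
def sjis_udc_to_pua(lead: int, trail: int) -> int:
    """Map a windows-932 user-defined double byte to its Unicode PUA code point.

    Hypothetical helper: assumes the linear F040..F9FC -> U+E000..U+E757
    layout seen in the windows-932 mapping table.
    """
    if not 0xF0 <= lead <= 0xF9:
        raise ValueError("lead byte outside the UDC range F0..F9")
    if 0x40 <= trail <= 0x7E:
        offset = trail - 0x40          # trail bytes 40..7E -> 0..62
    elif 0x80 <= trail <= 0xFC:
        offset = trail - 0x41          # trail bytes 80..FC -> 63..187 (7F is skipped)
    else:
        raise ValueError("invalid Shift-JIS trail byte")
    # Each lead byte covers 188 consecutive code points.
    return 0xE000 + (lead - 0xF0) * 188 + offset


if __name__ == "__main__":
    for lead, trail in [(0xF0, 0x40), (0xF0, 0x80), (0xF1, 0x40), (0xF9, 0xFC)]:
        print(f"{lead:02X}{trail:02X} -> U+{sjis_udc_to_pua(lead, trail):04X}")
```

The bytes FA40 and above are deliberately rejected here, since those do not follow the
same linear rule.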

Other Shift-JIS variants from different platforms will use different assignments, but 
I would try the Windows variant first for whatever web page you are looking at. As a 
receiver, maybe you can figure out which platform generated the file, from a <meta> 
tag or from the HTTP server identification.


As a recommendation, if you _have_ to _generate_ Shift-JIS web pages, you should avoid 
UDCs and instead use numeric character references (NCRs) with Unicode non-PUA[!] code 
points.
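
One possible way to follow that recommendation, sketched in Python: encode with the
standard "shift_jis" codec, let its "xmlcharrefreplace" error handler turn unencodable
characters into decimal NCRs, and refuse PUA code points up front. The function name
to_shift_jis_html is hypothetical, and the check only covers the BMP private use area
(U+E000..U+F8FF); treat it as an illustration, not a complete HTML serializer.

```python
def to_shift_jis_html(text: str) -> bytes:
    """Encode text for a Shift_JIS page, using NCRs instead of UDC bytes.

    Hypothetical helper: rejects BMP private-use code points, because an
    NCR for a PUA character is no more portable than a UDC byte pair.
    """
    for ch in text:
        if 0xE000 <= ord(ch) <= 0xF8FF:
            raise ValueError(f"private-use code point U+{ord(ch):04X} in input")
    # Characters the codec cannot encode become decimal NCRs such as &#8364;
    return text.encode("shift_jis", errors="xmlcharrefreplace")


if __name__ == "__main__":
    print(to_shift_jis_html("price: \u20ac10"))
```

Note that this does not escape markup characters such as "<" or "&"; that would still
have to happen before encoding.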

The W3C has a page about the problems with Japanese charset identifiers and mapping 
tables.

markus


