Lars Marius Garshol wrote:
> I've just discovered that it seems that Shift-JIS encodes a number of
> User-Defined Characters in the 0xF040 to 0xFCFC range, and that these
Yes, and every implementor may assign characters to them as they see
fit.

> characters are used in web pages. Does anyone know of a source of

The problem is that such pages are most likely all tagged as
charset="Shift_JIS", with nothing to distinguish which variant of the
Shift-JIS encoding is actually in use. Unreliable tagging is very
common. That's one good reason why we all advocate Unicode...

> mappings for these characters, or even have information about what
> kinds of characters are found in this area?

Given how many Windows machines there are, and given that Shift-JIS
seems to be more popular on Windows than on Unixes, let's look at the
Shift-JIS<->Unicode mapping table for windows-932:

http://oss.software.ibm.com/cvs/icu/charset/data/xml/windows-932-2000.xml?rev=1.1&content-type=text/x-cvsweb-markup

(From our collection of mapping tables at
http://oss.software.ibm.com/icu/charset/)

Shift-JIS F040..F9FC appears to be contiguously and linearly mapped to
U+E000..U+E757. Some further Shift-JIS UDCs map to Unicode CJK
compatibility characters U+FAxx.

Note that Windows uses some of the Unicode BMP PUA space for CJK
characters in Unicode mode, for fonts and actual text processing.

Other Shift-JIS variants from different platforms will use different
assignments, but I would try the Windows variant first for whatever web
page you are looking at. As a receiver, you may be able to figure out
which platform generated the file from a <meta> tag or an HTTP server
identification.

As a recommendation, if you _have_ to _generate_ Shift-JIS web pages,
you should avoid UDCs and instead use NCRs (with Unicode non-PUA[!]
code points).

The W3C has a page about the problems with Japanese charset identifiers
and mapping tables.

markus
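
P.S. A minimal sketch of the linear windows-932 mapping described above, in Python. It assumes each lead byte F0..F9 carries the usual 188 Shift-JIS trail bytes (40..7E and 80..FC), which is consistent with the 0x758 code points in U+E000..U+E757. The name udc_to_pua is made up for illustration, and the further UDCs that map to U+FAxx are not covered because they do not follow this rule.

```python
def udc_to_pua(lead: int, trail: int) -> str:
    """Map a windows-932 user-defined double-byte code to a PUA character.

    Hypothetical helper: assumes the linear F040..F9FC -> U+E000..U+E757
    layout, with 188 valid trail bytes per lead byte.
    """
    if not 0xF0 <= lead <= 0xF9:
        raise ValueError(f"lead byte {lead:#04x} is outside F0..F9")
    if 0x40 <= trail <= 0x7E:
        offset = trail - 0x40
    elif 0x80 <= trail <= 0xFC:
        # 0x7F is never a trail byte, so the upper run starts at offset 63.
        offset = trail - 0x41
    else:
        raise ValueError(f"trail byte {trail:#04x} is not a valid trail byte")
    # 188 code points per lead byte, starting at U+E000 for lead byte F0.
    return chr(0xE000 + (lead - 0xF0) * 188 + offset)
```

For example, udc_to_pua(0xF0, 0x40) gives U+E000 and udc_to_pua(0xF9, 0xFC) gives U+E757, the two ends of the range. Other Shift-JIS variants would need their own tables.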

