2012/5/30 "Martin J. Dürst" <due...@it.aoyama.ac.jp>:
> On 2012/05/30 4:42, Roozbeh Pournader wrote:
>
>> Just look what happened when the Japanese did their own font/character set
>> hack. The backslash/yen problem is still with us, to this day...
>
> To be fair, the Japanese Yen at 0x5C was there long before Unicode, in the
> Japanese version of ISO 646. That it has remained as a font hack is very
> unfortunate, but for that, not only the Japanese, but also major
> international vendors are to blame.

As long as it was part of the Japanese version of ISO 646 (which itself was only the first page of the SJIS encoding), there was absolutely NO problem at all. This was no different from the situation of all the other national versions of ISO 646, which were all distinct encodings. It became a problem when the Japanese ISO 646 started to be mapped to Unicode/ISO/IEC 10646 within fonts using incorrect mappings. This occurred in the early stages of ISO/IEC 10646 development. Unfortunately, several OSes for Japan used those incorrect mappings, assuming that it was still safe to blindly convert texts containing backslashes by showing yen symbols instead, just as the same systems blindly converted US-ASCII (the American version of ISO 646) into SJIS with broken algorithms, simply because that software could not really work with Unicode but still worked only with SJIS, and did not correctly track which source encoding was used.

This would probably not have occurred if Japan had defined and standardized an ISO 8859 part that mapped the yen outside of ASCII (along with the basic kana letters and Asian punctuation); but they preferred to develop only SJIS to support kanji (and later the emerging UCS remapped onto it). Such a part would also have offered an easier migration path. They were ambitious at the beginning, but the ambition was premature when the surrounding technologies needed to support a large character set were still very incomplete (forcing a lot of software to use unsafe/lossy remappings to smaller character sets). So for several decades, there have been a lot of interoperability problems caused by the various implementations of SJIS, many of them incompatible with each other in their limitations or in the way the "simplifications" were applied to support different parts of it.
The backslash character, though common in many programming languages and OSes, then appeared to be replaced there by the yen symbol, and people were trained to use it (for example in pathnames on DOS/Windows filesystems, or as the escape prefix when programming in C/C++). The backslash was then perceived by them as a variant form of their yen symbol that they did not need (SJIS was later adapted to map the backslash somewhere else, but SJIS users did not immediately fix it). As a result, the mapping of 0x5C in SJIS has always been ambiguous, depending on the implementation, but it has never been ambiguous in the Japanese version of ISO 646, which did not include the backslash. So don't criticize ISO 646; there was no problem there. The problem lies fully within the early versions of SJIS, which allowed such variation of glyphs when they should have treated the yen symbol and the backslash as distinct abstract characters requiring separate mappings. But who uses the Japanese version of ISO 646 in Japan now? Only SJIS seems to survive, with all its intrinsic ambiguities and its many incompatible implementations (whose exact versions are most often not identified correctly by software).
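To make the ambiguity of 0x5C concrete, here is a minimal Python sketch. The two tables are hand-written for illustration only (they cover just the one byte, and their names are hypothetical, not taken from any library): one stands for the Japanese version of ISO 646, where 0x5C is the yen sign, the other for US-ASCII, where it is the backslash.

```python
# Hypothetical one-entry tables, for illustration only.
JIS_ROMAN = {0x5C: "\u00A5"}  # YEN SIGN in the Japanese version of ISO 646
US_ASCII = {0x5C: "\u005C"}   # REVERSE SOLIDUS (backslash) in US-ASCII


def decode_byte(byte, table):
    """Decode one 7-bit byte, using the ASCII meaning unless the table overrides it."""
    return table.get(byte, chr(byte))


path = b"C:\x5cTEMP"  # the same bytes, stored without any encoding label

as_jis = "".join(decode_byte(b, JIS_ROMAN) for b in path)
as_ascii = "".join(decode_byte(b, US_ASCII) for b in path)

print(as_jis)    # C:¥TEMP
print(as_ascii)  # C:\TEMP
```

Both results are valid readings of the same bytes; nothing in the data itself says which one was intended, which is why the source encoding has to be tracked separately.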
The Japanese NB should have stopped this nightmare by setting a rule that strongly deprecated the ambiguous variants (and withdrew all past recommendations), so that only one version of SJIS would survive, with old data encoded in ambiguous SJIS versions left in its black box. It would have been simpler and more effective for the Japanese NB to rename the SJIS standard for the only remaining version, for example to "UJIS" ("U" for "Universal", meaning that it has full roundtrip compatibility with the UCS and no longer allows any ambiguity), and then to freeze it completely in that state (all further development being made in the UCS), with a strong recommendation NOT to perform any blind conversion to UJIS, or interpretation as UJIS, of any past data encoded in an unversioned SJIS: all ambiguous characters in this old data should be detected as ambiguous, meaning that the document/data is not convertible without proper versioning.

This would also have forced the various private software makers and manufacturers that had used their own version of SJIS to register anew with the Japanese NB a SINGLE (and unique) recommended string identifying their implementation of SJIS, removing all past known aliases that were also ambiguous with each other, so that the effective encoding of old data could be uniquely identified and would then become uniquely convertible, first to the national standard UJIS, then to the UCS through its guaranteed roundtrip compatibility.
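The "detect instead of converting blindly" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name is hypothetical, only 0x5C is treated as ambiguous, and the lead-byte ranges (0x81-0x9F and 0xE0-0xFC) are the ones commonly used by SJIS implementations, so individual variants may differ. Note that 0x5C can also occur as the second byte of a double-byte character, where it is not ambiguous, so the scan has to skip trail bytes.

```python
def ambiguous_offsets(data):
    """Return the offsets of single-byte 0x5C in unversioned SJIS-like data.

    Assumption: lead bytes are 0x81-0x9F and 0xE0-0xFC, and every lead byte
    is followed by exactly one trail byte.
    """
    offsets = []
    i = 0
    while i < len(data):
        b = data[i]
        if 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC:
            i += 2  # double-byte character: skip the lead and its trail byte
            continue
        if b == 0x5C:
            offsets.append(i)  # yen sign or backslash? cannot tell without versioning
        i += 1
    return offsets


print(ambiguous_offsets(b"C:\x5cdir"))     # [2]
print(ambiguous_offsets(b"\x95\x5c"))      # []  (0x5C is a trail byte here)
print(ambiguous_offsets(b"\x95\x5c\x5c"))  # [2]
```

A converter built on this would refuse, or ask for an explicit version label, whenever the returned list is not empty, rather than silently picking one of the two readings.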