On Mon, Jul 15, 2013 at 05:11:40PM +0900, Tatsuo Ishii wrote:
> > Does support for alternative multi-byte encodings have something to do
> > with the Han unification controversy? I don't know terribly much about
> > this, so apologies if that's just wrong.
>
> There's a famous problem regarding conversion between Unicode and other
> encodings, such as Shift Jis.
>
> There are lots of discussion on this. Here is the one from Microsoft:
>
> http://support.microsoft.com/kb/170559/EN-US

Apart from Shift-JIS not being well defined (it's more a family of
encodings), it has the unusual feature of providing multiple ways to
encode the same character. This is not even a Han unification issue;
those have largely been addressed. For example, the square-root symbol
exists twice (0x8795 and 0x81E3), as do many other mathematical symbols.
Here's the code page, which you can browse online:

http://msdn.microsoft.com/en-us/goglobal/cc305152

This means that to be round-trippable, Unicode would have to double those
characters, but that would make it hard or impossible to round-trip with
any other character set that has those characters. No easy solution
here.

Something that has been done before [1] is to map the doubles to the
private use area of the Unicode space (U+E000-U+F8FF). It gives you
round-trip support at the expense of having to handle those characters
yourself. But since postgres doesn't do anything meaningful with Unicode
characters, this might be acceptable.

[1] Python does a similar trick to handle filenames coming from disk in
an unknown encoding:
http://docs.python.org/3/howto/unicode.html#files-in-an-unknown-encoding

Have a nice day,
-- 
Martijn van Oosterhout <klep...@svana.org> http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.
   -- Arthur Schopenhauer
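
PS: the private-use mapping described above can be sketched in a few
lines of Python, using its cp932 codec as a stand-in for "Shift-JIS".
Everything named here (DOUBLES, PUA_FOR_DOUBLE, the two helper functions,
and the choice of U+F000 as the first stand-in code point) is made up for
illustration; it is not how postgres or any existing converter does it,
and a real table would have to cover every double in the code page.

```python
# Sketch only: all names and the U+F000 base are hypothetical choices.
SQRT_CANDIDATES = (b"\x81\xe3", b"\x87\x95")  # the two byte forms of the square root

# A "double" is a byte sequence that does not survive decode -> encode,
# because the codec picks the other form when encoding.
DOUBLES = [b for b in SQRT_CANDIDATES if b.decode("cp932").encode("cp932") != b]

# Give each double its own private use code point (arbitrary base, assumed free).
PUA_FOR_DOUBLE = {b: chr(0xF000 + n) for n, b in enumerate(DOUBLES)}
DOUBLE_FOR_PUA = {ch: b for b, ch in PUA_FOR_DOUBLE.items()}


def decode_roundtrip(data: bytes) -> str:
    """Decode CP932 bytes, sending known doubles to private use code points."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        # Lead bytes 0x81-0x9F and 0xE0-0xFC start a two-byte character.
        width = 2 if (0x81 <= lead <= 0x9F or 0xE0 <= lead <= 0xFC) else 1
        chunk = data[i:i + width]
        out.append(PUA_FOR_DOUBLE.get(chunk) or chunk.decode("cp932"))
        i += width
    return "".join(out)


def encode_roundtrip(text: str) -> bytes:
    """Inverse of decode_roundtrip: private use stand-ins become the double again."""
    return b"".join(DOUBLE_FOR_PUA.get(ch) or ch.encode("cp932") for ch in text)


data = b"x=" + SQRT_CANDIDATES[1] + SQRT_CANDIDATES[0]
plain = data.decode("cp932")      # both forms collapse to U+221A
mapped = decode_roundtrip(data)   # one of them becomes a private use character
```

With the plain codec, plain.encode("cp932") cannot give back data, since
both square roots became the same character. With the stand-in,
encode_roundtrip(mapped) does, but anything that wants to treat the
stand-in as a square root (display, comparison, collation) has to know
about the private table.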