Re: [HACKERS] Patch for collation using ICU

John Hansen Mon, 09 May 2005 14:06:43 -0700

Tatsuo Ishii wrote:
> Sent: Tuesday, May 10, 2005 12:32 AM
> To: John Hansen
> Cc: [email protected]; [EMAIL PROTECTED]; 
> [email protected]
> Subject: Re: [HACKERS] Patch for collation using ICU
> 
> > > -----Original Message-----
> > > From: Tatsuo Ishii [mailto:[EMAIL PROTECTED]
> > > Sent: Sunday, May 08, 2005 11:08 PM
> > > To: John Hansen
> > > Cc: [email protected]; [EMAIL PROTECTED]; 
> > > [email protected]
> > > Subject: Re: [HACKERS] Patch for collation using ICU
> > > 
> > > > > I don't buy it. If current conversion tables does the right thing,
> > > > > why we need to replace. Or if conversion tables are not correct, why
> > > > > don't you fix it? I think the rule of character conversion will not
> > > > > change frequently, especially for LATIN languages. Thus maintaining
> > > > > cost is not too high.
> > > > 
> > > > I never said we need to, but if we're going to implement ICU, then we
> > > > might as well go all the way.
> > > 
> > > So you admit there's no benefit using ICU for replacing existing 
> > > conversions?
> > > 
> > > Besides ICU does not support all existing conversions, I think ICU
> > > has serious flaw for using conversion. If I understand correctly,
> > > ICU uses UNICODE internally to do the conversion. For example, to
> > > implement SJIS->EUC_JP conversion, ICU first converts SJIS to UNICODE
> > > then converts UNICODE to EUC_JP. Problem is these conversion is not
> > > round trip (conversion between SJIS/EUC_JP and UNICODE will lose some
> > > information). Thus SJIS->EUC_JP->SJIS conversion using ICU does not
> > > preserve original text.
> > 
> > Just for the record, I fetched a web page encoded in sjis, and
> > converted it to euc-jp and back using uconv from ICU 3.2, and the
> > result is identical to the original file.
> > 
> >  uconv -f Shift_JIS -t EUC-JP -o index.html.euc index.html
> >  uconv -f EUC-JP -t Shift_JIS -o index.html.sjis index.html.euc
> >  diff index.html index.html.sjis
> 
> Not all SJIS/EUC_JP characters have the problem. You might want to
> try: Shift_JIS 0x81e6, 0x879a, 0xfa5b.
> 
> BTW, I got this with ICU 3.2:
> 
> $ uconv -f EUC_JP -t Shift_JIS /tmp/a.txt -o /tmp/b.txt
> Conversion from Unicode to codepage failed at input byte position 0. Unicode: 301c Error: Invalid character found
> 
> The contents of a.txt is 0xa1c1 which is a valid EUC_JP character.


That actually makes perfect sense, since according to unicode.org's
database:
301C ~ WAVE DASH
       This character was encoded to match JIS C 6226-1978 1-33 "wave dash".
       The JIS standards and some industry practice disagree in mapping.
         - 3030 wavy dash
         - FF5E fullwidth tilde

PG currently maps this character to FF5E. That is wrong according to the
standards, as FF5E is only a 'similar character'.

Unfortunately, there is no mapping from 301C to shift_jis, as shift_jis
doesn't define "WAVE DASH".
In all, I believe this behaviour to be correct according to the
standards.

There'd be nothing to stop us from defining alternative mappings for the
cases where we deviate from the standard, but the question is, should we
be non-standard?
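
To make the trade-off concrete, a minimal sketch of such an alternative
mapping, again in Python with its own cp932 codec standing in for the
converter. FALLBACKS and encode_with_fallbacks are hypothetical names made
up for this example; nothing like them exists in PG or ICU as far as this
sketch is concerned.

```python
# Hypothetical override table: code points the target table lacks,
# mapped to the look-alike character that it does have.
FALLBACKS = {0x301C: 0xFF5E}  # WAVE DASH -> FULLWIDTH TILDE


def encode_with_fallbacks(text, codec="cp932"):
    """Substitute the listed look-alikes, then encode. This is lossy."""
    return text.translate(FALLBACKS).encode(codec)


sjis = encode_with_fallbacks("\u301c")
print(sjis.hex())

# Decoding the result does not give back the original code point.
back = sjis.decode("cp932")
print(f"round trip: U+301C -> {sjis.hex()} -> U+{ord(back):04X}")
```

The conversion no longer fails, but the text that comes back is FF5E rather
than 301C, which is exactly the non-standard behaviour in question.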

> 
> This makes me nervous in using ICU...
> --
> Tatsuo Ishii
> 
> 

... John
