Re: [HACKERS] Multibyte support in oracle_compat.c

2002-09-05 Thread Tatsuo Ishii

> GNU gettext does its own encoding conversion.  It reads the program's
> character encoding from the LC_CTYPE locale and converts the material in
> the translation catalogs on the fly for output.  This is great in general,
> really, but for the postmaster it's a problem.  If LC_CTYPE is fixed for
> the cluster and you later on change your mind about the message language
> then it will be recoded into the character set that LC_CTYPE says.  And if
> that character set does not match the one that is set as the backend
> encoding internally then who knows what will happen when this stuff is
> recoded again as it's sent to the client.  Big, big mess.

So in other words, it's completely broken. Sigh.
--
Tatsuo Ishii

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html



Re: [HACKERS] Multibyte support in oracle_compat.c

2002-09-05 Thread Peter Eisentraut

Tatsuo Ishii writes:

> BTW, NLS has the same problem as above, no? I guess NLS depends on the
> locale, and it may conflict with the database-specific encoding and/or
> the automatic FE/BE encoding conversion.

GNU gettext does its own encoding conversion.  It reads the program's
character encoding from the LC_CTYPE locale and converts the material in
the translation catalogs on the fly for output.  This is great in general,
really, but for the postmaster it's a problem.  If LC_CTYPE is fixed for
the cluster and you later on change your mind about the message language
then it will be recoded into the character set that LC_CTYPE says.  And if
that character set does not match the one that is set as the backend
encoding internally then who knows what will happen when this stuff is
recoded again as it's sent to the client.  Big, big mess.

-- 
Peter Eisentraut   [EMAIL PROTECTED]





Re: [HACKERS] Multibyte support in oracle_compat.c

2002-09-04 Thread Tatsuo Ishii

> The backend routines use the host OS locales, so look there.  On my
> machine I have several Russian locales, which seem to address the issue of
> character sets:
> 
> ru_RU
> ru_RU.koi8r
> ru_RU.utf8
> ru_UA
> russian
> 
> This is bogus, because the LC_CTYPE choice is cluster-wide and the
> encoding choice is database-specific (in other words: it's broken), but
> there's nothing we can do about that right now.

I thought his idea was to use a UTF-8 locale and a Unicode (UTF-8) encoded
database.

> Btw., I just happened to think about this very issue over the last few
> days.  What I would like to attack for the next release is to implement
> character classification and conversion using the Unicode tables so we can
> cut the LC_CTYPE system locale out of the picture.  Perhaps this is what
> the poster was thinking of, too.

Interesting idea. If you are saying that you are going to remove the
dependency on the system locale, I agree with your idea.

BTW, NLS has the same problem as above, no? I guess NLS depends on the
locale, and it may conflict with the database-specific encoding and/or
the automatic FE/BE encoding conversion.
--
Tatsuo Ishii




Re: [HACKERS] Multibyte support in oracle_compat.c

2002-09-04 Thread Serguei A. Mokhov



On Thu, 5 Sep 2002, Peter Eisentraut wrote:

> Date: Thu, 5 Sep 2002 00:46:39 +0200 (CEST)
> From: Peter Eisentraut <[EMAIL PROTECTED]>
> To: Tatsuo Ishii <[EMAIL PROTECTED]>
> Cc: [EMAIL PROTECTED], [EMAIL PROTECTED]
> Subject: Re: [HACKERS] Multibyte support in oracle_compat.c
>
> Tatsuo Ishii writes:
>
> > > The functions upper, lower and initcap don't work with UTF-8 data
>
> The backend routines use the host OS locales, so look there.  On my
> machine I have several Russian locales, which seem to address the issue of
> character sets:
>
> ru_RU
> ru_RU.koi8r
> ru_RU.utf8
> ru_UA
> russian

Yeah, our character sets are a major pain for internationalization. And the
above list is not exhaustive. I guess you are right, for the time being
you'll have to bear with it.

-s





Re: [HACKERS] Multibyte support in oracle_compat.c

2002-09-04 Thread Peter Eisentraut

Tatsuo Ishii writes:

> > The functions upper, lower and initcap don't work with UTF-8 data

The backend routines use the host OS locales, so look there.  On my
machine I have several Russian locales, which seem to address the issue of
character sets:

ru_RU
ru_RU.koi8r
ru_RU.utf8
ru_UA
russian

This is bogus, because the LC_CTYPE choice is cluster-wide and the
encoding choice is database-specific (in other words: it's broken), but
there's nothing we can do about that right now.
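
To illustrate the dependency described above, here is a minimal C sketch
of locale-driven case conversion. It is not the code in oracle_compat.c;
the name mb_upper_sketch is hypothetical. It uses only the standard
mbstowcs, towupper and wcstombs, all of which follow the active LC_CTYPE
locale, so Cyrillic input would be expected to convert only after a
matching locale (for example ru_RU.utf8) has been selected with setlocale.

```c
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper, not taken from oracle_compat.c: upper-case a
 * multibyte string using whatever LC_CTYPE locale is currently active.
 * Returns a malloc'ed string, or NULL if the input is not valid in the
 * current locale or memory runs out.
 */
char *
mb_upper_sketch(const char *src)
{
    size_t      nchars;
    size_t      nbytes;
    size_t      i;
    wchar_t    *wide;
    char       *result;

    /* Count the characters; this fails if src is invalid in this locale. */
    nchars = mbstowcs(NULL, src, 0);
    if (nchars == (size_t) -1)
        return NULL;

    wide = malloc((nchars + 1) * sizeof(wchar_t));
    if (wide == NULL)
        return NULL;
    mbstowcs(wide, src, nchars + 1);

    /* towupper consults LC_CTYPE, so the result depends on the locale. */
    for (i = 0; i < nchars; i++)
        wide[i] = (wchar_t) towupper((wint_t) wide[i]);

    /* Measure, then convert back to the locale's multibyte encoding. */
    nbytes = wcstombs(NULL, wide, 0);
    if (nbytes == (size_t) -1)
    {
        free(wide);
        return NULL;
    }
    result = malloc(nbytes + 1);
    if (result != NULL)
        wcstombs(result, wide, nbytes + 1);
    free(wide);
    return result;
}
```

In the default "C" locale this handles plain ASCII only. What happens to
non-ASCII bytes there is platform-dependent, which is exactly why the
cluster-wide LC_CTYPE and the per-database encoding have to agree.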

> > P.S. It doesn't seem bad to me to use libunicode instead of functions
> > like mbtowc and wctomb from stdlib and towupper and towlower from wctype
>
> I'm not sure. What do you think, Peter, or anyone else who is familiar
> with Unicode?

I don't know what libunicode is, but that shouldn't prevent us from
possibly evaluating it. :-)

Btw., I just happened to think about this very issue over the last few
days.  What I would like to attack for the next release is to implement
character classification and conversion using the Unicode tables so we can
cut the LC_CTYPE system locale out of the picture.  Perhaps this is what
the poster was thinking of, too.
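
A table-driven conversion of that kind could look roughly like the sketch
below. Everything in it is an assumption made for illustration: the names
case_range and unicode_toupper_sketch are hypothetical, and only three
ranges are listed, whereas a real table would be generated from the
Unicode Character Database. The point is that the mapping works on code
points and never consults LC_CTYPE.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch: map a Unicode code point to upper case without
 * consulting the system locale. Only three ranges are covered here.
 */
typedef struct
{
    uint32_t    first;      /* first lower-case code point in the range */
    uint32_t    last;       /* last lower-case code point in the range */
    uint32_t    delta;      /* subtract this to reach the upper-case form */
} case_range;

static const case_range upper_ranges[] = {
    {0x0061, 0x007A, 0x20},     /* Basic Latin small a through z */
    {0x0430, 0x044F, 0x20},     /* Cyrillic small a through ya */
    {0x0450, 0x045F, 0x50},     /* Cyrillic small ie-grave through dzhe */
};

uint32_t
unicode_toupper_sketch(uint32_t cp)
{
    size_t      i;

    for (i = 0; i < sizeof(upper_ranges) / sizeof(upper_ranges[0]); i++)
    {
        if (cp >= upper_ranges[i].first && cp <= upper_ranges[i].last)
            return cp - upper_ranges[i].delta;
    }

    /* Not in the table: return the code point unchanged. */
    return cp;
}
```

A caller would still have to decode the database encoding into code
points first and encode the result afterwards; that step is left out here.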

-- 
Peter Eisentraut   [EMAIL PROTECTED]





Re: [HACKERS] Multibyte support in oracle_compat.c

2002-09-04 Thread Tatsuo Ishii

> I found a bug in src/backend/utils/adt/oracle_compat.c, and your name
> was there in connection with the multibyte enhancement, so I am writing
> to you.
> The functions upper, lower and initcap don't work with UTF-8 data that
> contains non-Latin letters. At work I build databases for Russian users,
> and when I tried to use the Unicode encoding for a database with the
> Russian alphabet, these functions didn't work. So I wrote some patches,
> because I don't think the problem lies in a shell variable such as LANG
> or LC_CTYPE. As I don't know any languages other than Russian and
> English, I wrote a small test (test.tar.gz) only for them. Run it before
> and after patching and feel the difference :). And by the way, do the
> encodings EUC_JP, EUC_CN, EUC_KR and EUC_TW (and the corresponding
> languages) have the upper, lower and initcap operations?
>   regards, Eugene.

For EUC_JP, there is no upper, lower or initcap. I'm not sure about
other languages.

> P.S. It doesn't seem bad to me to use libunicode instead of functions
> like mbtowc and wctomb from stdlib and towupper and towlower from wctype,
> but maybe somebody will find a solution based on them or on another
> library?

I'm not sure. What do you think, Peter, or anyone else who is familiar
with Unicode?

BTW, I don't like your patches. If there's no unicode.h, configure
aborts with:

configure: error: header file  is required for unicode support

which does not seem acceptable to me. I suggest you #ifdef out the Unicode
upper, lower and initcap support if libunicode and/or unicode.h are not
found on the system.
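
A rough sketch of that #ifdef arrangement follows. HAVE_LIBUNICODE is a
hypothetical symbol that configure would define only when both the library
and the header were found, and unicode_upper_hypothetical merely stands in
for whatever the patch would call; libunicode's real API is not shown in
this thread, so no actual library calls are used.

```c
#include <ctype.h>

#ifdef HAVE_LIBUNICODE
/* Pulled in only when configure found it, so a missing header is not fatal. */
#include <unicode.h>
/* Hypothetical stand-in for the Unicode-aware routine the patch would add. */
int unicode_upper_hypothetical(char *s);
#endif

/*
 * Upper-case a string in place. Returns 0 from the fallback path; the
 * return value of the Unicode path is whatever the stand-in reports.
 */
int
upper_in_place_sketch(char *s)
{
#ifdef HAVE_LIBUNICODE
    return unicode_upper_hypothetical(s);
#else
    /* Fallback: byte-wise, locale-driven conversion via toupper. */
    for (; *s != '\0'; s++)
        *s = (char) toupper((unsigned char) *s);
    return 0;
#endif
}
```

Built without HAVE_LIBUNICODE, this compiles and behaves like a plain
single-byte upper, which is the degradation being asked for instead of a
configure abort.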
--
Tatsuo Ishii

(I have included the patches for review purposes.)



patches.tar.gz
Description: Binary data


test.tar.gz
Description: Binary data

