Re: [HACKERS] A rough roadmap for internationalization fixes

2003-11-26 Thread Zeugswetter Andreas SB SD

  There are no such libraries.  I keep hearing ICU, but that is much too
  bloated.
 
 At least it is something of a standard and will be maintained for the
 foreseeable future; it also has a compatible license and is available
 on all platforms of interest to PostgreSQL.

And it is used by DB2 and Informix, so it must be quite feature-complete
for the database-relevant functionality.

Andreas

---(end of broadcast)---
TIP 8: explain analyze is your friend


Re: [HACKERS] A rough roadmap for internationalization fixes

2003-11-26 Thread Kurt Roeckx
On Tue, Nov 25, 2003 at 04:19:05PM -0500, Tom Lane wrote:
 
 UCS-2 is impractical without some *extremely* wide-ranging changes in
 the backend.  To take just the most obvious point, doesn't it require
 allowing embedded zero bytes in text strings?

If you're going to use unicode in the rest of the backend, you'll
have to be able to deal with them anyway.  You can't use normal C
string functions.
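
Kurt's point is easy to demonstrate. A quick Python sketch (UTF-16-LE
standing in for UCS-2, since for the BMP characters involved the two
encodings coincide):

```python
# ASCII characters encoded in UCS-2/UTF-16 contain embedded zero bytes,
# so NUL-terminated C string handling would truncate the text.
data = "SELECT".encode("utf-16-le")   # UTF-16-LE standing in for UCS-2
assert b"\x00" in data                # embedded zero bytes are present
# A strlen()-style scan stops at the first zero byte:
print(data.index(0))                  # -> 1: only the leading 'S' survives
```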


Kurt




Re: [HACKERS] A rough roadmap for internationalization fixes

2003-11-25 Thread Tatsuo Ishii
 OK, I've been spreading rumours about fixing the internationalization
 problems, so let me make it a bit more clear.  Here are the problems that
 need to be fixed:
 
 - Only one locale per process possible.
 
 - Only one gettext-language per process possible.
 
 - lc_collate and lc_ctype need to be held fixed in the entire cluster.
 
 - Gettext relies on iconv character set conversion, which relies on
   lc_ctype, which leads to a complete screw-up in the server because of
   the previous item.
 
 - Locale fixed per cluster but encoding fixed per database; the two are
   unaware of each other and don't get along.
 
 - No support for upper/lower with multibyte encoding.
 
 - Implementation of Unicode horribly incomplete.
 
 These are all dependent on each other and sort of flow into each other.
 
 Here is a proposed ordering of steps toward improving the situation:
 
 1. Take out the character set conversion routines from the backend and
 make them a library of their own.  This could possibly be modelled after
 iconv, but not necessarily.  Or we might conclude that we can just use
 iconv in the first place.

How do you handle user-defined conversions?

 2. Reimplement gettext to use 1. and allow switching of language and
 encoding at run-time.
 
 3. Implement Unicode collation algorithm and character classification
 routines that are aware of 1.  Use that in place of system locale
 routines.

I don't see the relationship between Unicode and the routines you are
going to replace the system locale routines with. If you are heading in
the direction of a Unicode-centric implementation, I will object.

 4. Allow choice of locale per database.  (This should be fairly easy after
 3.)
 
 5. Allow choice of locale per column and implement collation coercion
 according to SQL standard.
 
 This could easily take a long time, but I feel that even if we have to
 stop after 2., 3., or 4. at feature freeze, we'd be a lot farther.
 
 Comments?  Anything else that needs fixing?
 
 -- 
 Peter Eisentraut   [EMAIL PROTECTED]
 
 


Re: [HACKERS] A rough roadmap for internationalization fixes

2003-11-25 Thread Dennis Bjorklund
On Mon, 24 Nov 2003, Peter Eisentraut wrote:

 1. Take out the character set conversion routines from the backend and
 make them a library of their own.  This could possibly be modelled after
 iconv, but not necessarily.  Or we might conclude that we can just use
 iconv in the first place.
 
 2. Reimplement gettext to use 1. and allow switching of language and
 encoding at run-time.

Force all translations to be in unicode and convert to other client
encodings if needed. There is no need to support translations stored using
different encodings.

 3. Implement Unicode collation algorithm and character classification
 routines that are aware of 1.  Use that in place of system locale
 routines.

Couldn't we use some library that already has this, like glib (or
something else)? If it's not up to what we need, then fix that library
instead.

--
/Dennis




Re: [HACKERS] A rough roadmap for internationalization fixes

2003-11-25 Thread Peter Eisentraut
Tatsuo Ishii writes:

  3. Implement Unicode collation algorithm and character classification
  routines that are aware of 1.  Use that in place of system locale
  routines.

 I don't see the relationship between Unicode and the routines you are
 going to replace the system locale routines with. If you are heading in
 the direction of a Unicode-centric implementation, I will object.

The Unicode collation algorithm works for any character set, not only for
Unicode.  It just happens to be published by the Unicode consortium.  So
basically this is just a concrete alternative to making up our own out of
thin air.  Also, the Unicode collation algorithm gives us the flexibility
to define customizations of collations that users frequently want, such as
ignoring or not ignoring punctuation.

Actually, what will more likely happen is that we'll define a collation as
a collection of one or more support functions, the equivalents of
strxfrm() and possibly a few more.  Then it will be up to those functions
to define the collation order.  The server will provide utility functions
that will facilitate implementing a collation order that follows the
Unicode collation algorithm, but you could just as well implement one
using memcmp() or whatever you like.
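
What such a support-function-based collation might look like, as a rough
Python sketch. The registry and the collation names here are invented for
illustration only; they are not the proposed backend API:

```python
# Hypothetical registry: a collation is just a key function, the
# equivalent of strxfrm().  A "binary" collation compares raw encoded
# bytes (memcmp-style); "ci" is an example of a customized collation.
COLLATIONS = {
    "binary": lambda s: s.encode("utf-8"),
    "ci": lambda s: s.casefold(),
}

def sort_with(collation, strings):
    # Look up the collation's key function and sort with it.
    return sorted(strings, key=COLLATIONS[collation])

print(sort_with("binary", ["B", "a"]))  # ['B', 'a'] -- bytewise order
print(sort_with("ci", ["B", "a"]))      # ['a', 'B'] -- case-insensitive
```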

-- 
Peter Eisentraut   [EMAIL PROTECTED]




Re: [HACKERS] A rough roadmap for internationalization fixes

2003-11-25 Thread Peter Eisentraut
Dennis Bjorklund writes:

 Force all translations to be in unicode and convert to other client
 encodings if needed. There is no need to support translations stored using
 different encodings.

Tell that to the Japanese.

 Couldn't we use some library that already has this, like glib (or
 something else)? If it's not up to what we need, then fix that library
 instead.

I wasn't aware that glib had this.  I'll look.

-- 
Peter Eisentraut   [EMAIL PROTECTED]




Re: [HACKERS] A rough roadmap for internationalization fixes

2003-11-25 Thread Zeugswetter Andreas SB SD
Have you looked at what is available from 
http://oss.software.ibm.com/icu/ ?

Seems they have a compatible license, but use some C++.

Andreas



Re: [HACKERS] A rough roadmap for internationalization fixes

2003-11-25 Thread Dennis Bjorklund
On Tue, 25 Nov 2003, Peter Eisentraut wrote:

  Force all translations to be in unicode and convert to other client
  encodings if needed. There is no need to support translations stored using
  different encodings.
 
 Tell that to the Japanese.

I've always thought Unicode was enough even to represent Japanese. Then
the client encoding can be something else that we can convert to. In any
case, the encoding of the message catalog has to be known to the system
so it can be converted to the correct encoding for the client.

  Couldn't we use some library that already has this, like glib (or
  something else)? If it's not up to what we need, then fix that library
  instead.
 
 I wasn't aware that glib had this.  I'll look.

And I don't really know what demands pg has, but glib has a lot of
support functions for UTF-8. At the very least we should take a look at it.

-- 
/Dennis




Re: [HACKERS] A rough roadmap for internationalization fixes

2003-11-25 Thread Tatsuo Ishii
 On Tue, 25 Nov 2003, Peter Eisentraut wrote:
 
   Force all translations to be in unicode and convert to other client
   encodings if needed. There is no need to support translations stored using
   different encodings.
  
  Tell that to the Japanese.
 
 I've always thought Unicode was enough even to represent Japanese. Then
 the client encoding can be something else that we can convert to. In any
 case, the encoding of the message catalog has to be known to the system
 so it can be converted to the correct encoding for the client.

I'm tired of repeating that Unicode is not perfect. Another gotcha
with Unicode is that the UTF-8 encoding (which we currently use) consumes
3 bytes for each Kanji character, while other encodings consume only 2
bytes. IMO a 3/2 storage ratio cannot be neglected for database use.
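
The size difference is easy to verify. A Python sketch, with EUC-JP
standing in for the 2-byte legacy encodings Tatsuo refers to:

```python
kanji = "\u6f22\u5b57"  # the word "kanji" itself: two CJK characters

# UTF-8 needs 3 bytes per BMP character above U+07FF; EUC-JP needs 2
# bytes per JIS X 0208 character -- hence the 3/2 storage ratio.
print(len(kanji.encode("utf-8")))   # -> 6 (3 bytes per character)
print(len(kanji.encode("euc-jp")))  # -> 4 (2 bytes per character)
```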
--
Tatsuo Ishii



Re: [HACKERS] A rough roadmap for internationalization fixes

2003-11-25 Thread Dennis Bjorklund
On Tue, 25 Nov 2003, Tatsuo Ishii wrote:

 I'm tired of repeating that Unicode is not perfect. Another gotcha
 with Unicode is that the UTF-8 encoding (which we currently use) consumes
 3 bytes for each Kanji character, while other encodings consume only 2
 bytes. IMO a 3/2 storage ratio cannot be neglected for database use.

I'm aware of how utf-8 works and I was talking about the message 
cataloges. It does not affect what you store in the database in any way.

-- 
/Dennis




Re: [HACKERS] A rough roadmap for internationalization fixes

2003-11-25 Thread Dennis Bjorklund
On Tue, 25 Nov 2003, Tatsuo Ishii wrote:

 I'm tired of repeating that Unicode is not perfect. Another gotcha
 with Unicode is that the UTF-8 encoding (which we currently use) consumes
 3 bytes for each Kanji character, while other encodings consume only 2
 bytes. IMO a 3/2 storage ratio cannot be neglected for database use.

The rest of the world seems to have settled on Unicode as the way to
handle different languages in the UI of programs. For example, GNOME
supports nothing but Unicode. How is that handled in your country? I know
that you are tired of people who don't understand how difficult it is for
you, but I really would like to know. Is GNOME not used over there
because of this?

About storing data in the database, I would expect it to work with any
encoding, just like I would expect pg to be able to store images in any
format.

I'll try not to mention Unicode near you in the future :-)

-- 
/Dennis




Re: [HACKERS] A rough roadmap for internationalization fixes

2003-11-25 Thread Tom Lane
Peter Eisentraut [EMAIL PROTECTED] writes:
 Dennis Bjorklund writes:
 Couldn't we use some library that already has this, like glib (or
 something else)? If it's not up to what we need, then fix that library
 instead.

 I wasn't aware that glib had this.  I'll look.

Of course the trouble with relying on glibc is that we'd have no solution
for platforms that don't use glibc.

It might be okay to rely on glibc for a first-cut implementation,
realizing that we couldn't do everything at once anyway.

regards, tom lane



Re: [HACKERS] A rough roadmap for internationalization fixes

2003-11-25 Thread Tom Lane
Peter Eisentraut [EMAIL PROTECTED] writes:
 Actually, what will more likely happen is that we'll define a collation as
 a collection of one or more support functions, the equivalents of
 strxfrm() and possibly a few more.  Then it will be up to those functions
 to define the collation order.  The server will provide utility functions
 that will facilitate implementing a collation order that follows the
 Unicode collation algorithm, but you could just as well implement one
 using memcmp() or whatever you like.

That sounds like a good plan to me.  Personally I'd want a
memcmp()-based collation implementation available, so that people who
don't care about sorting anything beyond 7-bit ASCII don't need to pay
a lot of overhead.

We have seen over and over that strcoll() is depressingly slow in some
locales (at least on some platforms).  Do you have any feeling for the
real-world performance of the Unicode algorithm?

regards, tom lane



Re: [HACKERS] A rough roadmap for internationalization fixes

2003-11-25 Thread Doug McNaught
Tom Lane [EMAIL PROTECTED] writes:

 Peter Eisentraut [EMAIL PROTECTED] writes:
 
  I wasn't aware that glib had this.  I'll look.
 
 Of course the trouble with relying on glibc is that we'd have no solution
 for platforms that don't use glibc.

glib != glibc.  glib is the low-level library used by GTK and GNOME
for basic data structures, character handling, etc.  It's LGPL AFAIK,
which would seem to rule out direct use from a licensing perspective.

-Doug



Re: [HACKERS] A rough roadmap for internationalization fixes

2003-11-25 Thread Hannu Krosing
Dennis Bjorklund kirjutas T, 25.11.2003 kell 14:51:
 On Tue, 25 Nov 2003, Tatsuo Ishii wrote:
 
  I'm tired of telling that Unicode is not that perfect. 

Of course not, but neither is the current multibyte support, which has
only marginal support for Unicode (many people actually need
upper()/lower()).

 Another gotcha
  with Unicode is that the UTF-8 encoding (which we currently use) consumes
  3 bytes for each Kanji character, while other encodings consume only 2
  bytes.

I think that for *storage* we should use SCSU (the Standard Compression
Scheme for Unicode).

  IMO a 3/2 storage ratio cannot be neglected for database use.

SCSU should solve that (it should actually encode text in any single
language in less than 2 bytes per character).

 The rest of the world seems to have settled on Unicode as the way to
 handle different languages in the UI of programs. For example, GNOME
 supports nothing but Unicode. How is that handled in your country? I know
 that you are tired of people who don't understand how difficult it is for
 you, but I really would like to know. Is GNOME not used over there
 because of this?
 
 About storing data in the database, I would expect it to work with any
 encoding, just like I would expect pg to be able to store images in any
 format.
 
 I'll try not to mention Unicode near you in the future :-)

---
Hannu








Re: [HACKERS] A rough roadmap for internationalization fixes

2003-11-25 Thread Greg Stark

Peter Eisentraut [EMAIL PROTECTED] writes:

 2. Reimplement gettext to use 1. and allow switching of language and
 encoding at run-time.
 
 3. Implement Unicode collation algorithm and character classification
 routines that are aware of 1.  Use that in place of system locale
 routines.

This sounds like you want to completely reimplement all of the locale handling
provided by the OS? That seems like a dead-end approach to me. There's no way
your handling will ever be as complete or as well optimized as some OS's.

Better to find ways to use the OS gettext and locale handling on platforms
that provide good interfaces. On platforms that don't provide good interfaces
either don't support the features or use some third party library to provide
a good implementation.

The only thing you really need in the database is a second parameter on all
the collation functions like strxfrm(col,locale) etc. Then functional indexes
take care of almost everything.
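
A Python sketch of the two-parameter interface Greg describes. Note that
the standard strxfrm() takes no locale argument, which is exactly the
gap being discussed; the strxfrm_l() below is a hypothetical stand-in
implemented by temporarily switching LC_COLLATE:

```python
import locale

def strxfrm_l(s, loc):
    """Hypothetical two-argument strxfrm: transform s into a sort key
    under collation locale loc.  Sketched here by switching LC_COLLATE
    around the call; a real backend would want a reentrant interface."""
    old = locale.setlocale(locale.LC_COLLATE)   # query current setting
    try:
        locale.setlocale(locale.LC_COLLATE, loc)
        return locale.strxfrm(s)
    finally:
        locale.setlocale(locale.LC_COLLATE, old)

# Under the always-available "C" locale this reduces to a bytewise key,
# so a functional index on strxfrm_l(col, loc) would give locale-aware
# ordering with plain key comparisons:
words = ["pear", "Apple", "apple"]
print(sorted(words, key=lambda w: strxfrm_l(w, "C")))
# -> ['Apple', 'apple', 'pear']
```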

The only advantage to adding locales per-column and/or per-index is the
notational simplicity. Queries could do simple standard expressions and not
have to worry about calling strxfrm or other locale-specific functions all the
time. I'm not sure it's worth the complexity of having to deal with
WHERE x < y where x and y are in different locales, though.


-- 
greg




Re: [HACKERS] A rough roadmap for internationalization fixes

2003-11-25 Thread Kurt Roeckx
On Tue, Nov 25, 2003 at 08:40:57PM +0900, Tatsuo Ishii wrote:
  On Tue, 25 Nov 2003, Peter Eisentraut wrote:
  
  I've always thought Unicode was enough even to represent Japanese. Then
  the client encoding can be something else that we can convert to. In any
  case, the encoding of the message catalog has to be known to the system
  so it can be converted to the correct encoding for the client.
 
 I'm tired of repeating that Unicode is not perfect.

Maybe it should be explained what the problems really are,
instead of saying it isn't perfect?

From what I understand, there is only a problem converting from a
legacy encoding to Unicode and the other way around, and no problem if
you stop doing the conversion.

The conversion problem is that a single character in a legacy encoding
can correspond to several different characters in Unicode.

Some examples people might understand are:
- µ: in ISO 8859-1 it's 0xB5; in Unicode it can be U+00B5 (MICRO SIGN)
  or U+03BC (GREEK SMALL LETTER MU)
- Å: in ISO 8859-1 it's 0xC5; in Unicode it can be U+00C5 (LATIN CAPITAL
  LETTER A WITH RING ABOVE) or U+212B (ANGSTROM SIGN)
- The ohm sign vs. the Greek letter omega.
- Quotation marks: there are left and right double quotes, and a few
  others.
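
These one-to-many mappings are visible in Unicode's own normalization
data. A Python sketch using the stdlib unicodedata module:

```python
import unicodedata

# U+212B ANGSTROM SIGN canonically decomposes to U+00C5, so the two
# code points are distinct but NFC-equivalent -- a converter from a
# legacy encoding has no way to know which one was "really" meant:
assert unicodedata.normalize("NFC", "\u212b") == "\u00c5"

# U+00B5 MICRO SIGN is compatibility-equivalent to U+03BC GREEK SMALL
# LETTER MU:
assert unicodedata.normalize("NFKC", "\u00b5") == "\u03bc"
print("both ambiguities confirmed")
```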

 Another gotcha with Unicode is that the UTF-8 encoding (which we
 currently use) consumes 3 bytes for each Kanji character, while other
 encodings consume only 2 bytes. IMO a 3/2 storage ratio cannot be
 neglected for database use.

You can encode unicode in different ways, and UTF-8 is only one
of them.  Is there a problem with using UCS-2 (except that it
would require more storage for ASCII)?


Kurt




Re: [HACKERS] A rough roadmap for internationalization fixes

2003-11-25 Thread Peter Eisentraut
Greg Stark writes:

 This sounds like you want to completely reimplement all of the locale handling
 provided by the OS? That seems like a dead-end approach to me. There's no way
 your handling will ever be as complete or as well optimized as some OS's.

Actually, I'm pretty sure it will be more complete.  About the
optimization, we'll have to see.

 Better to find ways to use the OS gettext and locale handling on platforms
 that provide good interfaces.

There are no such platforms to my knowledge.  The exception is some
version of glibc that provides undocumented interfaces to functionality
that is rumoured to do something that may or may not be relevant to what
we're doing.

 On platforms that don't provide good interfaces either don't support the
 features or use some third party library to provide a good
 implementation.

There are no such libraries.  I keep hearing ICU, but that is much too
bloated.

-- 
Peter Eisentraut   [EMAIL PROTECTED]




Re: [HACKERS] A rough roadmap for internationalization fixes

2003-11-25 Thread Hannu Krosing
Peter Eisentraut kirjutas T, 25.11.2003 kell 21:13:
 Greg Stark writes:
 
  This sounds like you want to completely reimplement all of the locale handling
  provided by the OS? That seems like a dead-end approach to me. There's no way
  your handling will ever be as complete or as well optimized as some OS's.
 
 Actually, I'm pretty sure it will be more complete.  About the
 optimization, we'll have to see.
 
  Better to find ways to use the OS gettext and locale handling on platforms
  that provide good interfaces.
 
 There are no such platforms to my knowledge. 

Unless you consider ICU (http://oss.software.ibm.com/icu/) as a
platform ;)

We will hardly ever be more complete than it.

 There are no such libraries.  I keep hearing ICU, but that is much too
 bloated.

 At least it is something of a standard and will be maintained for the
 foreseeable future; it also has a compatible license and is available
 on all platforms of interest to PostgreSQL.

And I am not sure that this bloat will affect us too much unless we
want to start maintaining a parallel copy; glibc is much more bloated
than ICU.

But if you insist on rolling your own library, you can always use ICU to
write regression tests to compare yours with ...

-
Hannu




Re: [HACKERS] A rough roadmap for internationalization fixes

2003-11-25 Thread Tom Lane
Kurt Roeckx [EMAIL PROTECTED] writes:
 You can encode unicode in different ways, and UTF-8 is only one
 of them.  Is there a problem with using UCS-2 (except that it
 would require more storage for ASCII)?

UCS-2 is impractical without some *extremely* wide-ranging changes in
the backend.  To take just the most obvious point, doesn't it require
allowing embedded zero bytes in text strings?

regards, tom lane



Re: [HACKERS] A rough roadmap for internationalization fixes

2003-11-25 Thread Christopher Kings-Lynne
 About storing data in the database, I would expect it to work with any
 encoding, just like I would expect pg to be able to store images in any
 format.

What's stopping us from supporting the other Unicode encodings, e.g.
UTF-16, which could save Japanese storage space?

Chris





Re: [HACKERS] A rough roadmap for internationalization fixes

2003-11-25 Thread Tom Lane
Greg Stark [EMAIL PROTECTED] writes:
 The only advantage to adding locales per-column and/or per-index is the
 notational simplicity.

Well, actually, the reason we are interested in doing it is the SQL spec
demands it.

regards, tom lane
