Re: [GENERAL] Collate order on Mac OS X, text with diacritics in UTF-8

2010-01-18 Thread Martin Flahault

Very interesting discussion indeed.

It seems that Postgresql:The world's most advanced open source database can 
not work properly on Mac OS X: the world's most advanced operating system and 
FreeBSD.

Don't you think postgresql.org should remove from their download page the links 
to FreeBSD and Mac OS X binary packages?

Martin Flahault 



Le 18 janv. 2010 à 08:10, Martijn van Oosterhout a écrit :

 On Sat, Jan 16, 2010 at 09:10:53PM +, Greg Stark wrote:
 Switching to ICU means trading our current inconsistency from platform
 to platform for a different inconsistency which would be better in
 some cases and worse in others.
 
 Or, you can have the cake and eat it too. That is, aim for the end goal
 and let people choose what library they want to use for sorting (that
 is, extend the meaning of the locale identifier). Patches for this
 should be in the archives somewhere. As I recall the reason this was
 rejected is that *BSD lack the capability of handling multiple
 collation algorithms at all at the libc level (that is, if you don't
 just tell people to use ICU in that case).
 
 Mac OS X doesn't have great POSIX locale support but at least they
 implemented strcoll_l.
 
 Have a nice day,
 -- 
 Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 Please line up in a tree and maintain the heap invariant while 
 boarding. Thank you for flying nlogn airlines.





-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Collate order on Mac OS X, text with diacritics in UTF-8

2010-01-17 Thread Martijn van Oosterhout
On Sat, Jan 16, 2010 at 09:10:53PM +, Greg Stark wrote:
 Switching to ICU means trading our current inconsistency from platform
 to platform for a different inconsistency which would be better in
 some cases and worse in others.

Or, you can have the cake and eat it too. That is, aim for the end goal
and let people choose what library they want to use for sorting (that
is, extend the meaning of the locale identifier). Patches for this
should be in the archives somewhere. As I recall the reason this was
rejected is that *BSD lack the capability of handling multiple
collation algorithms at all at the libc level (that is, if you don't
just tell people to use ICU in that case).

Mac OS X doesn't have great POSIX locale support but at least they
implemented strcoll_l.

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 Please line up in a tree and maintain the heap invariant while 
 boarding. Thank you for flying nlogn airlines.


signature.asc
Description: Digital signature


Re: [GENERAL] Collate order on Mac OS X, text with diacritics in UTF-8

2010-01-16 Thread Greg Stark
On Thu, Jan 14, 2010 at 6:15 AM, Craig Ringer
cr...@postnewspapers.com.au wrote:
 Out of interest: Why not?

 There's plenty of discussion in the archives about it, but basically
 ICU would represent a pretty enormous dependency and would lock us in
 to having no other backend encoding but UTF8.

 Thanks. You're right - I should've just STFA ;-) so I appreciate the
 brief explanation.

There's also the question of whether being inconsistent with the rest
of the system is really the right thing at all. If a programmer does a
sort in the database and then writes application code using the same
collation on the same system which depends on the data being sorted
it's nice that that works. Or if an admin presorts the data using sort
before doing a data load he might reasonable expect the table to be
effectively clustered.

Switching to ICU means trading our current inconsistency from platform
to platform for a different inconsistency which would be better in
some cases and worse in others.


-- 
greg

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Collate order on Mac OS X, text with diacritics in UTF-8

2010-01-13 Thread Maximilian Tyrtania
Am 12.01.2010 um 12:36 schrieb Martin Flahault:

 We have spend some time evaluating PostgreSQL and we can't get correct 
 outputs with the ORDER BY command.
 LC_COLLATE and LC_CTYPE are set to fr_FR.UTF-8.
 
 It seems there is a known problem with the collating order of text including 
 diacritics with the UTF8 encodings on BSD systems.
 
 Does anyone know a workaround ?

The best i've seen so far is:

CREATE OR REPLACE FUNCTION f_getorderbyfriendlyversion(texttoconvert text)
 RETURNS text AS
$BODY$
select translate(upper($1),'ÄÖÜ','AOU')--add french diacritical characters here
$BODY$
 LANGUAGE 'sql' IMMUTABLE STRICT
 COST 100;
ALTER FUNCTION f_getorderbyfriendlyversion(text) OWNER TO postgres;

CREATE INDEX idx_mytable_myfield_orderbyfriendly
 ON mytable
 USING btree
 (f_getorderbyfriendlyversion(myfield::text));

Select * from mytable order by f_getorderbyfriendlyversion(myfield);

Not an ideal solution, but it seems to perform quite well.

Best wishes from Berlin,

Maximilian Tyrtania

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Collate order on Mac OS X, text with diacritics in UTF-8

2010-01-13 Thread Craig Ringer

On 12/01/2010 7:36 PM, Martin Flahault wrote:

Hi,

We are a software publisher searching for a new DBMS for our software.
We have more than one hundred installed servers, running Mac OS and a
Primebase database.

We have spend some time evaluating PostgreSQL and we can't get correct
outputs with the ORDER BY command.


Can you provide a sample? Include a table of sample values, an example 
query, its output, and what you'd expect to get instead? And why?


--
Craig Ringer

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Collate order on Mac OS X, text with diacritics in UTF-8

2010-01-13 Thread Martin Flahault

Here is an exemple :

postgres=# create database newbase;
CREATE DATABASE
postgres=# \c newbase;
psql (8.4.2)
You are now connected to database newbase.
newbase=# create table t1 (contenu text);
CREATE TABLE
newbase=# insert into t1 values ('a'), ('e'), ('à'), ('é'), ('A'), ('E');
INSERT 0 6

newbase=# select * from t1 order by contenu;
 contenu 
-
 A
 E
 a
 e
 à
 é
(6 rows)

newbase=# select * from t1 order by upper(contenu);
 contenu 
-
 a
 A
 e
 E
 à
 é
(6 rows)


Here is the encoding informations :

newbase=# \encoding
UTF8
newbase=# show lc_collate;
 lc_collate 

 fr_FR
(1 row)

newbase=# show lc_ctype;
 lc_ctype 
--
 fr_FR
(1 row)


As with others DBMS (MySQL for example), diacritics should be ignored when 
determining the sort order. Here is the expected output:
 a
 à
 A
 e
 é 
 E


It seems there is a problem with the collating order on BSD systems with 
diacritics using UTF8.
If you put this text :
a
A
à
é
e
E

in a UTF8 text file and use the sort command on it, you will have the same 
wrong output as with PostgreSQL :
A
E
a
e
à
é

Hope this will help,

Martin



Re: [GENERAL] Collate order on Mac OS X, text with diacritics in UTF-8

2010-01-13 Thread Martijn van Oosterhout
On Wed, Jan 13, 2010 at 04:15:06PM +0100, Martin Flahault wrote:

[postgres]
 newbase=# select * from t1 order by contenu;
  contenu 
 -
  A
  E
  a
  e

Postgresql outputs whatever the C library does on the underlying
system. The quality of this varies wildly.
  à
 As with others DBMS (MySQL for example), diacritics should be ignored when 
 determining the sort order. Here is the expected output:

MySQL implements the unicode collation algorithm, which means it
essentially does what you want.
 
 It seems there is a problem with the collating order on BSD systems with 
 diacritics using UTF8.

Last I checked, BSD did not support useful sorting on UTF-8 at all, so
it's not surprised it doesn't work.

 in a UTF8 text file and use the sort command on it, you will have the same 
 wrong output as with PostgreSQL :

Yes, that's the basic idea. Mac OS X apparently provides ICU underneath
for programs that would like true unicode collation, but there is
little chance that postgresql will ever use this.

Hope this helps,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 Please line up in a tree and maintain the heap invariant while 
 boarding. Thank you for flying nlogn airlines.


signature.asc
Description: Digital signature


Re: [GENERAL] Collate order on Mac OS X, text with diacritics in UTF-8

2010-01-13 Thread Craig Ringer

On 13/01/2010 11:15 PM, Martin Flahault wrote:


It seems there is a problem with the collating order on BSD systems with
diacritics using UTF8.
If you put this text :
a
A
à
é
e
E

in a UTF8 text file and use the sort command on it, you will have the
same wrong output as with PostgreSQL :
A
E
a
e
à
é


First: PostgreSQL expects the OS to behave correctly and sort according 
to the locale. It relies on the C library for this. If the C library 
doesn't do it right, PostgreSQL won't do it right either. So you need to 
get Mac OS X to do the right thing.


Your results match what I get on a Linux system without a properly 
generated fr_FR.UTF-8 locale. Libc falls back on the C locale, which 
sorts that way.


If I generate the fr_FR.UTF-8 locale and run the sort (on the file x), 
I get the desired result:


LANG=fr_FR.UTF-8 LC_ALL=fr_FR.UTF-8 sort x
a
A
à
e
E
é

I don't know Mac OS X well, but this is making me wonder if maybe you're 
just missing the required information for the locale, so libc is falling 
back on the C locale.


(Of course, being Mac OS X there are probably at least three out of date 
or simply false man pages describing the behaviour, none of which 
reflect the reality of a magic config key buried somewhere in NetInfo, 
for which the documentation is also completely out of date. Bitter? Me? 
Yeah, I admin a bunch of OS X machines on a business network.)


Hmm... a quick test suggests that Mac OS X (testing on 10.4) at least 
*thinks* it supports the fr_FR.UTF-8 locale:


osx104$ LANG=xxx LC_ALL=xxx locale
LANG=xxx
LC_COLLATE=C
LC_CTYPE=C
LC_MESSAGES=C
LC_MONETARY=C
LC_NUMERIC=C
LC_TIME=C
LC_ALL=C

osx104$ LANG=fr_FR.UTF-8 LC_ALL=fr_FR.UTF-8 locale
LANG=fr_FR.UTF-8
LC_COLLATE=fr_FR.UTF-8
LC_CTYPE=fr_FR.UTF-8
LC_MESSAGES=fr_FR.UTF-8
LC_MONETARY=fr_FR.UTF-8
LC_NUMERIC=fr_FR.UTF-8
LC_TIME=fr_FR.UTF-8
LC_ALL=fr_FR.UTF-8

osx104$ locale -a  | grep fr_FR
fr_FR
fr_FR.ISO8859-1
fr_FR.ISO8859-15
fr_FR.UTF-8

... yet it clearly doesn't:

osx104$ LANG=C LC_ALL=C sort x
A
E
a
e
à
é
osx104$ LANG=fr_FR.UTF-8 LC_ALL=fr_FR.UTF-8 sort x
A
E
a
e
à
é
osx104$ LANG=fr_FR.ISO8859-1 LC_ALL=fr_FR.ISO8859-1 sort x
A
E
a
e
à
é

Mac OS X seems to keep its locale config in /usr/share/locale . Looking 
there, there are clearly LC_COLLATE files for fr_FR.UTF-8 . However, 
they're identical to those for en_US.UTF-8:


osx104$ cd /usr/share/locale
osx104$ diff fr_FR.UTF-8/LC_COLLATE en_US.UTF-8/LC_COLLATE

... so your OS's localized collation support is broken/missing, at least 
if the same is true for more modern versions of OS X.


--
Craig Ringer

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Collate order on Mac OS X, text with diacritics in UTF-8

2010-01-13 Thread Craig Ringer
Martijn van Oosterhout wrote:

 in a UTF8 text file and use the sort command on it, you will have the same 
 wrong output as with PostgreSQL :
 
 Yes, that's the basic idea. Mac OS X apparently provides ICU underneath
 for programs that would like true unicode collation, but there is
 little chance that postgresql will ever use this.

Out of interest: Why not?

Using ICU would permit Pg to be independent of libc's collation rules,
finally permitting things like specifying a specific collation for a
textual sort. It'd make mixing data from different locales in a database
a lot easier (read: possible to do correctly).

Is this just a matter of nobody cares enough to produce a solid, tested
patch with equivalent performance that doesn't turn people who try to
review it green with disgust ... or are there specific reasons why
using something like ICU instead of libc's locale support is not
appropriate for Pg?

--
Craig Ringer

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Collate order on Mac OS X, text with diacritics in UTF-8

2010-01-13 Thread Tom Lane
Craig Ringer cr...@postnewspapers.com.au writes:
 Martijn van Oosterhout wrote:
 Yes, that's the basic idea. Mac OS X apparently provides ICU underneath
 for programs that would like true unicode collation, but there is
 little chance that postgresql will ever use this.

 Out of interest: Why not?

There's plenty of discussion in the archives about it, but basically
ICU would represent a pretty enormous dependency and would lock us in
to having no other backend encoding but UTF8.

The state of OS X's POSIX-spec locale support is pretty pitiful, but on
the whole I'd say if you need better UTF8 locale support you could use
another OS.

regards, tom lane

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Collate order on Mac OS X, text with diacritics in UTF-8

2010-01-13 Thread Craig Ringer
Tom Lane wrote:
 Craig Ringer cr...@postnewspapers.com.au writes:
 Martijn van Oosterhout wrote:
 Yes, that's the basic idea. Mac OS X apparently provides ICU underneath
 for programs that would like true unicode collation, but there is
 little chance that postgresql will ever use this.
 
 Out of interest: Why not?
 
 There's plenty of discussion in the archives about it, but basically
 ICU would represent a pretty enormous dependency and would lock us in
 to having no other backend encoding but UTF8.

Thanks. You're right - I should've just STFA ;-) so I appreciate the
brief explanation.

 The state of OS X's POSIX-spec locale support is pretty pitiful, but on
 the whole I'd say if you need better UTF8 locale support you could use
 another OS.

That's my personal opinion too ... I have the pleasure of
administrating an OS X Server and six Mac Pro clients at work, and have
become well acquainted with the exciting variety of bugs, undocumented
features, and bizarre quirks of that particular OS. POSIX locale
issues are the least of its issues.

Alas, people will want to run Pg on it anyway, especially when bundling
with an app. It'd be nice if it could be made to work smoothly ... but I
certainly don't care enough to try! Like you, I favour using an OS that
follows the specs it claims to support instead.

Perhaps someone who wants to use Mac OS X and Pg for their product will
come forward with some compat wrapper functions for the localizable
libc/posix functions, so Pg can just be built against the wrapper and
the rest of us need not care about OS X's bugs.

--
Craig Ringer

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Collate order on Mac OS X, text with diacritics in UTF-8

2010-01-13 Thread Scott Marlowe
On Wed, Jan 13, 2010 at 11:15 PM, Craig Ringer
cr...@postnewspapers.com.au wrote:
 Perhaps someone who wants to use Mac OS X and Pg for their product will
 come forward with some compat wrapper functions for the localizable
 libc/posix functions, so Pg can just be built against the wrapper and
 the rest of us need not care about OS X's bugs.

I know this sounds crazy, but couldn't Apple be bothered to fix their OS?  :)

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Collate order on Mac OS X, text with diacritics in UTF-8

2010-01-13 Thread Tom Lane
Craig Ringer cr...@postnewspapers.com.au writes:
 Tom Lane wrote:
 The state of OS X's POSIX-spec locale support is pretty pitiful, but on
 the whole I'd say if you need better UTF8 locale support you could use
 another OS.

 Alas, people will want to run Pg on it anyway, especially when bundling
 with an app. It'd be nice if it could be made to work smoothly ... but I
 certainly don't care enough to try! Like you, I favour using an OS that
 follows the specs it claims to support instead.

For the record, I *like* OS X; I'm typing this on a Macbook Pro.  But
it doesn't do everything, and one of the things it doesn't do well is
POSIX-spec locale support.

As you now know from having looked at the archives, there've been many
discussions of changing PG to not rely on the platform's locale
support.  But the bang-for-buck ratio of any such change doesn't seem
very attractive --- we have a lot of higher priority things to spend
our finite development manpower on.

regards, tom lane

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Collate order on Mac OS X, text with diacritics in UTF-8

2010-01-13 Thread Craig Ringer
Scott Marlowe wrote:
 On Wed, Jan 13, 2010 at 11:15 PM, Craig Ringer
 cr...@postnewspapers.com.au wrote:
 Perhaps someone who wants to use Mac OS X and Pg for their product will
 come forward with some compat wrapper functions for the localizable
 libc/posix functions, so Pg can just be built against the wrapper and
 the rest of us need not care about OS X's bugs.
 
 I know this sounds crazy, but couldn't Apple be bothered to fix their OS?  :)

I see you don't use Macs much :-P

( Goes back to researching a way to make network automounts on login
work and properly support search in 10.6, since the only way that used
to work in 10.4 is broken by 10.6 )

--
Craig Ringer

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] Collate order on Mac OS X, text with diacritics in UTF-8

2010-01-12 Thread Martin Flahault
Hi,

We are a software publisher searching for a new DBMS for our software. We have 
more than one hundred installed servers, running Mac OS and a Primebase 
database.

We have spend some time evaluating PostgreSQL and we can't get correct outputs 
with the ORDER BY command.
LC_COLLATE and LC_CTYPE are set to fr_FR.UTF-8.

It seems there is a known problem with the collating order of text including 
diacritics with the UTF8 encodings on BSD systems.

Does anyone know a workaround ?

Thanks for your answers,

Martin