Re: [GENERAL] different sort order in windows and linux version

2006-07-03 Thread Dragan Matic

Tom Lane wrote:

Dragan Matic [EMAIL PROTECTED] writes:
  
I was about to say the same thing. I think that the whole point in 
having a portable database system is that the data inside the database 
should behave the _same way_ no matter what operating system database is 
running on - client shouldn't be aware of the server OS.



So on that argument, we need to eliminate datatypes float8 and float4
forthwith, because they don't behave quite the same on every machine.
And int8 too, because it's not supported on every machine.  And
--enable-integer-datetimes has got to go; in fact configure should
not have any options at all.

  
Ok, I agree on that, I have completely forgotten these datatypes, when I 
wrote the above I was having in mind strings only.

Personally, I'd be perfectly happy with pgsql if I could choose to
make text operations up to 2-3x slower without the fuss of how it's
going to work on a certain platform, in each pgsql version.
  


Fine for you, not so fine for other people with different concerns.

I'm not unsympathetic to your general point, but black-and-white
arguments won't get far in this discussion.  It's all about tradeoffs
... it's most definitely not about one-size-fits-all.

regards, tom lane


  
Sorry if I was being rude, didn't mean to sound that way, I was just 
surprised to see this kind of postgresql behavior. I wouldn't like to be 
misunderstood, we are using postgresql for 6th year now and we are 
extremely satisfied with it. Furthermore we are using Linux servers only 
so the mentioned behavior doesn't affect us at the moment. My original 
question arose when I wanted to do some tests on my notebook and after 
installing postgresql for windows, I was surprised to see that I was not 
getting the same results that I was getting from the other servers, so I 
thought I could ask if I misconfigured the installation, or was that a 
known bug. My concern was that for someone developing application with 
postgresql for windows and then deploying it on postgresql on Linux this 
might be a big surprise.


regards

Dragan Matic




---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [GENERAL] different sort order in windows and linux version

2006-07-03 Thread Martijn van Oosterhout
On Mon, Jul 03, 2006 at 12:55:18AM -0400, Tom Lane wrote:
 Fine for you, not so fine for other people with different concerns.
 
 I'm not unsympathetic to your general point, but black-and-white
 arguments won't get far in this discussion.  It's all about tradeoffs
 ... it's most definitely not about one-size-fits-all.

What I find most confusing is that the patch for adding ICU to postgres
(using a configure switch) has been around for a while, at one point it
was even in the queue for 8.2:

http://archives.postgresql.org/pgsql-patches/2005-08/msg00309.php

It's vanished from there, not sure why. Even so, we still get regular
requests for it. Presumably it was removed here:

http://archives.postgresql.org/pgsql-patches/2006-03/msg00233.php

Given the patch does nothing if you don't enable it, I'm not quite sure
what the downsides would be...

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.


signature.asc
Description: Digital signature


Re: [GENERAL] different sort order in windows and linux version

2006-07-02 Thread Martijn van Oosterhout
On Sat, Jul 01, 2006 at 06:23:07PM -0400, Tom Lane wrote:
 Tomi NA [EMAIL PROTECTED] writes:
  Basically, it comes down to three possibilities, doesn't it:
  1.) use an existing library
  2.) write a pgsql specific implementation
  3.) forget about it and tend to other issues
 
  Personally, I don't really care if it's 1) or 2): I'm just afraid it's
  going to be 3).
  Is this a licencing issue (with regard to ICU beeing under the IBM
  public licence)?
 
 Licensing is a concern --- IBM's appears to be not quite BSD enough.
 Size and portability of the library are concerns.  Performance is a
 concern.  Whether the patch makes the library required or optional is
 a concern (if required, the portability issue becomes a whole lot more
 urgent).  Loss of existing functionality is a concern --- for instance,
 if the patch is such that UTF8 becomes the only supported server
 encoding, it'll probably be rejected forthwith.

Licence - It's the X/MIT licence, which is almost identical to the BSD
licence.

http://dev.icu-project.org/cgi-bin/viewcvs.cgi/*checkout*/icu/license.html
http://en.wikipedia.org/wiki/MIT_License

But I don't think anyone is actually considering importing ICU into the
postgres source tree, are they?

Size - I'm not sure this is relevent since I don't think we want to
incorporate it into postgres itself, just let people use it if they
have it. In any case though, the default dataset is 8MB. This includes
support for every locale and charset it knows about.

If you drop the conversion stuff (because postgres already has that)
you're down to about 4MB.

Since ICU supports userdefined tables, we could provide a single
cross-platform dataset and get the user's ICU library implementation to
use that.

Portability - ICU runs on all the platforms postgres does, AFAICS.

http://dev.icu-project.org/cgi-bin/viewcvs.cgi/icu/readme.html?rev=release-3-4#HowToBuildSupported

Performance - ICU is approximatly four times faster than glibc for
collation. Even once you include keygen time (including conversion) it
comes out about 40% faster.

http://icu.sourceforge.net/charts/collation_icu4c_glibc.html

ICU is not slow.

 Well, the Japanese think that UTF8 is not the solution to all their
 worries, so they won't be happy with a UTF8-only solution.  Likewise,
 those of us who only need single-byte character sets won't be very happy
 with being forced to accept multi-byte processing overhead.

I've not quite understood the japenese problem with Unicode. My
understanding is that it was primarily due to widespread use of broken
converters.

In any case, ICU appears to beat glibc with single byte encodings, even
including the multi-byte conversion.

However, the most important point is that people have said they'll take
the speed hit if they could get consistant collation. For speed you can
always throw more hardware. But no amount of hardware will fix your
collation issues.

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.


signature.asc
Description: Digital signature


Re: [GENERAL] different sort order in windows and linux version

2006-07-02 Thread Dragan Matic

Tomi NA wrote:

On 6/30/06, Martijn van Oosterhout kleptog@svana.org wrote:

On Fri, Jun 30, 2006 at 11:56:19AM +0200, Dragan Matic wrote:
 I have two postgres servers, one on linux (fedora core 5), one on
 windows, both are version 8.1.4.
 


Not beeing able to depend on the engine to consistently collate
strings as simple as the ones Dragan listed is closer to a serious bug
(non-deterministic behaviour in otherwise deterministic functions)
than a RFE, but is certainly nowhere near it's not our problem as it
regularly seems made up to be. The OS(es) simply and obviously
do(es)n't do a good enough job of it.

I was about to say the same thing. I think that the whole point in 
having a portable database system is that the data inside the database 
should behave the _same way_ no matter what operating system database is 
running on - client shouldn't be aware of the server OS. This is clearly 
not the case here. Furthermore, the same thing happens even with en_US 
(on Linux) and English_United States (on windows) collations selected, 
so it is definitely a serious issue with US collation also and not with 
some exotic collation orders only. I think that the only case where it 
doesn't happen is when C collation is selected. It might be 
interesting to see how this issue behaves on other operating systems.



In the past there have existed patches to allow postgres to use ICU for
locale support. It's supposedly not quite as fast, but you will be able
get consistant results across platforms.


Personally, I'd be perfectly happy with pgsql if I could choose to
make text operations up to 2-3x slower without the fuss of how it's
going to work on a certain platform, in each pgsql version.
Furthermore, compiling the server myself is not an option for live
usage: on my current project, I'm not even the one installing the
database servers...sending administrators a binary I configured and
compiled (on Windows, in this case!) and noone but me
tested...b...I get the shivers just thinking about it.
   Recompiling is not an option for me also, I mean I could do it for 
an in-house servers where I am in charge, but our application runs on 
many places and on many servers where recompiling postgres with some 
third-party patches is out of the question. I think the solution where 
postgres would be slower but behaved the same way on all supported 
operating systems would also be acceptable for most people.


Dragan


---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [GENERAL] different sort order in windows and linux version

2006-07-02 Thread Karsten Hilbert
On Sun, Jul 02, 2006 at 12:13:02PM +0200, Martijn van Oosterhout wrote:

 However, the most important point is that people have said they'll take
 the speed hit if they could get consistant collation.
I can second that.

Karsten
-- 
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


Re: [GENERAL] different sort order in windows and linux version

2006-07-02 Thread Agent M

On Jul 2, 2006, at 6:13 AM, Martijn van Oosterhout wrote:

But I don't think anyone is actually considering importing ICU into the
postgres source tree, are they?

Why not?


Size - I'm not sure this is relevent since I don't think we want to
incorporate it into postgres itself, just let people use it if they
have it. In any case though, the default dataset is 8MB. This includes
support for every locale and charset it knows about.

If you drop the conversion stuff (because postgres already has that)
you're down to about 4MB.
Why would you drop the ICU transcoding support instead of the existing 
postgres functions? Why the duplicated effort?




Well, the Japanese think that UTF8 is not the solution to all their
worries, so they won't be happy with a UTF8-only solution.  Likewise,
those of us who only need single-byte character sets won't be very 
happy

with being forced to accept multi-byte processing overhead.


I've not quite understood the japenese problem with Unicode. My
understanding is that it was primarily due to widespread use of broken
converters.


Certain Japanese characters cannot make a reliable round-trip through 
Unicode. ICU uses UTF-16 as its store, so the Japanese folks won't be 
happy with an ICU-only solution. However, it would still be of great 
benefit to allow ICU to handle as much as possible, leaving the string 
encodings to the encoding experts.


At the very least, it would be great to have ICU to handle encoding on 
a per-column basis (perhaps extending the text datatype with encoding 
info). Perhaps this would be a decent stopgap solution? The backend 
protocol would also need a version bump- currently, it converts all 
strings to a single encoding.


¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬
AgentM
[EMAIL PROTECTED]
¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬


---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [GENERAL] different sort order in windows and linux version

2006-07-02 Thread Tomi NA

On 7/2/06, Agent M [EMAIL PROTECTED] wrote:


Certain Japanese characters cannot make a reliable round-trip through
Unicode. ICU uses UTF-16 as its store, so the Japanese folks won't be
happy with an ICU-only solution. However, it would still be of great


Could you explain what you mean and what's special with those characters?


benefit to allow ICU to handle as much as possible, leaving the string
encodings to the encoding experts.

At the very least, it would be great to have ICU to handle encoding on
a per-column basis (perhaps extending the text datatype with encoding
info). Perhaps this would be a decent stopgap solution? The backend
protocol would also need a version bump- currently, it converts all
strings to a single encoding.


Could you give an example of what that would look like in your opinion?
I was thinking more along the lines of a setting in pg_hba.conf where
the server uses or does not use something like ICU...at least as an
intermediate solution.
Adding a LOCALE clause to a column definition (similar to the
ENCODING clause of the CREATE DATABASE statement) would solve most
(not all) problems with a default locale.
There still might be some non-deterministic behaviour with operations
between strings in different locales but it's far from a showstopper.

t.n.a.

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [GENERAL] different sort order in windows and linux version

2006-07-02 Thread Martijn van Oosterhout
On Sun, Jul 02, 2006 at 12:25:43PM -0400, Agent M wrote:
 On Jul 2, 2006, at 6:13 AM, Martijn van Oosterhout wrote:
 But I don't think anyone is actually considering importing ICU into the
 postgres source tree, are they?
 Why not?

Because it's a project of similar size to postgres and probably nearly
as old and I don't think anyone here actually wants to maintain it.

I mean, we could incorporate the source for readline, openssl,
kerberos, the C library but why. That project has maintainers already
and we only wan to use it, not fork it.

 If you drop the conversion stuff (because postgres already has that)
 you're down to about 4MB.
 Why would you drop the ICU transcoding support instead of the existing 
 postgres functions? Why the duplicated effort?

Because we would want to be bug-for-bug compatable to previous
releases. I suppose it would be possible if someone checked that the
end result is the same.

 Certain Japanese characters cannot make a reliable round-trip through 
 Unicode. ICU uses UTF-16 as its store, so the Japanese folks won't be 
 happy with an ICU-only solution. However, it would still be of great 
 benefit to allow ICU to handle as much as possible, leaving the string 
 encodings to the encoding experts.

We don't need round-trip through unicode, since we're only doing one
way conversions for the purpose of collation.

BTW, this site seems to have a good discussion of Japanese characters
and Unicode.

http://www.jbrowse.com/text/unij.html

 At the very least, it would be great to have ICU to handle encoding on 
 a per-column basis (perhaps extending the text datatype with encoding 
 info). Perhaps this would be a decent stopgap solution? The backend 
 protocol would also need a version bump- currently, it converts all 
 strings to a single encoding.

That's called SQL COLLATE support and that's an order of magnitude
harder than adding support for ICU. See previous dicussion on -hackers.

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.


signature.asc
Description: Digital signature


Re: [GENERAL] different sort order in windows and linux version

2006-07-02 Thread Tom Lane
Dragan Matic [EMAIL PROTECTED] writes:
 I was about to say the same thing. I think that the whole point in 
 having a portable database system is that the data inside the database 
 should behave the _same way_ no matter what operating system database is 
 running on - client shouldn't be aware of the server OS.

So on that argument, we need to eliminate datatypes float8 and float4
forthwith, because they don't behave quite the same on every machine.
And int8 too, because it's not supported on every machine.  And
--enable-integer-datetimes has got to go; in fact configure should
not have any options at all.

 Personally, I'd be perfectly happy with pgsql if I could choose to
 make text operations up to 2-3x slower without the fuss of how it's
 going to work on a certain platform, in each pgsql version.

Fine for you, not so fine for other people with different concerns.

I'm not unsympathetic to your general point, but black-and-white
arguments won't get far in this discussion.  It's all about tradeoffs
... it's most definitely not about one-size-fits-all.

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [GENERAL] different sort order in windows and linux version

2006-07-01 Thread Martijn van Oosterhout
On Fri, Jun 30, 2006 at 07:29:12PM +0200, Tomi NA wrote:
 If I sound harsh, please excuse me, but I feel like I'm the only one
 who thinks these encoding problems (collation, upper/lowercase,
 multiple languages in a single database) are serious...nobody seems to
 share the sentiment. Ah well...

I agree with you, however the resistance (AFAICS) comes mostly from the
fact that we would be depending on an external library to do it. I
don't think postgres should try doing it itself, given that the unicode
character databases are quite large by themselves.

Alternativly, the postgres group could produce a customised version of
ICU that's smaller (the website has details about how). But any case,
this problem will need to be addressed at some point.

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.


signature.asc
Description: Digital signature


Re: [GENERAL] different sort order in windows and linux version

2006-07-01 Thread Tomi NA

On 7/1/06, Martijn van Oosterhout kleptog@svana.org wrote:

On Fri, Jun 30, 2006 at 07:29:12PM +0200, Tomi NA wrote:
 If I sound harsh, please excuse me, but I feel like I'm the only one
 who thinks these encoding problems (collation, upper/lowercase,
 multiple languages in a single database) are serious...nobody seems to
 share the sentiment. Ah well...

I agree with you, however the resistance (AFAICS) comes mostly from the
fact that we would be depending on an external library to do it. I
don't think postgres should try doing it itself, given that the unicode
character databases are quite large by themselves.

Alternativly, the postgres group could produce a customised version of
ICU that's smaller (the website has details about how). But any case,
this problem will need to be addressed at some point.


Basically, it comes down to three possibilities, doesn't it:
1.) use an existing library
2.) write a pgsql specific implementation
3.) forget about it and tend to other issues

Personally, I don't really care if it's 1) or 2): I'm just afraid it's
going to be 3).
Is this a licencing issue (with regard to ICU beeing under the IBM
public licence)? A plugin architecture (to get rid of licencing
headaches) issue?
Are there any other libraries that might do the job?

To be perfectly honest, I've had to tackle so many problems with
encodings during the years I'd make it punishable by law to use
anything *but* UTF...but I'm not president of the Galaxy yet, Zaphod
is. (-:

t.n.a.

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org


Re: [GENERAL] different sort order in windows and linux version

2006-07-01 Thread Tom Lane
Tomi NA [EMAIL PROTECTED] writes:
 Basically, it comes down to three possibilities, doesn't it:
 1.) use an existing library
 2.) write a pgsql specific implementation
 3.) forget about it and tend to other issues

 Personally, I don't really care if it's 1) or 2): I'm just afraid it's
 going to be 3).
 Is this a licencing issue (with regard to ICU beeing under the IBM
 public licence)?

Licensing is a concern --- IBM's appears to be not quite BSD enough.
Size and portability of the library are concerns.  Performance is a
concern.  Whether the patch makes the library required or optional is
a concern (if required, the portability issue becomes a whole lot more
urgent).  Loss of existing functionality is a concern --- for instance,
if the patch is such that UTF8 becomes the only supported server
encoding, it'll probably be rejected forthwith.

 A plugin architecture (to get rid of licencing headaches) issue?

AFAIK making it a plugin won't alleviate anyone's licensing worries.
Certainly that's not going to answer if the library is GPL.

 To be perfectly honest, I've had to tackle so many problems with
 encodings during the years I'd make it punishable by law to use
 anything *but* UTF...but I'm not president of the Galaxy yet, Zaphod
 is. (-:

Well, the Japanese think that UTF8 is not the solution to all their
worries, so they won't be happy with a UTF8-only solution.  Likewise,
those of us who only need single-byte character sets won't be very happy
with being forced to accept multi-byte processing overhead.

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend


[GENERAL] different sort order in windows and linux version

2006-06-30 Thread Dragan Matic
I have two postgres servers, one on linux (fedora core 5), one on 
windows, both are version 8.1.4.


Both databases are initialized with locale Croatian and win1250 encoding.

running pg_controldata on windows returns this

LC_COLLATE:  Croatian_Croatia.1250
LC_CTYPE:Croatian_Croatia.1250

the same command on linux returns this

LC_COLLATE:hr_HR
LC_CTYPE:  hr_HR

which is the same, I suppose.

the sample databases are both initialized the same way

CREATE DATABASE sample
 WITH OWNER = postgres
  ENCODING = 'WIN1250'
  TABLESPACE = pg_default;


both databases have the same sample table

CREATE TABLE sample
(
 some_text char(13) NOT NULL
);


when I execute a query 'SELECT SOME_TEXT FROM SAMPLE ORDER BY SOME_TEXT' 
I get different sort order on these two servers.


On the left side is windows server sort order, and on the right side is 
linux server sort order. all values are left padded with spaces.


Postgres windows  Postgres linux
  
  0   0

  1 000
  2   0
  3   1
  4  11
  5  12
  6 123
  7 125
  8  13
  9  14
  a  15
  A   2
  b 2343255
  b  234455
  B 243
  c  25
  C   3
  d 31TA001
  e 32NU280
  f  35
  g   4
  z  45
 11   5
 12  55
 13   56455
 14   6
 15  65
 25   7
 35  75
 45   8
 55  85
 65   9
 75   a
 85   A
 aa  aa
 ab aaa
 aB aab
 Ab  ab
 AB  aB
 ba  Ab
 bb  AB
 cc aba
000 abA
123 aBa
125 AbA
243 ABA
aaa abb
aab abb
aba abc
abA abc
aBa abC
AbA Abc
ABA ABC
abb   abcde
abb acc
abc   b
abc   b
abC   B
Abc  ba
ABC baa
acc bab
baa  bb
bab bba
bba bbb
bbb bca
bca   c
zzz   C
zzz  cc
  56455   d
  abcde   e
 234455   f
2343255   g
31TA001   z
32NU280 zzz
  0 zzz
  z   z

The only thing I can think of is that somehow these databases weren't 
initialized in the same way, if so what is the difference?


Tnx in advance

Dragan

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [GENERAL] different sort order in windows and linux version

2006-06-30 Thread Martijn van Oosterhout
On Fri, Jun 30, 2006 at 11:56:19AM +0200, Dragan Matic wrote:
 I have two postgres servers, one on linux (fedora core 5), one on 
 windows, both are version 8.1.4.
 
 Both databases are initialized with locale Croatian and win1250 encoding.
 
 running pg_controldata on windows returns this
 
 LC_COLLATE:  Croatian_Croatia.1250
 LC_CTYPE:Croatian_Croatia.1250
 
 the same command on linux returns this
 
 LC_COLLATE:hr_HR
 LC_CTYPE:  hr_HR
 
 which is the same, I suppose.

Well, apparently not. Postgres makes no attempt to understand
collations nor try to determine whether they make sense. If you want to
have the same collation on Windows and Linux, I think you're going to
have trouble.

In the past there have existed patches to allow postgres to use ICU for
locale support. It's supposedly not quite as fast, but you will be able
get consistant results across platforms.

Hope this helps,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.


signature.asc
Description: Digital signature