Re: [GENERAL] different sort order in windows and linux version
On Mon, Jul 03, 2006 at 12:55:18AM -0400, Tom Lane wrote: > Fine for you, not so fine for other people with different concerns. > > I'm not unsympathetic to your general point, but black-and-white > arguments won't get far in this discussion. It's all about tradeoffs > ... it's most definitely not about one-size-fits-all. What I find most confusing is that the patch for adding ICU to postgres (using a configure switch) has been around for a while, at one point it was even in the queue for 8.2: http://archives.postgresql.org/pgsql-patches/2005-08/msg00309.php It's vanished from there, not sure why. Even so, we still get regular requests for it. Presumably it was removed here: http://archives.postgresql.org/pgsql-patches/2006-03/msg00233.php Given the patch does nothing if you don't enable it, I'm not quite sure what the downsides would be... Have a nice day, -- Martijn van Oosterhout http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to > litigate. signature.asc Description: Digital signature
Re: [GENERAL] different sort order in windows and linux version
Tom Lane wrote: Dragan Matic <[EMAIL PROTECTED]> writes: I was about to say the same thing. I think that the whole point in having a portable database system is that the data inside the database should behave the _same way_ no matter what operating system database is running on - client shouldn't be aware of the server OS. So on that argument, we need to eliminate datatypes float8 and float4 forthwith, because they don't behave quite the same on every machine. And int8 too, because it's not supported on every machine. And --enable-integer-datetimes has got to go; in fact configure should not have any options at all. Ok, I agree on that, I have completely forgotten these datatypes, when I wrote the above I was having in mind strings only. Personally, I'd be perfectly happy with pgsql if I could choose to make text operations up to 2-3x slower without the fuss of how it's going to work on a certain platform, in each pgsql version. Fine for you, not so fine for other people with different concerns. I'm not unsympathetic to your general point, but black-and-white arguments won't get far in this discussion. It's all about tradeoffs ... it's most definitely not about one-size-fits-all. regards, tom lane Sorry if I was being rude, didn't mean to sound that way, I was just surprised to see this kind of postgresql behavior. I wouldn't like to be misunderstood, we are using postgresql for 6th year now and we are extremely satisfied with it. Furthermore we are using Linux servers only so the mentioned behavior doesn't affect us at the moment. My original question arose when I wanted to do some tests on my notebook and after installing postgresql for windows, I was surprised to see that I was not getting the same results that I was getting from the other servers, so I thought I could ask if I misconfigured the installation, or was that a known bug. My concern was that for someone developing application with postgresql for windows and then deploying it on postgresql on Linux this might be a big surprise. regards Dragan Matic ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [GENERAL] different sort order in windows and linux version
Dragan Matic <[EMAIL PROTECTED]> writes: > I was about to say the same thing. I think that the whole point in > having a portable database system is that the data inside the database > should behave the _same way_ no matter what operating system database is > running on - client shouldn't be aware of the server OS. So on that argument, we need to eliminate datatypes float8 and float4 forthwith, because they don't behave quite the same on every machine. And int8 too, because it's not supported on every machine. And --enable-integer-datetimes has got to go; in fact configure should not have any options at all. >> Personally, I'd be perfectly happy with pgsql if I could choose to >> make text operations up to 2-3x slower without the fuss of how it's >> going to work on a certain platform, in each pgsql version. Fine for you, not so fine for other people with different concerns. I'm not unsympathetic to your general point, but black-and-white arguments won't get far in this discussion. It's all about tradeoffs ... it's most definitely not about one-size-fits-all. regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [GENERAL] different sort order in windows and linux version
On Sun, Jul 02, 2006 at 12:25:43PM -0400, Agent M wrote: > On Jul 2, 2006, at 6:13 AM, Martijn van Oosterhout wrote: > >But I don't think anyone is actually considering importing ICU into the > >postgres source tree, are they? > Why not? Because it's a project of similar size to postgres and probably nearly as old and I don't think anyone here actually wants to maintain it. I mean, we could incorporate the source for readline, openssl, kerberos, the C library but why. That project has maintainers already and we only wan to use it, not fork it. > >If you drop the conversion stuff (because postgres already has that) > >you're down to about 4MB. > Why would you drop the ICU transcoding support instead of the existing > postgres functions? Why the duplicated effort? Because we would want to be bug-for-bug compatable to previous releases. I suppose it would be possible if someone checked that the end result is the same. > Certain Japanese characters cannot make a reliable round-trip through > Unicode. ICU uses UTF-16 as its store, so the Japanese folks won't be > happy with an ICU-only solution. However, it would still be of great > benefit to allow ICU to handle as much as possible, leaving the string > encodings to the encoding experts. We don't need round-trip through unicode, since we're only doing one way conversions for the purpose of collation. BTW, this site seems to have a good discussion of Japanese characters and Unicode. http://www.jbrowse.com/text/unij.html > At the very least, it would be great to have ICU to handle encoding on > a per-column basis (perhaps extending the text datatype with encoding > info). Perhaps this would be a decent stopgap solution? The backend > protocol would also need a version bump- currently, it converts all > strings to a single encoding. That's called SQL COLLATE support and that's an order of magnitude harder than adding support for ICU. See previous dicussion on -hackers. Have a nice day, -- Martijn van Oosterhout http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to > litigate. signature.asc Description: Digital signature
Re: [GENERAL] different sort order in windows and linux version
On 7/2/06, Agent M <[EMAIL PROTECTED]> wrote: Certain Japanese characters cannot make a reliable round-trip through Unicode. ICU uses UTF-16 as its store, so the Japanese folks won't be happy with an ICU-only solution. However, it would still be of great Could you explain what you mean and what's special with those characters? benefit to allow ICU to handle as much as possible, leaving the string encodings to the encoding experts. At the very least, it would be great to have ICU to handle encoding on a per-column basis (perhaps extending the text datatype with encoding info). Perhaps this would be a decent stopgap solution? The backend protocol would also need a version bump- currently, it converts all strings to a single encoding. Could you give an example of what that would look like in your opinion? I was thinking more along the lines of a setting in pg_hba.conf where the server uses or does not use something like ICU...at least as an intermediate solution. Adding a "LOCALE" clause to a column definition (similar to the "ENCODING" clause of the "CREATE DATABASE" statement) would solve most (not all) problems with a default locale. There still might be some non-deterministic behaviour with operations between strings in different locales but it's far from a showstopper. t.n.a. ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [GENERAL] different sort order in windows and linux version
On Jul 2, 2006, at 6:13 AM, Martijn van Oosterhout wrote: But I don't think anyone is actually considering importing ICU into the postgres source tree, are they? Why not? Size - I'm not sure this is relevent since I don't think we want to incorporate it into postgres itself, just let people use it if they have it. In any case though, the default dataset is 8MB. This includes support for every locale and charset it knows about. If you drop the conversion stuff (because postgres already has that) you're down to about 4MB. Why would you drop the ICU transcoding support instead of the existing postgres functions? Why the duplicated effort? Well, the Japanese think that UTF8 is not the solution to all their worries, so they won't be happy with a UTF8-only solution. Likewise, those of us who only need single-byte character sets won't be very happy with being forced to accept multi-byte processing overhead. I've not quite understood the japenese problem with Unicode. My understanding is that it was primarily due to widespread use of broken converters. Certain Japanese characters cannot make a reliable round-trip through Unicode. ICU uses UTF-16 as its store, so the Japanese folks won't be happy with an ICU-only solution. However, it would still be of great benefit to allow ICU to handle as much as possible, leaving the string encodings to the encoding experts. At the very least, it would be great to have ICU to handle encoding on a per-column basis (perhaps extending the text datatype with encoding info). Perhaps this would be a decent stopgap solution? The backend protocol would also need a version bump- currently, it converts all strings to a single encoding. ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ AgentM [EMAIL PROTECTED] ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [GENERAL] different sort order in windows and linux version
On Sun, Jul 02, 2006 at 12:13:02PM +0200, Martijn van Oosterhout wrote: > However, the most important point is that people have said they'll take > the speed hit if they could get consistant collation. I can second that. Karsten -- GPG key ID E4071346 @ wwwkeys.pgp.net E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346 ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [GENERAL] different sort order in windows and linux version
Tomi NA wrote: On 6/30/06, Martijn van Oosterhout wrote: On Fri, Jun 30, 2006 at 11:56:19AM +0200, Dragan Matic wrote: > I have two postgres servers, one on linux (fedora core 5), one on > windows, both are version 8.1.4. > Not beeing able to depend on the engine to consistently collate strings as simple as the ones Dragan listed is closer to a serious bug (non-deterministic behaviour in otherwise deterministic functions) than a RFE, but is certainly nowhere near "it's not our problem" as it regularly seems made up to be. The OS(es) simply and obviously do(es)n't do a good enough job of it. I was about to say the same thing. I think that the whole point in having a portable database system is that the data inside the database should behave the _same way_ no matter what operating system database is running on - client shouldn't be aware of the server OS. This is clearly not the case here. Furthermore, the same thing happens even with en_US (on Linux) and English_United States (on windows) collations selected, so it is definitely a serious issue with US collation also and not with some exotic collation orders only. I think that the only case where it doesn't happen is when "C" collation is selected. It might be interesting to see how this issue behaves on other operating systems. In the past there have existed patches to allow postgres to use ICU for locale support. It's supposedly not quite as fast, but you will be able get consistant results across platforms. Personally, I'd be perfectly happy with pgsql if I could choose to make text operations up to 2-3x slower without the fuss of how it's going to work on a certain platform, in each pgsql version. Furthermore, compiling the server myself is not an option for live usage: on my current project, I'm not even the one installing the database servers...sending administrators a binary I configured and compiled (on Windows, in this case!) and noone but me tested...b...I get the shivers just thinking about it. Recompiling is not an option for me also, I mean I could do it for an in-house servers where I am in charge, but our application runs on many places and on many servers where recompiling postgres with some third-party patches is out of the question. I think the solution where postgres would be slower but behaved the same way on all supported operating systems would also be acceptable for most people. Dragan ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [GENERAL] different sort order in windows and linux version
On Sat, Jul 01, 2006 at 06:23:07PM -0400, Tom Lane wrote: > "Tomi NA" <[EMAIL PROTECTED]> writes: > > Basically, it comes down to three possibilities, doesn't it: > > 1.) use an existing library > > 2.) write a pgsql specific implementation > > 3.) forget about it and tend to other issues > > > Personally, I don't really care if it's 1) or 2): I'm just afraid it's > > going to be 3). > > Is this a licencing issue (with regard to ICU beeing under the IBM > > public licence)? > > Licensing is a concern --- IBM's appears to be not quite BSD enough. > Size and portability of the library are concerns. Performance is a > concern. Whether the patch makes the library required or optional is > a concern (if required, the portability issue becomes a whole lot more > urgent). Loss of existing functionality is a concern --- for instance, > if the patch is such that UTF8 becomes the only supported server > encoding, it'll probably be rejected forthwith. Licence - It's the X/MIT licence, which is almost identical to the BSD licence. http://dev.icu-project.org/cgi-bin/viewcvs.cgi/*checkout*/icu/license.html http://en.wikipedia.org/wiki/MIT_License But I don't think anyone is actually considering importing ICU into the postgres source tree, are they? Size - I'm not sure this is relevent since I don't think we want to incorporate it into postgres itself, just let people use it if they have it. In any case though, the default dataset is 8MB. This includes support for every locale and charset it knows about. If you drop the conversion stuff (because postgres already has that) you're down to about 4MB. Since ICU supports userdefined tables, we could provide a single cross-platform dataset and get the user's ICU library implementation to use that. Portability - ICU runs on all the platforms postgres does, AFAICS. http://dev.icu-project.org/cgi-bin/viewcvs.cgi/icu/readme.html?rev=release-3-4#HowToBuildSupported Performance - ICU is approximatly four times faster than glibc for collation. Even once you include keygen time (including conversion) it comes out about 40% faster. http://icu.sourceforge.net/charts/collation_icu4c_glibc.html ICU is not slow. > Well, the Japanese think that UTF8 is not the solution to all their > worries, so they won't be happy with a UTF8-only solution. Likewise, > those of us who only need single-byte character sets won't be very happy > with being forced to accept multi-byte processing overhead. I've not quite understood the japenese problem with Unicode. My understanding is that it was primarily due to widespread use of broken converters. In any case, ICU appears to beat glibc with single byte encodings, even including the multi-byte conversion. However, the most important point is that people have said they'll take the speed hit if they could get consistant collation. For speed you can always throw more hardware. But no amount of hardware will fix your collation issues. Have a nice day, -- Martijn van Oosterhout http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to > litigate. signature.asc Description: Digital signature
Re: [GENERAL] different sort order in windows and linux version
"Tomi NA" <[EMAIL PROTECTED]> writes: > Basically, it comes down to three possibilities, doesn't it: > 1.) use an existing library > 2.) write a pgsql specific implementation > 3.) forget about it and tend to other issues > Personally, I don't really care if it's 1) or 2): I'm just afraid it's > going to be 3). > Is this a licencing issue (with regard to ICU beeing under the IBM > public licence)? Licensing is a concern --- IBM's appears to be not quite BSD enough. Size and portability of the library are concerns. Performance is a concern. Whether the patch makes the library required or optional is a concern (if required, the portability issue becomes a whole lot more urgent). Loss of existing functionality is a concern --- for instance, if the patch is such that UTF8 becomes the only supported server encoding, it'll probably be rejected forthwith. > A plugin architecture (to get rid of licencing headaches) issue? AFAIK making it a "plugin" won't alleviate anyone's licensing worries. Certainly that's not going to answer if the library is GPL. > To be perfectly honest, I've had to tackle so many problems with > encodings during the years I'd make it punishable by law to use > anything *but* UTF...but I'm not president of the Galaxy yet, Zaphod > is. (-: Well, the Japanese think that UTF8 is not the solution to all their worries, so they won't be happy with a UTF8-only solution. Likewise, those of us who only need single-byte character sets won't be very happy with being forced to accept multi-byte processing overhead. regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [GENERAL] different sort order in windows and linux version
On 7/1/06, Martijn van Oosterhout wrote: On Fri, Jun 30, 2006 at 07:29:12PM +0200, Tomi NA wrote: > If I sound harsh, please excuse me, but I feel like I'm the only one > who thinks these encoding problems (collation, upper/lowercase, > multiple languages in a single database) are serious...nobody seems to > share the sentiment. Ah well... I agree with you, however the resistance (AFAICS) comes mostly from the fact that we would be depending on an external library to do it. I don't think postgres should try doing it itself, given that the unicode character databases are quite large by themselves. Alternativly, the postgres group could produce a customised version of ICU that's smaller (the website has details about how). But any case, this problem will need to be addressed at some point. Basically, it comes down to three possibilities, doesn't it: 1.) use an existing library 2.) write a pgsql specific implementation 3.) forget about it and tend to other issues Personally, I don't really care if it's 1) or 2): I'm just afraid it's going to be 3). Is this a licencing issue (with regard to ICU beeing under the IBM public licence)? A plugin architecture (to get rid of licencing headaches) issue? Are there any other libraries that might do the job? To be perfectly honest, I've had to tackle so many problems with encodings during the years I'd make it punishable by law to use anything *but* UTF...but I'm not president of the Galaxy yet, Zaphod is. (-: t.n.a. ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [GENERAL] different sort order in windows and linux version
On Fri, Jun 30, 2006 at 07:29:12PM +0200, Tomi NA wrote: > If I sound harsh, please excuse me, but I feel like I'm the only one > who thinks these encoding problems (collation, upper/lowercase, > multiple languages in a single database) are serious...nobody seems to > share the sentiment. Ah well... I agree with you, however the resistance (AFAICS) comes mostly from the fact that we would be depending on an external library to do it. I don't think postgres should try doing it itself, given that the unicode character databases are quite large by themselves. Alternativly, the postgres group could produce a customised version of ICU that's smaller (the website has details about how). But any case, this problem will need to be addressed at some point. Have a nice day, -- Martijn van Oosterhout http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to > litigate. signature.asc Description: Digital signature
Re: [GENERAL] different sort order in windows and linux version
On 6/30/06, Martijn van Oosterhout wrote: On Fri, Jun 30, 2006 at 11:56:19AM +0200, Dragan Matic wrote: > I have two postgres servers, one on linux (fedora core 5), one on > windows, both are version 8.1.4. > > Both databases are initialized with locale Croatian and win1250 encoding. > > running pg_controldata on windows returns this > > LC_COLLATE: Croatian_Croatia.1250 > LC_CTYPE:Croatian_Croatia.1250 > > the same command on linux returns this > > LC_COLLATE:hr_HR > LC_CTYPE: hr_HR > > which is the same, I suppose. Well, apparently not. Postgres makes no attempt to understand collations nor try to determine whether they make sense. If you want to have the same collation on Windows and Linux, I think you're going to have trouble. Croatian_Croatia and hr_HR are, in fact, the same in that there is no other collation for the Croatian language. Whatsmore, Dragan ran the test using characters which are encoded exactly the same in cp1250, utf8, iso8859-2, hell, probably even us-ascii. The fact remains that different OSes collate differently, even for the same locale. In C++, people use things like GTK, wxWidgets and GCL so that they could think about "C++ code instead of the platform they're coding on. In Java, people use things like File.separator instead of "\" or "/" so that they could think about "Java code". There are dozens of examples like these and most of the exceptions stem from the influence of the at the time monopoly-holder. When you code in the RDBMS environment, you want to code in terms of pgsql or Oracle or MySQL or whatever: you don't want to program for Oracle on Solaris vs. Oracle on Linux vs. Oracle on Plan9 or...well, you get the idea. Not beeing able to depend on the engine to consistently collate strings as simple as the ones Dragan listed is closer to a serious bug (non-deterministic behaviour in otherwise deterministic functions) than a RFE, but is certainly nowhere near "it's not our problem" as it regularly seems made up to be. The OS(es) simply and obviously do(es)n't do a good enough job of it. In the past there have existed patches to allow postgres to use ICU for locale support. It's supposedly not quite as fast, but you will be able get consistant results across platforms. Personally, I'd be perfectly happy with pgsql if I could choose to make text operations up to 2-3x slower without the fuss of how it's going to work on a certain platform, in each pgsql version. Furthermore, compiling the server myself is not an option for live usage: on my current project, I'm not even the one installing the database servers...sending administrators a binary I configured and compiled (on Windows, in this case!) and noone but me tested...b...I get the shivers just thinking about it. If I sound harsh, please excuse me, but I feel like I'm the only one who thinks these encoding problems (collation, upper/lowercase, multiple languages in a single database) are serious...nobody seems to share the sentiment. Ah well... t.n.a. ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [GENERAL] different sort order in windows and linux version
On Fri, Jun 30, 2006 at 11:56:19AM +0200, Dragan Matic wrote: > I have two postgres servers, one on linux (fedora core 5), one on > windows, both are version 8.1.4. > > Both databases are initialized with locale Croatian and win1250 encoding. > > running pg_controldata on windows returns this > > LC_COLLATE: Croatian_Croatia.1250 > LC_CTYPE:Croatian_Croatia.1250 > > the same command on linux returns this > > LC_COLLATE:hr_HR > LC_CTYPE: hr_HR > > which is the same, I suppose. Well, apparently not. Postgres makes no attempt to understand collations nor try to determine whether they make sense. If you want to have the same collation on Windows and Linux, I think you're going to have trouble. In the past there have existed patches to allow postgres to use ICU for locale support. It's supposedly not quite as fast, but you will be able get consistant results across platforms. Hope this helps, -- Martijn van Oosterhout http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to > litigate. signature.asc Description: Digital signature
[GENERAL] different sort order in windows and linux version
I have two postgres servers, one on linux (fedora core 5), one on windows, both are version 8.1.4. Both databases are initialized with locale Croatian and win1250 encoding. running pg_controldata on windows returns this LC_COLLATE: Croatian_Croatia.1250 LC_CTYPE:Croatian_Croatia.1250 the same command on linux returns this LC_COLLATE:hr_HR LC_CTYPE: hr_HR which is the same, I suppose. the sample databases are both initialized the same way CREATE DATABASE sample WITH OWNER = postgres ENCODING = 'WIN1250' TABLESPACE = pg_default; both databases have the same sample table CREATE TABLE sample ( some_text char(13) NOT NULL ); when I execute a query 'SELECT SOME_TEXT FROM SAMPLE ORDER BY SOME_TEXT' I get different sort order on these two servers. On the left side is windows server sort order, and on the right side is linux server sort order. all values are left padded with spaces. Postgres windows Postgres linux 0 0 1 000 2 0 3 1 4 11 5 12 6 123 7 125 8 13 9 14 a 15 A 2 b 2343255 b 234455 B 243 c 25 C 3 d 31TA001 e 32NU280 f 35 g 4 z 45 11 5 12 55 13 56455 14 6 15 65 25 7 35 75 45 8 55 85 65 9 75 a 85 A aa aa ab aaa aB aab Ab ab AB aB ba Ab bb AB cc aba 000 abA 123 aBa 125 AbA 243 ABA aaa abb aab abb aba abc abA abc aBa abC AbA Abc ABA ABC abb abcde abb acc abc b abc b abC B Abc ba ABC baa acc bab baa bb bab bba bba bbb bbb bca bca c zzz C zzz cc 56455 d abcde e 234455 f 2343255 g 31TA001 z 32NU280 zzz 0 zzz z z The only thing I can think of is that somehow these databases weren't initialized in the same way, if so what is the difference? Tnx in advance Dragan ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match