Re: Wrong sorting on docker image
On Sun, Oct 17, 2021 at 4:42 AM Tom Lane wrote: > Speaking of ICU, if you are using an ICU-enabled Postgres build, > maybe you could find an ICU collation that acts the way you want. > This wouldn't be a perfect solution, because we don't yet have > the ability to set an ICU collation as a database's default. > But you can attach ICU collations to individual text columns, > and maybe that would be a good enough workaround. For what it's worth, ICU's "ru-RU-x-icu" and FreeBSD's libc agree with glibc on these sort orders, so I suspect this might be coming from CLDR/UCA/DUCET/ISO 14651 common/synchronised data. It does look quite suspicious to me, but I don't know Russian and I'm only speculating wildly here: it does look as if ё is perhaps getting a lower weight than it should. That said, it seems strange that something so basic should be wrong. Nosing around in the unicode.org issue tracker, it seems as though some people might think there is something funny about Ё (and I wonder if there are/were similar issues with й/Й): https://unicode-org.atlassian.net/browse/CLDR-2745?jql=text%20~%20%22%D0%81%22 https://unicode-org.atlassian.net/browse/CLDR-1974?jql=text%20~%20%22%D0%81%22 (and more) It's probably not a great idea, but for the record, you can build your own collation for glibc and other POSIX-oid systems. For example, see glibc commit 159738548130d5ac4fe6178977e940ed5f8cfdc4, where they previously had customisations on top of the iso14651_t1 file to reorder a special Ukrainian character in ru_RU, so in theory you could reorder ё/Ё with a similar local hack and call it ru_RU_X... I also wonder if there is some magic switch you can put after an @ symbol on ICU collations that would change this, perhaps some way to disable the "contractions" that are potentially implicated here. Not sure.
Re: "two time periods with only an endpoint in common do not overlap" ???
On 10/15/21 21:54, Ron wrote: There is no straight time range, you would have to use tsrange or tstzrange. The principle still holds though you can make ranges overlap or not depending on '[)' or '[]'. OP refers to the OVERLAP operator (is it an operator), not the tsrange() function. Your statement was: "The numeric ranges 0-10 and 10-19 overlap, just as the time ranges 00:01:00-00:00:02:00 overlaps 00:02:00-00:03:00." I was just pointing out that is not necessarily true. As was pointed out upstream there are good reasons for not having 1:00-2:00 and 2:00-3:00 overlap. As David pointed out it is about following the documented behavior. I still have to remember, on occasion, that BETWEEN actually includes the end points not just points in between them. -- Adrian Klaver adrian.kla...@aklaver.com
Re: Wrong sorting on docker image
Oleksandr Voytsekhovskyy writes: > Starting from version 12.0 official docker image switched from Debian-stretch > to Debian-bullseye and from that point we have a huge pain with sorting > issues on Russian collation. Yeah, Debian versions after stretch adopted the significant glibc locale data changes (sorting rule changes) that are discussed at [1]. > ея should go before ёа I'm not qualified to have an opinion on that point, but one would hope that the glibc people who changed the sorting rules are qualified. If you disagree, you need to go discuss it with glibc. Postgres doesn't define any text sorting rules, we just use what libc or ICU tells us. Speaking of ICU, if you are using an ICU-enabled Postgres build, maybe you could find an ICU collation that acts the way you want. This wouldn't be a perfect solution, because we don't yet have the ability to set an ICU collation as a database's default. But you can attach ICU collations to individual text columns, and maybe that would be a good enough workaround. regards, tom lane [1] https://wiki.postgresql.org/wiki/Locale_data_changes
Re: Wrong sorting on docker image
On 2021-10-16 13:50:31 +0300, Oleksandr Voytsekhovskyy wrote: > Starting from version 12.0 official docker image switched from Debian-stretch > to Debian-bullseye and from that point we have a huge pain with sorting issues > on Russian collation. [...] > Issue: > > postgres=# SELECT * FROM unnest(ARRAY ['ея', 'ёа']) name ORDER BY name; > name > -- > ёа > ея > (2 строки) > > > > еqя should go before ёqа Same with the sort command in the shell. > postgres=# SELECT 'ея' COLLATE "ru_RU" < 'ёа' COLLATE "ru_RU"; > ?column? > -- > f > (1 строка) > > And should be TRUE here > > Any idea how to fix that? Since the collation is defined by the OS (or rather its C library) I think this should be reported to Debian or possibly the glibc maintainers. PostgreSQL can also use the ICU locales instead of those provided by the OS. Have you tried that? hp -- _ | Peter J. Holzer| Story must make more sense than reality. |_|_) || | | | h...@hjp.at |-- Charles Stross, "Creative writing __/ | http://www.hjp.at/ | challenge!" signature.asc Description: PGP signature
Wrong sorting on docker image
Greetings Starting from version 12.0 official docker image switched from Debian-stretch to Debian-bullseye and from that point we have a huge pain with sorting issues on Russian collation. Dockerfile: FROM postgres:14 RUN apt-get clean && apt-get update && apt-get install -y locales RUN localedef -i ru_RU -c -f UTF-8 -A /usr/share/locale/locale.alias ru_RU.UTF-8 ENV LANG ru_RU.utf8 postgres=# select version(); version - PostgreSQL 14.0 (Debian 14.0-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit Issue: postgres=# SELECT * FROM unnest(ARRAY ['ея', 'ёа']) name ORDER BY name; name -- ёа ея (2 строки) еqя should go before ёqа postgres=# SELECT 'ея' COLLATE "ru_RU" < 'ёа' COLLATE "ru_RU"; ?column? -- f (1 строка) And should be TRUE here Any idea how to fix that? We are not able manage this for 3 years already (((
Re: "two time periods with only an endpoint in common do not overlap" ???
On 16/10/21 18:41, David G. Johnston wrote: On Friday, October 15, 2021, Ron wrote: Prima facie, if you were told "numbers in the range 0-10", would you really think, "ah, they *really* mean 0 through 9"? I would indeed default to both endpoints of the range being inclusive. I also begin counting at one, not zero. I’ve long gotten past being surprised when computer science and my defaults don’t agree. Choices are made and documented and that works for me. As for this, documentation I never really gave the wording a second thought before, though I can definitely understand the complaint and like the somewhat wordier, but less linguistically challenging, phrasing the OP suggested (Boundary point, especially by itself, is not an improvement). David J. The reason arrays generally start at zero and not one, is efficiency. When indexes are zero based then the displacement in bytes from the start address of x[n] is simply: startAddress + n * sizeOfElement If the start of an array had the index of one, then you have subtract one each time, so the displacement from the start address of x[n] now becomes startAddress + (n - 1) * sizeOfElement Half open intervals make life a lot simpler so it is the natural default, to prevent intervals from having any numbers in common. If you have 3 intervals spanning the range [0, 30), and you are only dealing with integers then you can split the range as: [0, 9] 0 <= x <= 9 [10, 19] 10 <= x <= 19 [20, 29] 10 <= x <= 29 But what if you are dealing with floats? The above arrangement would not work, as 9.78 would not be in any interval, so you need half open intervals, such as: [0, 10) 0 <= x < 10 [10, 20) 10 <= x < 20 [20, 30) 10 <= x < 30 So you know what number each interval starts at, and every number in the range is covered. -Gavin