Re: Wrong sorting on docker image

2021-10-16 Thread Thomas Munro
On Sun, Oct 17, 2021 at 4:42 AM Tom Lane  wrote:
> Speaking of ICU, if you are using an ICU-enabled Postgres build,
> maybe you could find an ICU collation that acts the way you want.
> This wouldn't be a perfect solution, because we don't yet have
> the ability to set an ICU collation as a database's default.
> But you can attach ICU collations to individual text columns,
> and maybe that would be a good enough workaround.

For what it's worth, ICU's "ru-RU-x-icu" and FreeBSD's libc agree with
glibc on these sort orders, so I suspect this might be coming from
CLDR/UCA/DUCET/ISO 14651 common/synchronised data.  It does look quite
suspicious to me, but I don't know Russian and I'm only speculating
wildly here: it does look as if ё is perhaps getting a lower weight
than it should.  That said, it seems strange that something so basic
should be wrong.  Nosing around in the unicode.org issue tracker, it
seems as though some people might think there is something funny about
Ё (and I wonder if there are/were similar issues with й/Й):

https://unicode-org.atlassian.net/browse/CLDR-2745?jql=text%20~%20%22%D0%81%22
https://unicode-org.atlassian.net/browse/CLDR-1974?jql=text%20~%20%22%D0%81%22
(and more)

It's probably not a great idea, but for the record, you can build your
own collation for glibc and other POSIX-oid systems.  For example, see
glibc commit 159738548130d5ac4fe6178977e940ed5f8cfdc4, where they
previously had customisations on top of the iso14651_t1 file to
reorder a special Ukrainian character in ru_RU, so in theory you could
reorder ё/Ё with a similar local hack and call it ru_RU_X...  I also
wonder if there is some magic switch you can put after an @ symbol on
ICU collations that would change this, perhaps some way to disable the
"contractions" that are potentially implicated here.  Not sure.




Re: "two time periods with only an endpoint in common do not overlap" ???

2021-10-16 Thread Adrian Klaver

On 10/15/21 21:54, Ron wrote:
>> There is no straight time range, you would have to use tsrange or
>> tstzrange. The principle still holds though: you can make ranges
>> overlap or not depending on '[)' or '[]'.
>
> OP refers to the OVERLAPS operator (is it an operator?), not the tsrange()
> function.

Your statement was:

"The numeric ranges 0-10 and 10-19 overlap, just as the time ranges 
00:01:00-00:02:00 overlaps 00:02:00-00:03:00."


I was just pointing out that this is not necessarily true. As was pointed 
out upthread, there are good reasons for not having 1:00-2:00 and 
2:00-3:00 overlap.
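
A minimal sketch of the difference, assuming a stock PostgreSQL session. The
timestamps are arbitrary illustration values. OVERLAPS is documented as
treating each period as the half-open interval start <= time < end, while
range types let you pick the bounds yourself:

```sql
-- OVERLAPS: periods that share only an endpoint do not overlap.
SELECT (TIME '01:00', TIME '02:00') OVERLAPS (TIME '02:00', TIME '03:00');  -- false

-- Range types with half-open bounds '[)': 02:00 belongs only to the second range.
SELECT tsrange(TIMESTAMP '2021-10-16 01:00', TIMESTAMP '2021-10-16 02:00', '[)')
    && tsrange(TIMESTAMP '2021-10-16 02:00', TIMESTAMP '2021-10-16 03:00', '[)');  -- false

-- Same endpoints with closed bounds '[]': 02:00 is in both ranges.
SELECT tsrange(TIMESTAMP '2021-10-16 01:00', TIMESTAMP '2021-10-16 02:00', '[]')
    && tsrange(TIMESTAMP '2021-10-16 02:00', TIMESTAMP '2021-10-16 03:00', '[]');  -- true
```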


As David pointed out, it is about following the documented behavior. I 
still have to remember, on occasion, that BETWEEN actually includes the 
end points, not just the points in between them.
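
For illustration only, a small sketch of that BETWEEN behavior next to the
half-open test written out by hand (the literal values are arbitrary):

```sql
-- BETWEEN is inclusive at both ends: equivalent to 10 >= 0 AND 10 <= 10.
SELECT 10 BETWEEN 0 AND 10;   -- true

-- A half-open check has to be spelled out explicitly.
SELECT 10 >= 0 AND 10 < 10;   -- false
```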


--
Adrian Klaver
adrian.kla...@aklaver.com




Re: Wrong sorting on docker image

2021-10-16 Thread Tom Lane
Oleksandr Voytsekhovskyy  writes:
> Starting from version 12.0 official docker image switched from Debian-stretch 
> to Debian-bullseye and from that point we have a huge pain with sorting 
> issues on Russian collation.

Yeah, Debian versions after stretch adopted the significant glibc locale
data changes (sorting rule changes) that are discussed at [1].

>   ея should go before ёа

I'm not qualified to have an opinion on that point, but one would hope
that the glibc people who changed the sorting rules are qualified.
If you disagree, you need to go discuss it with glibc.  Postgres doesn't
define any text sorting rules, we just use what libc or ICU tells us.

Speaking of ICU, if you are using an ICU-enabled Postgres build,
maybe you could find an ICU collation that acts the way you want.
This wouldn't be a perfect solution, because we don't yet have
the ability to set an ICU collation as a database's default.
But you can attach ICU collations to individual text columns,
and maybe that would be a good enough workaround.
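
A rough sketch of that workaround, assuming an ICU-enabled build in which
initdb imported the "ru-RU-x-icu" collation. The table name "people" is
hypothetical, and whether this collation produces the ordering you want has
to be checked on your own build:

```sql
-- List the Russian ICU collations available in this database.
SELECT collname
FROM pg_collation
WHERE collprovider = 'i' AND collname LIKE 'ru%';

-- Attach an ICU collation to a single text column (hypothetical table).
CREATE TABLE people (
    name text COLLATE "ru-RU-x-icu"
);

-- Or apply the collation per query instead of per column.
SELECT name
FROM unnest(ARRAY['ея', 'ёа']) AS name
ORDER BY name COLLATE "ru-RU-x-icu";
```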

regards, tom lane

[1] https://wiki.postgresql.org/wiki/Locale_data_changes




Re: Wrong sorting on docker image

2021-10-16 Thread Peter J. Holzer
On 2021-10-16 13:50:31 +0300, Oleksandr Voytsekhovskyy wrote:
> Starting from version 12.0 official docker image switched from Debian-stretch
> to Debian-bullseye and from that point we have a huge pain with sorting issues
> on Russian collation.
[...]
> Issue:
> 
> postgres=# SELECT * FROM unnest(ARRAY ['ея', 'ёа']) name ORDER BY name;
>  name 
> --
>  ёа
>  ея
> (2 строки)
> 
> 
> 
>   ея should go before ёа

Same with the sort command in the shell.


> postgres=# SELECT 'ея' COLLATE "ru_RU" < 'ёа' COLLATE "ru_RU";
>  ?column? 
> --
>  f
> (1 строка)
> 
> And should be TRUE here
> 
> Any idea how to fix that?

Since the collation is defined by the OS (or rather its C library)
I think this should be reported to Debian or possibly the glibc
maintainers.

PostgreSQL can also use the ICU locales instead of those provided by the
OS. Have you tried that?

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"




Wrong sorting on docker image

2021-10-16 Thread Oleksandr Voytsekhovskyy
Greetings

Starting from version 12.0 official docker image switched from Debian-stretch 
to Debian-bullseye and from that point we have a huge pain with sorting issues 
on Russian collation.

Dockerfile:

FROM postgres:14
RUN apt-get clean && apt-get update && apt-get install -y locales
RUN localedef -i ru_RU -c -f UTF-8 -A /usr/share/locale/locale.alias ru_RU.UTF-8
ENV LANG ru_RU.utf8

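postgres=# select version();
                                                           version
-----------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 14.0 (Debian 14.0-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit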


Issue:

postgres=# SELECT * FROM unnest(ARRAY ['ея', 'ёа']) name ORDER BY name;
 name 
--
 ёа
 ея
(2 строки)



  ея should go before ёа

postgres=# SELECT 'ея' COLLATE "ru_RU" < 'ёа' COLLATE "ru_RU";
 ?column? 
--
 f
(1 строка)

And should be TRUE here

Any idea how to fix that?

We have not been able to solve this for 3 years already (((




Re: "two time periods with only an endpoint in common do not overlap" ???

2021-10-16 Thread Gavin Flower

On 16/10/21 18:41, David G. Johnston wrote:
> On Friday, October 15, 2021, Ron wrote:
>
>> Prima facie, if you were told "numbers in the range 0-10", would
>> you really think, "ah, they *really* mean 0 through 9"?
>
> I would indeed default to both endpoints of the range being
> inclusive.  I also begin counting at one, not zero.  I’ve long gotten
> past being surprised when computer science and my defaults don’t
> agree.  Choices are made and documented and that works for me.
>
> As for this documentation, I never really gave the wording a second
> thought before, though I can definitely understand the complaint and
> like the somewhat wordier, but less linguistically challenging,
> phrasing the OP suggested (Boundary point, especially by itself, is
> not an improvement).
>
> David J.


The reason arrays generally start at zero and not one, is efficiency.

When indexes are zero based, the address of x[n] is simply:

    startAddress + n * sizeOfElement

If the first element of an array had the index one, then you would have 
to subtract one each time, so the address of x[n] becomes:

    startAddress + (n - 1) * sizeOfElement


Half-open intervals make life a lot simpler, so they are the natural 
default: adjacent intervals never have any numbers in common.


If you have 3 intervals spanning the range [0, 30), and you are only 
dealing with integers then you can split the range as:

[0, 9]      0 <= x <= 9
[10, 19]   10 <= x <= 19
[20, 29]   20 <= x <= 29

But what if you are dealing with floats? The above arrangement would not 
work, as 9.78 would not be in any interval, so you need half open 
intervals, such as:

[0, 10)     0 <= x < 10
[10, 20)   10 <= x < 20
[20, 30)   20 <= x < 30

So you know what number each interval starts at, and every number in the 
range is covered.
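
The same point can be sketched with PostgreSQL's numrange type. This is only
an illustration; the values 9.78 and 10.0 are arbitrary test points:

```sql
-- Closed integer-style bounds leave a gap: 9.78 falls in neither interval.
SELECT 9.78 <@ numrange(0, 9, '[]')   AS in_first_closed,    -- false
       9.78 <@ numrange(10, 19, '[]') AS in_second_closed;   -- false

-- Half-open bounds cover every value exactly once.
SELECT 9.78 <@ numrange(0, 10, '[)')  AS in_first,           -- true
       10.0 <@ numrange(0, 10, '[)')  AS ten_in_first,       -- false
       10.0 <@ numrange(10, 20, '[)') AS ten_in_second;      -- true

-- Neighbouring half-open intervals touch but share no values.
SELECT numrange(0, 10, '[)') && numrange(10, 20, '[)') AS overlap;  -- false
```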



-Gavin