Re: insensitive collations

2021-04-03 Thread Daniel Verite
Jim Finnerty wrote: > SET client_encoding = WIN1252; > [...] > postgres=# SELECT * FROM locations WHERE location LIKE 'Franche-Comt__'; -- > the wildcard is applied byte by byte instead of character by character, so > the 2-byte accented character is matched only by 2 '_'s >location

Re: insensitive collations

2021-03-25 Thread Daniel Verite
Jim Finnerty wrote: > Currently nondeterministic collations are disabled at the database level. Deterministic ICU collations are also disabled. > The cited reason was because of the lack of LIKE support and because certain > catalog views use LIKE. But the catalogs shouldn't use the d

Re: insensitive collations

2021-03-25 Thread Daniel Verite
Jim Finnerty wrote: > For a UTF8 encoded, case-insensitive (nondeterministic), accent-sensitive > ICU > collation, a LIKE predicate can be used with a small transformation of the > predicate, and the pattern can contain multi-byte characters: > > from: > > SELECT * FROM locations WHERE l

Re: insensitive collations

2019-03-22 Thread Peter Eisentraut
On 2019-03-18 00:19, Peter Eisentraut wrote: > On 2019-03-11 21:36, Peter Eisentraut wrote: >> Patches here. This will allow all the existing collation customization >> options as well as the ones being proposed in this thread to work in >> older ICU versions. > > This has been committed, and her

Re: insensitive collations

2019-03-11 Thread Peter Eisentraut
On 2019-03-08 11:09, Peter Eisentraut wrote: > On 2019-03-07 20:04, Daniel Verite wrote: >> With previous versions, we'd need to call ucol_setAttribute(), >> with the attributes and values defined here: >> http://icu-project.org/apiref/icu4c/ucol_8h.html >> for instance to get colStrength=secondary

Re: insensitive collations

2019-03-08 Thread Peter Eisentraut
On 2019-03-07 20:04, Daniel Verite wrote: > With previous versions, we'd need to call ucol_setAttribute(), > with the attributes and values defined here: > http://icu-project.org/apiref/icu4c/ucol_8h.html > for instance to get colStrength=secondary: > ucol_setAttribute(coll, UCOL_STRENGTH , UCOL_

Re: insensitive collations

2019-03-07 Thread Daniel Verite
Peter Eisentraut wrote: > The problem is not the syntax but that the older ICU versions don't > support the *functionality* of ks-level2 or colStrength=secondary. If > you try it, you will simply get a normal case-sensitive behavior. My bad, I see now that the "old locale extension synta

Re: insensitive collations

2019-03-06 Thread Peter Eisentraut
On 2019-03-05 18:48, Daniel Verite wrote: >> Older ICU versions (<54) don't support all the locale customization >> options, so many of my new tests in collate.icu.utf8.sql will fail on >> older systems. What should we do about that? Have another extra test file? > Maybe stick to the old-style sy

Re: insensitive collations

2019-03-05 Thread Daniel Verite
Peter Eisentraut wrote: > Older ICU versions (<54) don't support all the locale customization > options, so many of my new tests in collate.icu.utf8.sql will fail on > older systems. What should we do about that? Have another extra test file? Maybe stick to the old-style syntax for the

Re: insensitive collations

2019-03-04 Thread Daniel Verite
Peter Eisentraut wrote: [v7-0001-Collations-with-nondeterministic-comparison.patch] +GenericMatchText(const char *s, int slen, const char *p, int plen, Oid collation) { + if (collation && !lc_ctype_is_c(collation) && collation != DEFAULT_COLLATION_OID) + { +pg_locale_tlocale =

Re: insensitive collations

2019-02-21 Thread Peter Eisentraut
On 2019-02-21 03:17, Peter Geoghegan wrote: > I wonder if it would be better to break this into distinct commits? I thought about that. Especially the planner/executor changes could be done separately, sort of as a way to address the thread "ExecBuildGroupingEqual versus collations". But I'm not

Re: insensitive collations

2019-02-20 Thread Peter Geoghegan
On Tue, Feb 19, 2019 at 6:47 AM Peter Eisentraut wrote: > Another patch to address merge conflicts. Some remarks on this: * Your draft commit message says: > This patch makes changes in three areas: > > - CREATE COLLATION DDL changes and system catalog changes to support > this new flag. > >

Re: insensitive collations

2019-01-30 Thread Daniel Verite
Peter Eisentraut wrote: > Another patch. +ks key), in order for such such collations to act in a s/such such/such/ + +The pattern matching operators of all three kinds do not support +nondeterministic collations. If required, apply a different collation to +the expr

Re: insensitive collations

2019-01-25 Thread Peter Eisentraut
On 16/01/2019 21:50, Peter Eisentraut wrote: > On 16/01/2019 14:20, Daniel Verite wrote: >> I've found another issue with aggregates over distinct: >> the deduplication seems to ignore the collation. > > I have a fix for that. I'll send it with the next update. Another patch. This fixes your is

Re: insensitive collations

2019-01-16 Thread Peter Eisentraut
On 16/01/2019 14:20, Daniel Verite wrote: > I've found another issue with aggregates over distinct: > the deduplication seems to ignore the collation. I have a fix for that. I'll send it with the next update. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 2

Re: insensitive collations

2019-01-16 Thread Peter Eisentraut
On 14/01/2019 15:37, Andreas Karlsson wrote: >> Nondeterministic collations do address this by allowing canonically >> equivalent code point sequences to compare as equal. You still need a >> collation implementation that actually does compare them as equal; ICU >> does this, glibc does not AFAICT

Re: insensitive collations

2019-01-16 Thread Daniel Verite
Peter Eisentraut wrote: > > On a table with pre-existing contents, the creation of a unique index > > does not seem to detect the duplicates that are equal per the > > collation and different binary-wise. > > Fixed in the attached updated patch. Check. I've found another issue with aggre

Re: insensitive collations

2019-01-15 Thread Peter Eisentraut
On 14/01/2019 13:23, Daniel Verite wrote: > On a table with pre-existing contents, the creation of a unique index > does not seem to detect the duplicates that are equal per the > collation and different binary-wise. Fixed in the attached updated patch. -- Peter Eisentraut http://ww

Re: insensitive collations

2019-01-14 Thread Daniel Verite
Andreas Karlsson wrote: > > Nondeterministic collations do address this by allowing canonically > > equivalent code point sequences to compare as equal. You still need a > > collation implementation that actually does compare them as equal; ICU > > does this, glibc does not AFAICT. > > A

Re: insensitive collations

2019-01-14 Thread Andreas Karlsson
On 1/10/19 8:44 AM, Peter Eisentraut wrote: On 09/01/2019 19:49, Andreas Karlsson wrote: Maybe this is orthogonal and best handled elsewhere but have you when working with string equality given unicode normalization forms[1] any thought? Nondeterministic collations do address this by allowing

Re: insensitive collations

2019-01-14 Thread Daniel Verite
Peter Eisentraut wrote: > Here is an updated patch. On a table with pre-existing contents, the creation of a unique index does not seem to detect the duplicates that are equal per the collation and different binary-wise. postgres=# \d test3ci Table "public.test3ci" Colum

Re: insensitive collations

2019-01-09 Thread Peter Eisentraut
On 09/01/2019 22:01, Daniel Verite wrote: >> I don't see anything wrong here. The collation says that both values >> are equal, so which one is returned is implementation-dependent. > Is it, but it's impractical if the product of seemingly the same GROUP BY > flip-flops between its different valid

Re: insensitive collations

2019-01-09 Thread Peter Eisentraut
On 09/01/2019 19:49, Andreas Karlsson wrote: > On 12/28/18 9:55 AM, Peter Eisentraut wrote: >> Here is an updated patch. >> >> I have updated the naming to "deterministic", as discussed. > > Maybe this is orthogonal and best handled elsewhere but have you when > working with string equality given

Re: insensitive collations

2019-01-09 Thread Daniel Verite
Peter Eisentraut wrote: > > =# select n from (values ('été' collate "myfr"), ('ete')) x(n) > > group by 1 order by 1 ; > > n > > - > > ete > > (1 row) > > > > =# select n from (values ('été' collate "myfr"), ('ete')) x(n) > > group by 1 order by 1 desc; > > n > > - >

Re: insensitive collations

2019-01-09 Thread Andreas Karlsson
On 12/28/18 9:55 AM, Peter Eisentraut wrote: Here is an updated patch. I have updated the naming to "deterministic", as discussed. Maybe this is orthogonal and best handled elsewhere but have you when working with string equality given unicode normalization forms[1] any thought? I feel there

Re: insensitive collations

2019-01-05 Thread Peter Eisentraut
On 04/01/2019 17:05, Daniel Verite wrote: > When using GROUP BY and ORDER BY on a field with a non-deterministic > collation, this pops out: > > CREATE COLLATION myfr (locale='fr-u-ks-level1', > provider='icu', deterministic=false); > > =# select n from (values ('été' collate "myfr"), ('ete'))

Re: insensitive collations

2019-01-04 Thread Daniel Verite
Peter Eisentraut wrote: > Here is an updated patch. When using GROUP BY and ORDER BY on a field with a non-deterministic collation, this pops out: CREATE COLLATION myfr (locale='fr-u-ks-level1', provider='icu', deterministic=false); =# select n from (values ('été' collate "myfr"), ('

Re: insensitive collations

2018-12-28 Thread Peter Eisentraut
Here is an updated patch. I have updated the naming to "deterministic", as discussed. I have fixed up support for the "name" type, added foreign key support, psql, pg_dump support, more tests. There are a couple of TODOs in bpchar support that I need to look into a bit more. But other than that

Re: insensitive collations

2018-12-20 Thread Daniel Verite
Tom Lane wrote: > I don't really find it "natural" for equality to consider obviously > distinct values to be equal. According to https://www.merriam-webster.com/dictionary/natural "natural" has no less than 15 meanings. The first in the list is "based on an inherent sense of right and

Re: insensitive collations

2018-12-19 Thread Tom Lane
Peter Geoghegan writes: > The Unicode consortium calls our current behavior within comparisons > "deterministic comparisons" -- it's something they're not so keen on: > https://unicode.org/reports/tr10/#Deterministic_Comparison > I suggest using their terminology for our current behavior. Hm, it'

Re: insensitive collations

2018-12-19 Thread Peter Geoghegan
On Wed, Dec 19, 2018 at 6:36 AM Tom Lane wrote: > I don't really find it "natural" for equality to consider obviously > distinct values to be equal. The Unicode consortium calls our current behavior within comparisons "deterministic comparisons" -- it's something they're not so keen on: https://

Re: insensitive collations

2018-12-19 Thread Tom Lane
"Daniel Verite" writes: > To me it seems more natural to find a name for the other behavior, the > one that consists of overwriting the locale-sensitive equality with > the result of the byte-wise comparison. AFAIU the flag is meant > to say: "don't do that". > Some ideas that come to mind: > as

Re: insensitive collations

2018-12-19 Thread Daniel Verite
Peter Eisentraut wrote: > So this doesn't actually make the collation case-insensitive or > anything, it just allows a library-provided collation that is, say, > case-insensitive to actually work that way. That's great news! > So maybe "insensitive" isn't the right name for this flag, bu

insensitive collations

2018-12-18 Thread Peter Eisentraut
With various patches and discussions around collations going on, I figured I'd send in my in-progress patch for insensitive collations. This adds a flag "insensitive" to collations. Such a collation disables various optimizations that assume that strings are equal only if the