On Fri, 2024-04-05 at 11:22 +1300, Thomas Munro wrote:
> Hi,
>
> +command_ok(
> + [
> + 'initdb', '--no-sync',
> + '--locale-provider=builtin', '-E UTF-8',
> + '--builtin-locale=C.UTF-8', "$tempdir/data8"
> + ],
> + 'locale provider
Hi,
+command_ok(
+ [
+ 'initdb', '--no-sync',
+ '--locale-provider=builtin', '-E UTF-8',
+ '--builtin-locale=C.UTF-8', "$tempdir/data8"
+ ],
+ 'locale provider builtin with -E UTF-8 --builtin-locale=C.UTF-8');
This Sun animal recently
On 01.04.24 21:52, Jeff Davis wrote:
On Tue, 2024-03-26 at 08:04 +0100, Peter Eisentraut wrote:
The patch set v27 is ok with me, modulo (a) discussion about initcap
semantics, and (b) what collation to assign to ucs_basic, which can
be
revisited later.
Attached v28.
The remaining patches are
On Tue, 2024-03-26 at 08:14 +0100, Peter Eisentraut wrote:
>
> Full vs. simple case mapping is more of a legacy compatibility
> question,
> in my mind. There is some expectation/precedent that C.UTF-8 uses
> simple case mapping, but beyond that, I don't see a reason why
> someone
> would want
On Tue, 2024-03-26 at 08:04 +0100, Peter Eisentraut wrote:
> The patch set v27 is ok with me, modulo (a) discussion about initcap
> semantics, and (b) what collation to assign to ucs_basic, which can
> be
> revisited later.
I held off on the refactoring patch for lc_{ctype|collate}_is_c().
On Wed, 2024-03-27 at 16:53 +0100, Daniel Verite wrote:
> provider | isalpha | isdigit
> ----------+---------+---------
> ICU      | f       | t
> glibc    | t       | f
> builtin  | f       | f
The "ICU" above is really the behavior of the Postgres ICU provider as
we implemented it, it's not
Jeff Davis wrote:
> The tests include initcap('123abc') which is '123abc' in the PG_C_UTF8
> collation vs '123Abc' in PG_UNICODE_FAST.
>
> The reason for the latter behavior is that the Unicode Default Case
> Conversion algorithm for toTitlecase() advances to the next Cased
> character
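The difference between the two initcap() results quoted above can be sketched in a few lines of Python. This is only an illustration, not the PostgreSQL implementation: `initcap_alnum_words` and `titlecase_first_cased` are hypothetical helper names, and Python's `str.title()` is used as a stand-in because it also skips ahead to the first cased character.

```python
def initcap_alnum_words(text: str) -> str:
    """Treat each run of alphanumeric characters as a word; map only its
    first character to upper case and lower-case the rest (assumed model
    of the traditional behavior, where a leading digit starts the word)."""
    out = []
    in_word = False
    for ch in text:
        if ch.isalnum():
            out.append(ch.lower() if in_word else ch.upper())
            in_word = True
        else:
            out.append(ch)
            in_word = False
    return "".join(out)


def titlecase_first_cased(text: str) -> str:
    """Stand-in for 'advance to the next cased character': str.title()
    upper-cases the first cased character that follows an uncased one."""
    return text.title()


print(initcap_alnum_words("123abc"))    # 123abc
print(titlecase_first_cased("123abc"))  # 123Abc
```

In the first function the digit "1" counts as the start of the word, so "a" is left alone; in the second, digits are uncased and the first cased character "a" is capitalized.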
On 25.03.24 18:52, Jeff Davis wrote:
OK, I'll propose a "title" or "titlecase" function for 18, along with
"casefold" (which I was already planning to propose).
(Yay, casefold will be useful.)
What do you think about UPPER/LOWER and full case mapping? Should there
be extra arguments for full
On 21.03.24 01:13, Jeff Davis wrote:
The v26 patch was not quite complete, so I didn't commit it yet.
Attached v27-0001 and 0002.
0002 is necessary because otherwise lc_collate_is_c() short-circuits
the version check in pg_newlocale_from_collation(). With 0002, the code
is simpler and all paths
On Mon, 2024-03-25 at 08:29 +0100, Peter Eisentraut wrote:
> Right. I thought when you said there is an ICU configuration for it,
> that it might be like collation options that you specify in the
> locale
> string. But it appears it is only an internal API setting. So that,
> in
> my mind,
On 22.03.24 18:26, Jeff Davis wrote:
On Fri, 2024-03-22 at 15:51 +0100, Peter Eisentraut wrote:
I think this might be too big of a compatibility break. So far,
initcap('123abc') has always returned '123abc'. If the new collation
returns '123Abc' now, then that's quite a change. These are not
There is no technical content in this mail, but I'd like to
show appreciation for your work on this. I hope this will
eventually remove one of the great embarrassments when using
PostgreSQL: the dependency on operating system collations.
Yours,
Laurenz Albe
On Sun, 2024-03-24 at 14:00 +0300, Alexander Lakhin wrote:
> Please look at a Valgrind-detected error caused by the following
> query
> (starting from f69319f2f):
> SELECT lower('Π' COLLATE pg_c_utf8);
Thank you for the report!
Fixed in 503c0ad976.
Valgrind did not detect the problem in my
Hello Jeff,
21.03.2024 03:13, Jeff Davis wrote:
On Tue, 2024-03-19 at 13:41 +0100, Peter Eisentraut wrote:
* v25-0002-Support-C.UTF-8-locale-in-the-new-builtin-collat.patch
Looks ok.
Committed.
Please look at a Valgrind-detected error caused by the following query
(starting from
On Fri, 2024-03-22 at 15:51 +0100, Peter Eisentraut wrote:
> I think this might be too big of a compatibility break. So far,
> initcap('123abc') has always returned '123abc'. If the new collation
> returns '123Abc' now, then that's quite a change. These are not some
> obscure Unicode special
On 21.03.24 01:13, Jeff Davis wrote:
Are there any test cases that illustrate the word boundary changes in
patch 0005? It might be useful to test those against Oracle as well.
The tests include initcap('123abc') which is '123abc' in the PG_C_UTF8
collation vs '123Abc' in PG_UNICODE_FAST.
The
* v25-0001-Address-more-review-comments-on-commit-2d819a08a.patch
This was committed.
* v25-0002-Support-C.UTF-8-locale-in-the-new-builtin-collat.patch
Looks ok.
* v25-0003-Inline-basic-UTF-8-functions.patch
ok
* v25-0004-Use-version-for-builtin-collations.patch
Not sure about the version
Thomas Munro writes:
> On Tue, Mar 19, 2024 at 11:55 AM Tom Lane wrote:
This is causing all CI jobs to fail the "compiler warnings" check.
>>> I did run CI before checkin, and it passed:
> Maybe I misunderstood this exchange but ...
> Currently Windows warnings don't make any CI tasks
On Tue, Mar 19, 2024 at 11:55 AM Tom Lane wrote:
> Jeff Davis writes:
> > On Mon, 2024-03-18 at 18:04 -0400, Tom Lane wrote:
> >> This is causing all CI jobs to fail the "compiler warnings" check.
>
> > I did run CI before checkin, and it passed:
> > https://cirrus-ci.com/build/5382423490330624
Jeff Davis writes:
> On Mon, 2024-03-18 at 18:04 -0400, Tom Lane wrote:
>> This is causing all CI jobs to fail the "compiler warnings" check.
> I did run CI before checkin, and it passed:
> https://cirrus-ci.com/build/5382423490330624
Weird, why did it not report with the same level of urgency?
On Mon, 2024-03-18 at 18:04 -0400, Tom Lane wrote:
> This is causing all CI jobs to fail the "compiler warnings" check.
I did run CI before checkin, and it passed:
https://cirrus-ci.com/build/5382423490330624
If I open up the windows build, I see the warning:
Jeff Davis writes:
> It may be moot soon, but I committed a fix now.
Thanks, but it looks like 846311051 introduced a fresh issue.
MSVC is complaining about
[21:37:15.349] c:\cirrus\src\backend\utils\adt\pg_locale.c(2515) : warning
C4715: 'builtin_locale_encoding': not all control paths return
On Sun, 2024-03-17 at 17:46 -0400, Tom Lane wrote:
> Jeff Davis writes:
> > New series attached.
>
> Coverity thinks there's something wrong with builtin_validate_locale,
> and as far as I can tell it's right: the last ereport is unreachable,
> because required_encoding is never changed from its
Jeff Davis writes:
> New series attached.
Coverity thinks there's something wrong with builtin_validate_locale,
and as far as I can tell it's right: the last ereport is unreachable,
because required_encoding is never changed from its initial -1 value.
It looks like there's a chunk of logic
On Thu, 2024-03-14 at 15:38 +0100, Peter Eisentraut wrote:
> On 14.03.24 09:08, Jeff Davis wrote:
> > 0001 (the C.UTF-8 locale) is also close...
>
> I have tested this against the libc locale C.utf8 that was available
> on
> the OS, and the behavior is consistent.
That was the goal, in spirit.
On 14.03.24 09:08, Jeff Davis wrote:
0001 (the C.UTF-8 locale) is also close. Considering that most of the
infrastructure is already in place, that's not a large patch. You may
have some comments about the way I'm canonicalizing and validating in
initdb -- that could be cleaner, but it feels
On 14.03.24 09:08, Jeff Davis wrote:
On Wed, 2024-03-13 at 00:44 -0700, Jeff Davis wrote:
New series attached. I plan to commit 0001 very soon.
Committed the basic builtin provider, supporting only the "C" locale.
As you were committing this, I had another review of
On 08.03.24 02:00, Jeff Davis wrote:
And here's v22 (I didn't post v21).
I committed Unicode property tables and functions, and the simple case
mapping. I separated out the full case mapping changes (based on
SpecialCasing.txt) into patch 0006.
0002: Basic builtin collation provider that
On 13.02.24 03:01, Jeff Davis wrote:
1. The SQL spec mentions the capitalization of "ß" as "SS"
specifically. Should UCS_BASIC use the unconditional mappings in
SpecialCasing.txt? I already have some code to do that (not posted
yet).
It is my understanding that "correct" Unicode case
On Wed, 2024-02-07 at 10:53 +0100, Peter Eisentraut wrote:
> Various comments are updated to include the term "character class".
> I
> don't recognize that as an official Unicode term. There are
> categories
> and properties. Let's check this.
It's based on
Review of the v16 patch set:
(Btw., I suppose you started this patch series with 0002 because some
0001 was committed earlier. But I have found this rather confusing. I
think it's ok to renumber from 0001 for each new version.)
* v16-0002-Add-Unicode-property-tables.patch
Various comments
On Mon, 2024-01-22 at 19:49 +0100, Peter Eisentraut wrote:
> >
> I don't get this argument. Of course, people care about sorting and
> sort order. Whether you consider this part of Unicode or adjacent to
> it, people still want it.
You said that my proposal sends a message that we somehow
On 18.01.24 23:03, Jeff Davis wrote:
On Thu, 2024-01-18 at 13:53 +0100, Peter Eisentraut wrote:
I think that would be a terrible direction to take, because it would
regress the default sort order from "correct" to "useless".
I don't agree that the current default is "correct". There are a lot
On Thu, 2024-01-18 at 13:53 +0100, Peter Eisentraut wrote:
> I think that would be a terrible direction to take, because it would
> regress the default sort order from "correct" to "useless".
I don't agree that the current default is "correct". There are a lot of
ways it can be wrong:
* the
Peter Eisentraut wrote:
> > If the Postgres default was bytewise sorting+locale-agnostic
> > ctype functions directly derived from Unicode data files,
> > as opposed to libc/$LANG at initdb time, the main
> > annoyance would be that "ORDER BY textcol" would no
> > longer be the
On 12.01.24 03:02, Jeff Davis wrote:
New version attached. Changes:
* Named collation object PG_C_UTF8, which seems like a good idea to
prevent name conflicts with existing collations. The locale name is
still C.UTF-8, which still makes sense to me because it matches the
behavior of the libc
On Mon, 2024-01-15 at 15:30 +0100, Daniel Verite wrote:
> Concerning the target category_test, it produces failures with
> versions of ICU with Unicode < 15. The first one I see with Ubuntu
> 22.04 (ICU 70.1) is:
...
> I find these results interesting because they tell us what contents
> can
Jeff Davis wrote:
> New version attached.
[v16]
Concerning the target category_test, it produces failures with
versions of ICU with Unicode < 15. The first one I see with Ubuntu
22.04 (ICU 70.1) is:
category_test: Postgres Unicode version:15.1
category_test: ICU Unicode
On Fri, Jan 12, 2024 at 01:13:04PM -0500, Robert Haas wrote:
> On Fri, Jan 12, 2024 at 1:00 PM Daniel Verite wrote:
>> ISTM that in general the behavior of old psql vs new server does
>> not weigh much against choosing optimal catalog changes.
>
> +1.
+1. There is a good amount of effort put
On Fri, 2024-01-12 at 19:00 +0100, Daniel Verite wrote:
> Another one is that version 12 broke \d in older psql by
> removing pg_class.relhasoids.
>
> ISTM that in general the behavior of old psql vs new server does
> not weigh much against choosing optimal catalog changes.
>
> There's also
On Fri, Jan 12, 2024 at 1:00 PM Daniel Verite wrote:
> ISTM that in general the behavior of old psql vs new server does
> not weigh much against choosing optimal catalog changes.
+1.
--
Robert Haas
EDB: http://www.enterprisedb.com
Jeff Davis wrote:
> > Jeremy also raised a problem with old versions of psql connecting to
> > a
> > new server: the \l and \dO won't work. Not sure exactly what to do
> > there, but I could work around it by adding a new field rather than
> > renaming (though that's not ideal).
>
> I
On Thu, 2024-01-11 at 18:02 -0800, Jeff Davis wrote:
> Jeremy also raised a problem with old versions of psql connecting to
> a
> new server: the \l and \dO won't work. Not sure exactly what to do
> there, but I could work around it by adding a new field rather than
> renaming (though that's not
On Tue, 2024-01-09 at 14:17 -0800, Jeremy Schneider wrote:
> I think we missed something in psql, pretty sure I applied all the
> patches but I see this error:
>
> =# \l
> ERROR: 42703: column d.datlocale does not exist
> LINE 8: d.datlocale as "Locale",
> ^
> HINT: Perhaps you
On Wed, 2024-01-10 at 23:56 +0100, Daniel Verite wrote:
> $ bin/initdb --locale=C.UTF-8 --locale-provider=builtin -D/tmp/pgdata
>
> The database cluster will be initialized with this locale
> configuration:
> default collation provider: builtin
> default collation locale: C.UTF-8
>
Jeff Davis wrote:
> Attached a more complete version that fixes a few bugs
[v15 patch]
When selecting the builtin provider with initdb, I'm getting the
following setup:
$ bin/initdb --locale=C.UTF-8 --locale-provider=builtin -D/tmp/pgdata
The database cluster will be initialized
On 1/9/24 2:31 PM, Jeff Davis wrote:
> On Tue, 2024-01-09 at 14:17 -0800, Jeremy Schneider wrote:
>> I think we missed something in psql, pretty sure I applied all the
>> patches but I see this error:
>>
>> =# \l
>> ERROR: 42703: column d.datlocale does not exist
>> LINE 8: d.datlocale as
On Mon, 2024-01-08 at 17:17 -0800, Jeremy Schneider wrote:
> I agree with merging the threads, even though it makes for a larger
> patch set. It would be great to get a unified "builtin" provider in
> place for the next major.
I believe that's possible and that this proposal is quite close
On Tue, 2024-01-09 at 14:17 -0800, Jeremy Schneider wrote:
> I think we missed something in psql, pretty sure I applied all the
> patches but I see this error:
>
> =# \l
> ERROR: 42703: column d.datlocale does not exist
> LINE 8: d.datlocale as "Locale",
>
Thank you. I'll fix this in the
On 12/28/23 6:57 PM, Jeff Davis wrote:
> On Wed, 2023-12-27 at 17:26 -0800, Jeff Davis wrote:
> Attached a more complete version that fixes a few bugs, stabilizes the
> tests, and improves the documentation. I optimized the performance, too
> -- now it's beating both libc's "C.utf8" and ICU
On 12/28/23 6:57 PM, Jeff Davis wrote:
>
> Attached a more complete version that fixes a few bugs, stabilizes the
> tests, and improves the documentation. I optimized the performance, too
> -- now it's beating both libc's "C.utf8" and ICU "en-US-x-icu" for both
> collation and case mapping
Robert Haas wrote:
> For someone who is currently defaulting to es_ES.utf8 or fr_FR.utf8,
> a change to C.utf8 would be a much bigger problem, I would
> think. Their alphabet isn't in code point order, and so things would
> be alphabetized wrongly.
> That might be OK if they don't care
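A small Python sketch of the concern quoted above, assuming a made-up list of Spanish words: sorting by code point (which gives the same order as sorting the UTF-8 bytes) pushes the accented letters after "z", whereas a Spanish speaker would expect árbol, burro, nube, ñu, zapato.

```python
# Hypothetical sample data; escapes keep the literals in precomposed form.
words = ["zapato", "\u00f1u", "nube", "\u00e1rbol", "burro"]

# Python compares str values code point by code point.
by_code_point = sorted(words)

# UTF-8 preserves code point order, so a bytewise sort agrees.
by_utf8_bytes = sorted(words, key=lambda w: w.encode("utf-8"))

print(by_code_point)
# ['burro', 'nube', 'zapato', 'árbol', 'ñu']
```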
On Wed, 2023-12-20 at 15:47 -0800, Jeremy Schneider wrote:
> One other thing that comes to mind: how does the parser do case
> folding
> for relation names? Is that using OS-provided libc as of today? Or
> did
> we code it to use ICU if that's the DB default? I'm guessing libc,
> and
> global
On Wed, 2023-12-20 at 16:29 -0800, Jeremy Schneider wrote:
> found some more. here's my running list of everything user-facing I
> see
> in core PG code so far that might involve case:
>
> * upper/lower/initcap
> * regexp_*() and *_REGEXP()
> * ILIKE, operators ~* !~* ~~ !~~ ~~* !~~*
> * citext +
On Wed, Dec 20, 2023 at 5:57 PM Jeff Davis wrote:
> Those locales all have the same ctype behavior.
Sigh. I keep getting confused about how that works...
--
Robert Haas
EDB: http://www.enterprisedb.com
On 12/20/23 4:04 PM, Jeremy Schneider wrote:
> On 12/20/23 3:47 PM, Jeremy Schneider wrote:
>> On 12/5/23 3:46 PM, Jeff Davis wrote:
>>> CTYPE, which handles character classification and upper/lowercasing
>>> behavior, may be simpler than it first appears. We may be able to get
>>> a net decrease
On 12/20/23 3:47 PM, Jeremy Schneider wrote:
> On 12/5/23 3:46 PM, Jeff Davis wrote:
>> CTYPE, which handles character classification and upper/lowercasing
>> behavior, may be simpler than it first appears. We may be able to get
>> a net decrease in complexity by just building in most (or perhaps
On 12/5/23 3:46 PM, Jeff Davis wrote:
> CTYPE, which handles character classification and upper/lowercasing
> behavior, may be simpler than it first appears. We may be able to get
> a net decrease in complexity by just building in most (or perhaps all)
> of the functionality.
>
> === Character
On Wed, 2023-12-20 at 14:24 -0500, Robert Haas wrote:
> This makes sense to me, too, but it feels like it might work out
> better for speakers of English than for speakers of other languages.
There's very little in the way of locale-specific tailoring for ctype
behaviors in ICU or glibc -- only
On Wed, Dec 20, 2023 at 2:13 PM Jeff Davis wrote:
> On Wed, 2023-12-20 at 13:49 +0100, Daniel Verite wrote:
> > If the Postgres default was bytewise sorting+locale-agnostic
> > ctype functions directly derived from Unicode data files,
> > as opposed to libc/$LANG at initdb time, the main
> >
On Wed, 2023-12-20 at 13:49 +0100, Daniel Verite wrote:
> If the Postgres default was bytewise sorting+locale-agnostic
> ctype functions directly derived from Unicode data files,
> as opposed to libc/$LANG at initdb time, the main
> annoyance would be that "ORDER BY textcol" would no
> longer be
Jeff Davis wrote:
> But there are a lot of users for whom neither of those things are true,
> and it makes zero sense to order all of the text indexes in the
> database according to any one particular locale. I think these users
> would prioritize stability and performance for the
On Tue, 2023-12-19 at 15:59 -0500, Robert Haas wrote:
> FWIW, the idea that we're going to develop a built-in provider seems
> to be solid, for the reasons Jeff mentions: it can be stable, and
> under our control. But it seems like we might need built-in providers
> for everything rather than just
On Mon, Dec 18, 2023 at 2:46 PM Jeff Davis wrote:
> The whole concept of "providers" is that they aren't consistent with
> each other. ICU, libc, and the builtin provider will all be based on
> different versions of Unicode. That's by design.
>
> The built-in provider will be a bit better in the
On Fri, 2023-12-15 at 16:30 -0800, Jeremy Schneider wrote:
> Looking closer, patches 3 and 4 look like an incremental extension of
> this earlier idea;
Yes, it's essentially the same thing extended to a few more files. I
don't know if "incremental" is the right word though; this is a
substantial
On 12/13/23 5:28 AM, Jeff Davis wrote:
> On Tue, 2023-12-12 at 13:14 -0800, Jeremy Schneider wrote:
>> My biggest concern is around maintenance. Every year Unicode is
>> assigning new characters to existing code points, and those existing
>> code points can of course already be stored in old
On Wed, 2023-12-13 at 16:34 +0100, Daniel Verite wrote:
> In particular "el" (modern greek) has case mapping rules that
> ICU seems to implement, but "el" is missing from the list
> ("lt", "tr", and "az") you identified.
I compared with glibc el_GR.UTF-8 and el_CY.UTF-8 locales, and the
ctype
On Wed, 2023-12-13 at 16:34 +0100, Daniel Verite wrote:
> But there are CLDR mappings on top of that.
I see, thank you.
Would it still be called "full" case mapping to only use the mappings
in SpecialCasing.txt? And would that be useful?
Regards,
Jeff Davis
Jeff Davis wrote:
> While "full" case mapping sounds more complex, there are actually
> very few cases to consider and they are covered in another (small)
> data file. That data file covers ~100 code points that convert to
> multiple code points when the case changes (e.g. "ß" -> "SS"), 7
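To illustrate the "ß" -> "SS" example, here is a rough Python sketch. Python's `str.upper()` applies full case mapping; `upper_one_to_one` is a hypothetical helper that only approximates simple mapping by refusing any expansion to multiple code points. It is not the mapping from UnicodeData.txt and will differ from it for some characters.

```python
def upper_full(text: str) -> str:
    """Full case mapping: one code point may become several."""
    return text.upper()


def upper_one_to_one(text: str) -> str:
    """Approximation of simple mapping (assumption for illustration):
    keep the original character whenever the full mapping would expand
    it, so the result always has as many code points as the input."""
    out = []
    for ch in text:
        mapped = ch.upper()
        out.append(mapped if len(mapped) == 1 else ch)
    return "".join(out)


word = "stra\u00dfe"
print(upper_full(word))        # STRASSE
print(upper_one_to_one(word))  # STRAßE
```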
On Tue, 2023-12-12 at 13:14 -0800, Jeremy Schneider wrote:
> My biggest concern is around maintenance. Every year Unicode is
> assigning new characters to existing code points, and those existing
> code points can of course already be stored in old databases before
> libs
> are updated.
Is the
On 12/5/23 3:46 PM, Jeff Davis wrote:
> === Character Classification ===
>
> Character classification is used for regexes, e.g. whether a character
> is a member of the "[[:digit:]]" ("\d") or "[[:punct:]]"
> class. Unicode defines what character properties map into these
> classes in TR #18 [1],
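As a small illustration of how a digit class can depend on Unicode properties, the Python sketch below uses the standard `re` and `unicodedata` modules. It shows Python's behavior only, not PostgreSQL's regex engine; the sample character (ARABIC-INDIC DIGIT THREE) is an arbitrary choice.

```python
import re
import unicodedata

sample = "\u0663"  # ARABIC-INDIC DIGIT THREE

# General category Nd (decimal number) is what puts it in the digit class.
category = unicodedata.category(sample)

# For str patterns, \d matches any Unicode decimal digit by default...
unicode_match = re.fullmatch(r"\d", sample) is not None

# ...while re.ASCII restricts \d to [0-9].
ascii_match = re.fullmatch(r"\d", sample, flags=re.ASCII) is not None

print(category, unicode_match, ascii_match)  # Nd True False
```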
71 matches