Re: Built-in CTYPE provider

2024-04-04 Thread Jeff Davis

On Fri, 2024-04-05 at 11:22 +1300, Thomas Munro wrote: > Hi, > > +command_ok( > + [ > + 'initdb', '--no-sync', > + '--locale-provider=builtin', '-E UTF-8', > + '--builtin-locale=C.UTF-8', "$tempdir/data8" > + ], > + 'locale provider

Re: Built-in CTYPE provider

2024-04-04 Thread Thomas Munro

Hi, +command_ok( + [ + 'initdb', '--no-sync', + '--locale-provider=builtin', '-E UTF-8', + '--builtin-locale=C.UTF-8', "$tempdir/data8" + ], + 'locale provider builtin with -E UTF-8 --builtin-locale=C.UTF-8'); This Sun animal recently

Re: Built-in CTYPE provider

2024-04-04 Thread Peter Eisentraut

On 01.04.24 21:52, Jeff Davis wrote: On Tue, 2024-03-26 at 08:04 +0100, Peter Eisentraut wrote: The patch set v27 is ok with me, modulo (a) discussion about initcap semantics, and (b) what collation to assign to ucs_basic, which can be revisited later. Attached v28. The remaining patches are

Re: Built-in CTYPE provider

2024-04-03 Thread Jeff Davis

On Tue, 2024-03-26 at 08:14 +0100, Peter Eisentraut wrote: > > Full vs. simple case mapping is more of a legacy compatibility > question, > in my mind. There is some expectation/precedent that C.UTF-8 uses > simple case mapping, but beyond that, I don't see a reason why > someone > would want

Re: Built-in CTYPE provider

2024-03-27 Thread Jeff Davis

On Tue, 2024-03-26 at 08:04 +0100, Peter Eisentraut wrote: > The patch set v27 is ok with me, modulo (a) discussion about initcap > semantics, and (b) what collation to assign to ucs_basic, which can > be > revisited later. I held off on the refactoring patch for lc_{ctype|collate}_is_c().

Re: Built-in CTYPE provider

2024-03-27 Thread Jeff Davis

On Wed, 2024-03-27 at 16:53 +0100, Daniel Verite wrote:
> provider | isalpha | isdigit
> ---------+---------+---------
> ICU      | f       | t
> glibc    | t       | f
> builtin  | f       | f
The "ICU" above is really the behavior of the Postgres ICU provider as we implemented it, it's not

Re: Built-in CTYPE provider

2024-03-27 Thread Daniel Verite

Jeff Davis wrote: > The tests include initcap('123abc') which is '123abc' in the PG_C_UTF8 > collation vs '123Abc' in PG_UNICODE_FAST. > > The reason for the latter behavior is that the Unicode Default Case > Conversion algorithm for toTitlecase() advances to the next Cased > character
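The two initcap() results quoted above can be reproduced with a small Python sketch. This is not PostgreSQL's implementation: the function names are hypothetical, words are assumed to be runs of alphanumeric characters, and "cased" is approximated by checking whether a character's upper and lower forms differ.

```python
import re

# Assumption: a "word" is a maximal run of alphanumeric characters.
_WORD = re.compile(r"[^\W_]+")


def initcap_first_char(text):
    """Uppercase only the first character of each word, lowercase the rest.

    A leading digit has no uppercase form, so '123abc' is left unchanged.
    """
    def fix(match):
        word = match.group(0)
        return word[0].upper() + word[1:].lower()
    return _WORD.sub(fix, text)


def initcap_first_cased(text):
    """Skip ahead to the first cased character of each word and uppercase it.

    This mimics the 'advance to the next cased character' step described
    in the quoted mail; everything after that character is lowercased.
    """
    def fix(match):
        word = match.group(0)
        for i, ch in enumerate(word):
            if ch.upper() != ch.lower():  # crude test for "cased"
                return word[:i] + ch.upper() + word[i + 1:].lower()
        return word  # no cased character in this word
    return _WORD.sub(fix, text)


if __name__ == "__main__":
    print(initcap_first_char("123abc"))   # 123abc
    print(initcap_first_cased("123abc"))  # 123Abc
```

The first function corresponds to the result reported for PG_C_UTF8, the second to the result reported for PG_UNICODE_FAST.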

Re: Built-in CTYPE provider

2024-03-26 Thread Peter Eisentraut

On 25.03.24 18:52, Jeff Davis wrote: OK, I'll propose a "title" or "titlecase" function for 18, along with "casefold" (which I was already planning to propose). (Yay, casefold will be useful.) What do you think about UPPER/LOWER and full case mapping? Should there be extra arguments for full

Re: Built-in CTYPE provider

2024-03-26 Thread Peter Eisentraut

On 21.03.24 01:13, Jeff Davis wrote: The v26 patch was not quite complete, so I didn't commit it yet. Attached v27-0001 and 0002. 0002 is necessary because otherwise lc_collate_is_c() short-circuits the version check in pg_newlocale_from_collation(). With 0002, the code is simpler and all paths

Re: Built-in CTYPE provider

2024-03-25 Thread Jeff Davis

On Mon, 2024-03-25 at 08:29 +0100, Peter Eisentraut wrote: > Right. I thought when you said there is an ICU configuration for it, > that it might be like collation options that you specify in the > locale > string. But it appears it is only an internal API setting. So that, > in > my mind,

Re: Built-in CTYPE provider

2024-03-25 Thread Peter Eisentraut

On 22.03.24 18:26, Jeff Davis wrote: On Fri, 2024-03-22 at 15:51 +0100, Peter Eisentraut wrote: I think this might be too big of a compatibility break. So far, initcap('123abc') has always returned '123abc'. If the new collation returns '123Abc' now, then that's quite a change. These are not

Re: Built-in CTYPE provider

2024-03-25 Thread Laurenz Albe

There is no technical content in this mail, but I'd like to show appreciation for your work on this. I hope this will eventually remove one of the great embarrassments when using PostgreSQL: the dependency on operating system collations. Yours, Laurenz Albe

Re: Built-in CTYPE provider

2024-03-24 Thread Jeff Davis

On Sun, 2024-03-24 at 14:00 +0300, Alexander Lakhin wrote: > Please look at a Valgrind-detected error caused by the following > query > (starting from f69319f2f): > SELECT lower('Π' COLLATE pg_c_utf8); Thank you for the report! Fixed in 503c0ad976. Valgrind did not detect the problem in my

Re: Built-in CTYPE provider

2024-03-24 Thread Alexander Lakhin

Hello Jeff, 21.03.2024 03:13, Jeff Davis wrote: On Tue, 2024-03-19 at 13:41 +0100, Peter Eisentraut wrote: * v25-0002-Support-C.UTF-8-locale-in-the-new-builtin-collat.patch Looks ok. Committed. Please look at a Valgrind-detected error caused by the following query (starting from

Re: Built-in CTYPE provider

2024-03-22 Thread Jeff Davis

On Fri, 2024-03-22 at 15:51 +0100, Peter Eisentraut wrote: > I think this might be too big of a compatibility break. So far, > initcap('123abc') has always returned '123abc'. If the new collation > returns '123Abc' now, then that's quite a change. These are not some > obscure Unicode special

Re: Built-in CTYPE provider

2024-03-22 Thread Peter Eisentraut

On 21.03.24 01:13, Jeff Davis wrote: Are there any test cases that illustrate the word boundary changes in patch 0005? It might be useful to test those against Oracle as well. The tests include initcap('123abc') which is '123abc' in the PG_C_UTF8 collation vs '123Abc' in PG_UNICODE_FAST. The

Re: Built-in CTYPE provider

2024-03-19 Thread Peter Eisentraut

* v25-0001-Address-more-review-comments-on-commit-2d819a08a.patch This was committed. * v25-0002-Support-C.UTF-8-locale-in-the-new-builtin-collat.patch Looks ok. * v25-0003-Inline-basic-UTF-8-functions.patch ok * v25-0004-Use-version-for-builtin-collations.patch Not sure about the version

Re: Built-in CTYPE provider

2024-03-18 Thread Tom Lane

Thomas Munro writes: > On Tue, Mar 19, 2024 at 11:55 AM Tom Lane wrote: This is causing all CI jobs to fail the "compiler warnings" check. >>> I did run CI before checkin, and it passed: > Maybe I misunderstood this exchange but ... > Currently Windows warnings don't make any CI tasks

Re: Built-in CTYPE provider

2024-03-18 Thread Thomas Munro

On Tue, Mar 19, 2024 at 11:55 AM Tom Lane wrote: > Jeff Davis writes: > > On Mon, 2024-03-18 at 18:04 -0400, Tom Lane wrote: > >> This is causing all CI jobs to fail the "compiler warnings" check. > > > I did run CI before checkin, and it passed: > > https://cirrus-ci.com/build/5382423490330624

Re: Built-in CTYPE provider

2024-03-18 Thread Tom Lane

Jeff Davis writes: > On Mon, 2024-03-18 at 18:04 -0400, Tom Lane wrote: >> This is causing all CI jobs to fail the "compiler warnings" check. > I did run CI before checkin, and it passed: > https://cirrus-ci.com/build/5382423490330624 Weird, why did it not report with the same level of urgency?

Re: Built-in CTYPE provider

2024-03-18 Thread Jeff Davis

On Mon, 2024-03-18 at 18:04 -0400, Tom Lane wrote: > This is causing all CI jobs to fail the "compiler warnings" check. I did run CI before checkin, and it passed: https://cirrus-ci.com/build/5382423490330624 If I open up the windows build, I see the warning:

Re: Built-in CTYPE provider

2024-03-18 Thread Tom Lane

Jeff Davis writes: > It may be moot soon, but I committed a fix now. Thanks, but it looks like 846311051 introduced a fresh issue. MSVC is complaining about [21:37:15.349] c:\cirrus\src\backend\utils\adt\pg_locale.c(2515) : warning C4715: 'builtin_locale_encoding': not all control paths return

Re: Built-in CTYPE provider

2024-03-18 Thread Jeff Davis

On Sun, 2024-03-17 at 17:46 -0400, Tom Lane wrote: > Jeff Davis writes: > > New series attached. > > Coverity thinks there's something wrong with builtin_validate_locale, > and as far as I can tell it's right: the last ereport is unreachable, > because required_encoding is never changed from its

Re: Built-in CTYPE provider

2024-03-17 Thread Tom Lane

Jeff Davis writes: > New series attached. Coverity thinks there's something wrong with builtin_validate_locale, and as far as I can tell it's right: the last ereport is unreachable, because required_encoding is never changed from its initial -1 value. It looks like there's a chunk of logic

Re: Built-in CTYPE provider

2024-03-14 Thread Jeff Davis

On Thu, 2024-03-14 at 15:38 +0100, Peter Eisentraut wrote: > On 14.03.24 09:08, Jeff Davis wrote: > > 0001 (the C.UTF-8 locale) is also close... > > I have tested this against the libc locale C.utf8 that was available > on > the OS, and the behavior is consistent. That was the goal, in spirit.

Re: Built-in CTYPE provider

2024-03-14 Thread Peter Eisentraut

On 14.03.24 09:08, Jeff Davis wrote: 0001 (the C.UTF-8 locale) is also close. Considering that most of the infrastructure is already in place, that's not a large patch. You may have some comments about the way I'm canonicalizing and validating in initdb -- that could be cleaner, but it feels

Re: Built-in CTYPE provider

2024-03-14 Thread Peter Eisentraut

On 14.03.24 09:08, Jeff Davis wrote: On Wed, 2024-03-13 at 00:44 -0700, Jeff Davis wrote: New series attached. I plan to commit 0001 very soon. Committed the basic builtin provider, supporting only the "C" locale. As you were committing this, I had another review of

Re: Built-in CTYPE provider

2024-03-12 Thread Peter Eisentraut

On 08.03.24 02:00, Jeff Davis wrote: And here's v22 (I didn't post v21). I committed Unicode property tables and functions, and the simple case mapping. I separated out the full case mapping changes (based on SpecialCasing.txt) into patch 0006. 0002: Basic builtin collation provider that

Re: Built-in CTYPE provider

2024-02-12 Thread Peter Eisentraut

On 13.02.24 03:01, Jeff Davis wrote: 1. The SQL spec mentions the capitalization of "ß" as "SS" specifically. Should UCS_BASIC use the unconditional mappings in SpecialCasing.txt? I already have some code to do that (not posted yet). It is my understanding that "correct" Unicode case

Re: Built-in CTYPE provider

2024-02-12 Thread Jeff Davis

On Wed, 2024-02-07 at 10:53 +0100, Peter Eisentraut wrote: > Various comments are updated to include the term "character class". > I > don't recognize that as an official Unicode term. There are > categories > and properties. Let's check this. It's based on

Re: Built-in CTYPE provider

2024-02-07 Thread Peter Eisentraut

Review of the v16 patch set: (Btw., I suppose you started this patch series with 0002 because some 0001 was committed earlier. But I have found this rather confusing. I think it's ok to renumber from 0001 for each new version.) * v16-0002-Add-Unicode-property-tables.patch Various comments

Re: Built-in CTYPE provider

2024-01-22 Thread Jeff Davis

On Mon, 2024-01-22 at 19:49 +0100, Peter Eisentraut wrote: > > > I don't get this argument. Of course, people care about sorting and > sort order. Whether you consider this part of Unicode or adjacent to > it, people still want it. You said that my proposal sends a message that we somehow

Re: Built-in CTYPE provider

2024-01-22 Thread Peter Eisentraut

On 18.01.24 23:03, Jeff Davis wrote: On Thu, 2024-01-18 at 13:53 +0100, Peter Eisentraut wrote: I think that would be a terrible direction to take, because it would regress the default sort order from "correct" to "useless". I don't agree that the current default is "correct". There are a lot

Re: Built-in CTYPE provider

2024-01-18 Thread Jeff Davis

On Thu, 2024-01-18 at 13:53 +0100, Peter Eisentraut wrote: > I think that would be a terrible direction to take, because it would > regress the default sort order from "correct" to "useless". I don't agree that the current default is "correct". There are a lot of ways it can be wrong: * the

Re: Built-in CTYPE provider

2024-01-18 Thread Daniel Verite

Peter Eisentraut wrote: > > If the Postgres default was bytewise sorting+locale-agnostic > > ctype functions directly derived from Unicode data files, > > as opposed to libc/$LANG at initdb time, the main > > annoyance would be that "ORDER BY textcol" would no > > longer be the

Re: Built-in CTYPE provider

2024-01-18 Thread Peter Eisentraut

On 12.01.24 03:02, Jeff Davis wrote: New version attached. Changes: * Named collation object PG_C_UTF8, which seems like a good idea to prevent name conflicts with existing collations. The locale name is still C.UTF-8, which still makes sense to me because it matches the behavior of the libc

Re: Built-in CTYPE provider

2024-01-15 Thread Jeff Davis

On Mon, 2024-01-15 at 15:30 +0100, Daniel Verite wrote: > Concerning the target category_test, it produces failures with > versions of ICU with Unicode < 15. The first one I see with Ubuntu > 22.04 (ICU 70.1) is: ... > I find these results interesting because they tell us what contents > can

Re: Built-in CTYPE provider

2024-01-15 Thread Daniel Verite

Jeff Davis wrote: > New version attached. [v16] Concerning the target category_test, it produces failures with versions of ICU with Unicode < 15. The first one I see with Ubuntu 22.04 (ICU 70.1) is: category_test: Postgres Unicode version:15.1 category_test: ICU Unicode

Re: Built-in CTYPE provider

2024-01-14 Thread Michael Paquier

On Fri, Jan 12, 2024 at 01:13:04PM -0500, Robert Haas wrote: > On Fri, Jan 12, 2024 at 1:00 PM Daniel Verite wrote: >> ISTM that in general the behavior of old psql vs new server does >> not weigh much against choosing optimal catalog changes. > > +1. +1. There is a good amount of effort put

Re: Built-in CTYPE provider

2024-01-12 Thread Jeff Davis

On Fri, 2024-01-12 at 19:00 +0100, Daniel Verite wrote: > Another one is that version 12 broke \d in older psql by > removing pg_class.relhasoids. > > ISTM that in general the behavior of old psql vs new server does > not weigh much against choosing optimal catalog changes. > > There's also

Re: Built-in CTYPE provider

2024-01-12 Thread Robert Haas

On Fri, Jan 12, 2024 at 1:00 PM Daniel Verite wrote: > ISTM that in general the behavior of old psql vs new server does > not weigh much against choosing optimal catalog changes. +1. -- Robert Haas EDB: http://www.enterprisedb.com

Re: Built-in CTYPE provider

2024-01-12 Thread Daniel Verite

Jeff Davis wrote: > > Jeremy also raised a problem with old versions of psql connecting to > > a > > new server: the \l and \dO won't work. Not sure exactly what to do > > there, but I could work around it by adding a new field rather than > > renaming (though that's not ideal). > > I

Re: Built-in CTYPE provider

2024-01-12 Thread Jeff Davis

On Thu, 2024-01-11 at 18:02 -0800, Jeff Davis wrote: > Jeremy also raised a problem with old versions of psql connecting to > a > new server: the \l and \dO won't work. Not sure exactly what to do > there, but I could work around it by adding a new field rather than > renaming (though that's not

Re: Built-in CTYPE provider

2024-01-11 Thread Jeff Davis

On Tue, 2024-01-09 at 14:17 -0800, Jeremy Schneider wrote: > I think we missed something in psql, pretty sure I applied all the > patches but I see this error: > > =# \l > ERROR: 42703: column d.datlocale does not exist > LINE 8: d.datlocale as "Locale", > ^ > HINT: Perhaps you

Re: Built-in CTYPE provider

2024-01-11 Thread Jeff Davis

On Wed, 2024-01-10 at 23:56 +0100, Daniel Verite wrote: > $ bin/initdb --locale=C.UTF-8 --locale-provider=builtin -D/tmp/pgdata > > The database cluster will be initialized with this locale > configuration: > default collation provider: builtin > default collation locale: C.UTF-8 >

Re: Built-in CTYPE provider

2024-01-10 Thread Daniel Verite

Jeff Davis wrote: > Attached a more complete version that fixes a few bugs [v15 patch] When selecting the builtin provider with initdb, I'm getting the following setup: $ bin/initdb --locale=C.UTF-8 --locale-provider=builtin -D/tmp/pgdata The database cluster will be initialized

Re: Built-in CTYPE provider

2024-01-09 Thread Jeremy Schneider

On 1/9/24 2:31 PM, Jeff Davis wrote: > On Tue, 2024-01-09 at 14:17 -0800, Jeremy Schneider wrote: >> I think we missed something in psql, pretty sure I applied all the >> patches but I see this error: >> >> =# \l >> ERROR: 42703: column d.datlocale does not exist >> LINE 8: d.datlocale as

Re: Built-in CTYPE provider

2024-01-09 Thread Jeff Davis

On Mon, 2024-01-08 at 17:17 -0800, Jeremy Schneider wrote: > I agree with merging the threads, even though it makes for a larger > patch set. It would be great to get a unified "builtin" provider in > place for the next major. I believe that's possible and that this proposal is quite close

Re: Built-in CTYPE provider

2024-01-09 Thread Jeff Davis

On Tue, 2024-01-09 at 14:17 -0800, Jeremy Schneider wrote: > I think we missed something in psql, pretty sure I applied all the > patches but I see this error: > > =# \l > ERROR: 42703: column d.datlocale does not exist > LINE 8: d.datlocale as "Locale", > Thank you. I'll fix this in the

Re: Built-in CTYPE provider

2024-01-09 Thread Jeremy Schneider

On 12/28/23 6:57 PM, Jeff Davis wrote: > On Wed, 2023-12-27 at 17:26 -0800, Jeff Davis wrote: > Attached a more complete version that fixes a few bugs, stabilizes the > tests, and improves the documentation. I optimized the performance, too > -- now it's beating both libc's "C.utf8" and ICU

Re: Built-in CTYPE provider

2024-01-08 Thread Jeremy Schneider

On 12/28/23 6:57 PM, Jeff Davis wrote: > > Attached a more complete version that fixes a few bugs, stabilizes the > tests, and improves the documentation. I optimized the performance, too > -- now it's beating both libc's "C.utf8" and ICU "en-US-x-icu" for both > collation and case mapping

Re: Built-in CTYPE provider

2023-12-22 Thread Daniel Verite

Robert Haas wrote: > For someone who is currently defaulting to es_ES.utf8 or fr_FR.utf8, > a change to C.utf8 would be a much bigger problem, I would > think. Their alphabet isn't in code point order, and so things would > be alphabetized wrongly. > That might be OK if they don't care

Re: Built-in CTYPE provider

2023-12-21 Thread Jeff Davis

On Wed, 2023-12-20 at 15:47 -0800, Jeremy Schneider wrote: > One other thing that comes to mind: how does the parser do case > folding > for relation names? Is that using OS-provided libc as of today? Or > did > we code it to use ICU if that's the DB default? I'm guessing libc, > and > global

Re: Built-in CTYPE provider

2023-12-21 Thread Jeff Davis

On Wed, 2023-12-20 at 16:29 -0800, Jeremy Schneider wrote: > found some more. here's my running list of everything user-facing I > see > in core PG code so far that might involve case: > > * upper/lower/initcap > * regexp_*() and *_REGEXP() > * ILIKE, operators ~* !~* ~~ !~~ ~~* !~~* > * citext +

Re: Built-in CTYPE provider

2023-12-20 Thread Robert Haas

On Wed, Dec 20, 2023 at 5:57 PM Jeff Davis wrote: > Those locales all have the same ctype behavior. Sigh. I keep getting confused about how that works... -- Robert Haas EDB: http://www.enterprisedb.com

Re: Built-in CTYPE provider

2023-12-20 Thread Jeremy Schneider

On 12/20/23 4:04 PM, Jeremy Schneider wrote: > On 12/20/23 3:47 PM, Jeremy Schneider wrote: >> On 12/5/23 3:46 PM, Jeff Davis wrote: >>> CTYPE, which handles character classification and upper/lowercasing >>> behavior, may be simpler than it first appears. We may be able to get >>> a net decrease

Re: Built-in CTYPE provider

2023-12-20 Thread Jeremy Schneider

On 12/20/23 3:47 PM, Jeremy Schneider wrote: > On 12/5/23 3:46 PM, Jeff Davis wrote: >> CTYPE, which handles character classification and upper/lowercasing >> behavior, may be simpler than it first appears. We may be able to get >> a net decrease in complexity by just building in most (or perhaps

Re: Built-in CTYPE provider

2023-12-20 Thread Jeremy Schneider

On 12/5/23 3:46 PM, Jeff Davis wrote: > CTYPE, which handles character classification and upper/lowercasing > behavior, may be simpler than it first appears. We may be able to get > a net decrease in complexity by just building in most (or perhaps all) > of the functionality. > > === Character

Re: Built-in CTYPE provider

2023-12-20 Thread Jeff Davis

On Wed, 2023-12-20 at 14:24 -0500, Robert Haas wrote: > This makes sense to me, too, but it feels like it might work out > better for speakers of English than for speakers of other languages. There's very little in the way of locale-specific tailoring for ctype behaviors in ICU or glibc -- only

Re: Built-in CTYPE provider

2023-12-20 Thread Robert Haas

On Wed, Dec 20, 2023 at 2:13 PM Jeff Davis wrote: > On Wed, 2023-12-20 at 13:49 +0100, Daniel Verite wrote: > > If the Postgres default was bytewise sorting+locale-agnostic > > ctype functions directly derived from Unicode data files, > > as opposed to libc/$LANG at initdb time, the main > >

Re: Built-in CTYPE provider

2023-12-20 Thread Jeff Davis

On Wed, 2023-12-20 at 13:49 +0100, Daniel Verite wrote: > If the Postgres default was bytewise sorting+locale-agnostic > ctype functions directly derived from Unicode data files, > as opposed to libc/$LANG at initdb time, the main > annoyance would be that "ORDER BY textcol" would no > longer be

Re: Built-in CTYPE provider

2023-12-20 Thread Daniel Verite

Jeff Davis wrote: > But there are a lot of users for whom neither of those things are true, > and it makes zero sense to order all of the text indexes in the > database according to any one particular locale. I think these users > would prioritize stability and performance for the

Re: Built-in CTYPE provider

2023-12-19 Thread Jeff Davis

On Tue, 2023-12-19 at 15:59 -0500, Robert Haas wrote: > FWIW, the idea that we're going to develop a built-in provider seems > to be solid, for the reasons Jeff mentions: it can be stable, and > under our control. But it seems like we might need built-in providers > for everything rather than just

Re: Built-in CTYPE provider

2023-12-19 Thread Robert Haas

On Mon, Dec 18, 2023 at 2:46 PM Jeff Davis wrote: > The whole concept of "providers" is that they aren't consistent with > each other. ICU, libc, and the builtin provider will all be based on > different versions of Unicode. That's by design. > > The built-in provider will be a bit better in the

Re: Built-in CTYPE provider

2023-12-18 Thread Jeff Davis

On Fri, 2023-12-15 at 16:30 -0800, Jeremy Schneider wrote: > Looking closer, patches 3 and 4 look like an incremental extension of > this earlier idea; Yes, it's essentially the same thing extended to a few more files. I don't know if "incremental" is the right word though; this is a substantial

Re: Built-in CTYPE provider

2023-12-15 Thread Jeremy Schneider

On 12/13/23 5:28 AM, Jeff Davis wrote: > On Tue, 2023-12-12 at 13:14 -0800, Jeremy Schneider wrote: >> My biggest concern is around maintenance. Every year Unicode is >> assigning new characters to existing code points, and those existing >> code points can of course already be stored in old

Re: Built-in CTYPE provider

2023-12-14 Thread Jeff Davis

On Wed, 2023-12-13 at 16:34 +0100, Daniel Verite wrote: > In particular "el" (modern greek) has case mapping rules that > ICU seems to implement, but "el" is missing from the list > ("lt", "tr", and "az") you identified. I compared with glibc el_GR.UTF-8 and el_CY.UTF-8 locales, and the ctype

Re: Built-in CTYPE provider

2023-12-13 Thread Jeff Davis

On Wed, 2023-12-13 at 16:34 +0100, Daniel Verite wrote: > But there are CLDR mappings on top of that. I see, thank you. Would it still be called "full" case mapping to only use the mappings in SpecialCasing.txt? And would that be useful? Regards, Jeff Davis

Re: Built-in CTYPE provider

2023-12-13 Thread Daniel Verite

Jeff Davis wrote: > While "full" case mapping sounds more complex, there are actually > very few cases to consider and they are covered in another (small) > data file. That data file covers ~100 code points that convert to > multiple code points when the case changes (e.g. "ß" -> "SS"), 7
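The difference between full and simple case mapping for "ß" can be seen in a short Python sketch. Python's str.upper() applies full case mapping, so it expands "ß" to "SS". Python has no built-in simple mapping, so the helper below (a hypothetical name) only approximates one by refusing any expansion to multiple code points. That approximation is right for "ß" but is not guaranteed to match UnicodeData.txt for every character.

```python
def full_upper(text):
    """Full case mapping: one code point may expand to several."""
    return text.upper()


def simple_upper_approx(text):
    """Approximate simple case mapping: each code point maps to one code point.

    Assumption: when the full mapping would expand a character, keep the
    original character instead. This is an illustration, not a lookup in
    the real simple-mapping tables.
    """
    out = []
    for ch in text:
        up = ch.upper()
        out.append(up if len(up) == 1 else ch)
    return "".join(out)


if __name__ == "__main__":
    print(full_upper("straße"))           # STRASSE
    print(simple_upper_approx("straße"))  # STRAßE
```

Note that the full mapping changes the string length, which is one reason an implementation might prefer the simple mapping.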

Re: Built-in CTYPE provider

2023-12-13 Thread Jeff Davis

On Tue, 2023-12-12 at 13:14 -0800, Jeremy Schneider wrote: > My biggest concern is around maintenance. Every year Unicode is > assigning new characters to existing code points, and those existing > code points can of course already be stored in old databases before > libs > are updated. Is the

Re: Built-in CTYPE provider

2023-12-12 Thread Jeremy Schneider

On 12/5/23 3:46 PM, Jeff Davis wrote: > === Character Classification === > > Character classification is used for regexes, e.g. whether a character > is a member of the "[[:digit:]]" ("\d") or "[[:punct:]]" > class. Unicode defines what character properties map into these > classes in TR #18 [1],
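As a rough illustration of deriving regex character classes from Unicode properties, the Python sketch below classifies characters by general category. It assumes the standard (non-POSIX-compatible) definitions, where "[[:digit:]]" is general category Nd and "[[:punct:]]" is any punctuation category (P*). The function names are hypothetical, and results depend on the Unicode version bundled with the Python interpreter, not on any PostgreSQL table.

```python
import unicodedata


def is_digit(ch):
    """True if ch has general category Nd (Decimal_Number)."""
    return unicodedata.category(ch) == "Nd"


def is_punct(ch):
    """True if ch has a Punctuation general category (Pc, Pd, Ps, Pe, Pi, Pf, Po)."""
    return unicodedata.category(ch).startswith("P")


if __name__ == "__main__":
    for ch in ["7", "\u0663", "\u00b2", "!", "$"]:
        print(repr(ch), unicodedata.category(ch), is_digit(ch), is_punct(ch))
```

Under these definitions the Arabic-Indic digit U+0663 counts as a digit, the superscript two U+00B2 (category No) does not, and "$" (category Sc, a symbol) is not punctuation.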

71 matches
